Bug #79674 | InnoDB table space corruption prevents MySQL server from starting | ||
---|---|---|---|
Submitted: | 16 Dec 2015 16:08 | Modified: | 17 Jul 2018 7:36 |
Reporter: | Dennis Buteyn | Email Updates: | |
Status: | Can't repeat | Impact on me: | |
Category: | MySQL Server: InnoDB storage engine | Severity: | S1 (Critical) |
Version: | 5.5.46 | OS: | Any |
Assigned to: | CPU Architecture: | Any | |
Tags: | corrupt, crash, innodb |
[16 Dec 2015 16:08]
Dennis Buteyn
[16 Dec 2015 16:09]
Dennis Buteyn
MySQL error log
Attachment: error.log (text/x-log), 52.44 KiB.
[16 Dec 2015 16:10]
Dennis Buteyn
Diagnostic utility output of the affected page
Attachment: page-1225.log (text/x-log), 286 bytes.
[20 Dec 2015 13:34]
Dennis Buteyn
I've finished analysing the ibdata1 file and could not find any problems with it. All records on the specified page are in order, the record directory is also proper. I do not know what is wrong with the code below. It would be very appreciated to get some kind of response as I do not know what further information I need to provide. page = page_align(rec); first_slot = page_dir_get_nth_slot(page, 0); slot = page_dir_get_nth_slot(page, page_dir_get_n_slots(page) - 1); if (page_is_comp(page)) { while (rec_get_n_owned_new(r) == 0) { r = rec_get_next_ptr_const(r, TRUE); // ... } } else { // ... } rec_offs_bytes = mach_encode_2(r - page); while (UNIV_LIKELY(*(uint16*) slot != rec_offs_bytes)) { if (UNIV_UNLIKELY(slot == first_slot)) { // Reaches here and asserts } }
[20 Dec 2015 13:35]
Dennis Buteyn
Detailed output of all records on the affected page
Attachment: records.log (text/x-log), 92.02 KiB.
[20 Dec 2015 13:47]
Dennis Buteyn
I have also found the cause of the discrepancy I found previously in the error log and created a separate bug report for it: http://bugs.mysql.com/bug.php?id=79713
[12 Jan 2016 9:47]
Dennis Buteyn
Bug found found in multiple systems.
[14 Mar 2018 15:04]
MySQL Verification Team
For what it's worth, I found one anomaly on Win64, maybe you can look if similar issue may occur on your system? https://bugs.mysql.com/bug.php?id=81859
[13 Jul 2018 12:29]
MySQL Verification Team
Hi, First , please reply to the question posed by my colleague Shane Bester. Second, switching power off requires that entire system is setup as ACID one, not just MySQL. First of all, all ACID parameters of InnoDB, like O_DIRECT etc, should be used. Next, your disk controller should have its. cache turned off. Your disk must have its cache turned of and your entire filesystem must have caching turned off on Linux. I do not think that this is possible on Linux, but it is possible on Solaris. As I wrote, your ACID settings in InnoDB should be set at maximum security level. You also must try to mount your filesystem without any cache. You should also check out whether your kernel and filesystem support O_DIRECT or any similar settings.
[15 Jul 2018 7:55]
Dennis Buteyn
I do not believe the bug that Shane referenced is related, considering it has preprocessor blocks that are specific for Windows. Moreover the bug I reported specifically states that the file is probably damaged in some way, but not by normal operation. Sinisia, while I understand the requirements for perfect and reliable operation, this is not what this bug report is about. This bug report is about when damage does occurs (after all, we do not live in a perfect world), ALL data is lost AND MySQL enters an unrecoverable state. For our use case, we can live with a couple rows damaged/lost as these rows are most likely the ones changed or added recently. In our case, they just hold statistics and logging information. What we cannot live with is a system that becomes completely inoperable by MySQL entering an infinite restart loop. Requiring that we must use battery backups, disable all write caching (have fun getting ANY performance out of your storage media with that) just sounds like pointing fingers. Yes, the file is corrupt. Yes, it occurred due to non-ideal conditions. Yes, you should probably do all those things if you have really really valuable data. I never denied that this happened under less-than-ideal conditions. However, these requirements do no apply when you do not value your data that much. As stated previously, we do not care if some of it is lost. We do care when ALL of it is lost. We care more when the database service itself is lost. Somehow an assertion was hit. In other words, a fault was detect but there was no fault handling. No amount of ACID-compliance will handle a fault for which no code exists. Claiming you cannot reproduce the conditions of an assertion after two years is insulting. Any competent programmer can glean even from the code posted in my previous comments how to reproduce it. I wish I was lucky enough to get bug reports as detailed as this one, usually I have to work with a screen shot of an error message or less! So let's see how unreproducible this bug is. The first condition states: while (*slot != rec_offs_bytes) where: slot = page_dir_get_nth_slot(page, page_dir_get_n_slots(page) - 1) and: rec_offs_bytes = mach_encode_2(r - page) This appears to be a loop over all slots within some page, attempting to locate some record. Not very interesting. Now for the second condition: if (slot == first_slot) where: first_slot = page_dir_get_nth_slot(page, 0) In other words: If the slot being examined is the first slot of that page, crash and burn. What I think happened is that the page directory is messed up and page_dir_get_nth_slot is returning nonsense. So here is the million dollar question: Is it acceptable for MySQL to fail completely when one or more pages are lost? If the answer is yes, you may want to reconsider your target audience. Bad things happen and stuff does break, no matter how hard you try. The difference lies in how such situations are handled.
[16 Jul 2018 12:20]
MySQL Verification Team
Hi, Yes, we know that we are not living in the perfect world. However, we can not control operating system settings, filesystem settings or the hardware. Corruption that you experience is due to the fact that you have not setup your host for ACID performance. You also might not have set all InnoDB parameters so that loss of data is minimised. As an example of the latter, 5.5 still supports raw disk partitions. Hence, if you would set all OS / hardware parameters properly, disabled Innodb file-per-table option and use raw disk partitions instead of the filesystem(s). Needless to say, if you setup your system and MySQL to be 100 % ACID, your performance will suffer. There is no balance here. You either do frequent backups with a risk to suffer total loss of data, or you have full ACID compatibility with no loss of data. However, as the things are as you describe, there is nothing we can do regarding the amount of the data loss. Anyway, most of the above is described in our Reference Manual, in the chapter about InnoDB.
[17 Jul 2018 7:36]
Dennis Buteyn
I see... So total loss of data and complete denial of service is acceptable due to some flipped bits in the page directory. I cannot argue with that logic. For what its worth, at the time we changed the engine for the table causing the issue to MyISAM and had no further failures. Not only did MyISAM give us a more stable system, performance was improved as well. Thank you for your time, you can close this S1 severity bug (raised after re-reading the definitions) as "working as intended".