Bug #43656 | InnoDB data file becoming corrupt | ||
---|---|---|---|
Submitted: | 14 Mar 2009 20:14 | Modified: | 22 Dec 2009 16:55 |
Reporter: | Alex Rickabaugh | Email Updates: | |
Status: | No Feedback | Impact on me: | |
Category: | MySQL Server: InnoDB storage engine | Severity: | S2 (Serious) |
Version: | 5.1.31 | OS: | Linux (2.6.18-53.1.21.el5 (CentOS 5)) |
Assigned to: | CPU Architecture: | Any | |
Tags: | corruption, innodb, production |
[14 Mar 2009 20:14]
Alex Rickabaugh
[14 Mar 2009 20:18]
Alex Rickabaugh
Here is an example from the error log of db2 of corruption causing the server to restart. In this case, the restart was transparent and replication resumed without intervention being necessary.

090313  0:52:19  InnoDB: Page checksum 3491221571, prior-to-4.0.14-form checksum 2702929697
InnoDB: stored checksum 1343051649, prior-to-4.0.14-form stored checksum 2702929697
InnoDB: Page lsn 9 2618135786, low 4 bytes of lsn at page end 2618135786
InnoDB: Page number (if stored to page already) 1292485,
InnoDB: space id (if created with >= MySQL-4.1.1 and stored already) 0
InnoDB: Page may be an index page where index id is 0 412
InnoDB: (index "PRIMARY" of table "subeta_subeta"."logs_user_items")
InnoDB: Database page corruption on disk or a failed
InnoDB: file read of page 1292485.
InnoDB: You may have to recover from a backup.
InnoDB: It is also possible that your operating
InnoDB: system has corrupted its own file cache
InnoDB: and rebooting your computer removes the
InnoDB: error.
InnoDB: If the corrupt page is an index page
InnoDB: you can also try to fix the corruption
InnoDB: by dumping, dropping, and reimporting
InnoDB: the corrupt table. You can use CHECK
InnoDB: TABLE to scan your table for corruption.
InnoDB: See also http://dev.mysql.com/doc/refman/5.1/en/forcing-recovery.html
InnoDB: about forcing recovery.
InnoDB: Ending processing because of a corrupt database page.
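The checksum lines in that log are InnoDB comparing the checksum stored in the page against one recomputed from the page contents at read time; a mismatch like the one logged means the page was damaged on disk or torn mid-write. A minimal illustrative sketch of that detection pattern (CRC-32 stands in for InnoDB's own checksum function, and the page layout here is invented, not InnoDB's real on-disk format):

```python
import zlib

PAGE_SIZE = 16384  # InnoDB's default page size in 5.1

def make_page(payload: bytes) -> bytes:
    """Build a toy 'page': payload padded to size, with a CRC-32 of the
    body stored in the last 4 bytes (stand-in for InnoDB's trailer)."""
    body = payload.ljust(PAGE_SIZE - 4, b"\x00")
    checksum = zlib.crc32(body) & 0xFFFFFFFF
    return body + checksum.to_bytes(4, "big")

def page_is_corrupt(page: bytes) -> bool:
    """Recompute the checksum over the body and compare with the stored
    value, as InnoDB does whenever it reads a page from disk."""
    body, stored = page[:-4], int.from_bytes(page[-4:], "big")
    return (zlib.crc32(body) & 0xFFFFFFFF) != stored

page = make_page(b"row data")
assert not page_is_corrupt(page)

# Flip one byte, as a failed disk write might -- the mismatch is detected.
damaged = bytearray(page)
damaged[100] ^= 0xFF
assert page_is_corrupt(bytes(damaged))
```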
[16 Mar 2009 6:40]
Sveta Smirnova
Thank you for the report.

> at which point an automatic monitoring script restarts mysqld on db1

How does this script restart mysqld? What does it use: the mysql.server script, or something else?
[16 Mar 2009 13:04]
Alex Rickabaugh
When it detects that MySQL has stopped responding to simple queries efficiently, it first attempts a graceful restart (/etc/init.d/mysql restart). If this restart times out (the init.d script prints "gave up on waiting for mysqld to exit"), it sends SIGKILL to the mysqld and mysqld_safe processes and waits a few seconds before executing /etc/init.d/mysql start. This monitoring script only runs on db1, the master, however. The slave server is never restarted except through /etc/init.d/mysql restart, and it also experiences corruption. I would also assume that SIGKILL terminates the process much like a loss of power to the server would, and that the InnoDB data file should never be left in an inconsistent state as a result.
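The graceful-restart-then-SIGKILL fallback described above can be demonstrated on a stand-in process (the reporter's actual script is not shown; the trap/sleep stand-in and the timings here are illustrative):

```shell
# Stand-in for a wedged mysqld: a process that ignores SIGTERM,
# like a server stuck that the init script cannot stop gracefully.
sh -c 'trap "" TERM; exec sleep 300' &
pid=$!
sleep 1                        # give the trap time to be installed

kill -TERM "$pid"              # graceful attempt (the init script's path)
sleep 1
if kill -0 "$pid" 2>/dev/null; then
    # Still alive: fall back to SIGKILL, as the watchdog does after the
    # init script "gave up on waiting for mysqld to exit".
    kill -KILL "$pid"
fi
wait "$pid" 2>/dev/null || true
echo "process stopped"
```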
[17 Mar 2009 9:29]
Sveta Smirnova
Thank you for the feedback. Regarding SIGKILL, it is expected that you experience corruption problems. Try forcing InnoDB recovery to restore the database to a working state: http://dev.mysql.com/doc/refman/5.0/en/forcing-recovery.html

You said:

> This monitoring script only runs on db1, the master, however. The slave server is never
> restarted except through /etc/init.d/mysql restart, and it also experiences corruption.

That the slave corrupts is not good and could be a MySQL bug. Please provide the full error log from the slave.
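As the linked manual page describes, recovery is forced by starting mysqld with the innodb_force_recovery option set; a minimal my.cnf sketch (the value shown is the mildest level):

```ini
# my.cnf -- sketch of forced InnoDB recovery, per the manual page above.
# Start at 1 and raise the value only as far as needed (the maximum is 6);
# each level skips more of InnoDB's normal startup and recovery work.
[mysqld]
innodb_force_recovery = 1
```

Once the server starts in this mode, dump the affected tables with mysqldump, then drop and reimport them after removing the option.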
[25 Mar 2009 3:33]
Alex Rickabaugh
Hello Sveta, all. I've decided to do a clean test of this and provide the log file when we notice corruption. Tomorrow I will be dumping our database table by table and importing it into MySQL. I'll be doing this on our slave server, which I will then connect to the master to catch up on replication of the latest events. Then I will monitor the slave server, and as soon as I detect that InnoDB's data file has become corrupted, I will capture a complete snapshot of it and save it for your examination.
[25 Mar 2009 7:16]
Sveta Smirnova
Thank you for the update. We will wait for the results of your tests.
[6 Apr 2009 23:42]
Alex Rickabaugh
Hello again. Our slave server crashed yesterday, apparently as a result of corruption. The crash was triggered during the process of a routine hot mysqldump of the database. I have saved the error logs, ibdata1, iblog files, master.info, and the replication logs from the master (in the process of infinitely restarting, the server created tens of thousands of relay log files, so those ended up being purged). Please note that the slave server was stable and never killed or crashed (always gracefully restarted) until the corruption caused it to crash. I can arrange for MySQL developers to access this data if it will be helpful in tracking down the source of our corruption on the slave. I will also be at the MySQL Conference later this month if anyone wants to discuss this in person. -Alex
[7 Apr 2009 6:26]
Sveta Smirnova
Thank you for the feedback. Yes, it would be good if you are able to provide access to your files. Alternatively, you can upload them as an archive to our FTP server. See the "Files" tab for instructions.
[13 Apr 2009 14:20]
Mikhail Izioumtchenko
Could you tell us which version of MySQL the backup set, or the original dataset, was created with?
[13 Apr 2009 14:43]
Alex Rickabaugh
The backup set was imported with mysql 5.1.25-rc.
[13 Apr 2009 19:01]
Mikhail Izioumtchenko
> Because it occurs in both servers we've decided that hardware issues are most likely not the cause.

Actually, a strong case can be made for the exact opposite as well, considering those are identical servers. I'd suggest looking into the SSDs, as it's a relatively new technology. A few kinds of tests could be performed:

1. Try to break MySQL/InnoDB using your workload and configuration but with regular HDs.
2. Try to break your SSDs using your workload if possible, or a similar one, with MyISAM, or just stress-test the SSDs through a regular filesystem.
3. Another area to look at is the RAID controller. What if you disable RAID altogether and see whether the corruption happens with just a plain SSD or HD?

Regarding your question about kill -9: it is milder than a server powerdown. kill -9 only kills the process, but the OS and the hardware are still supposed to do the cleanup: finish pending writes, close files, etc.
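Mikhail's point that kill -9 is milder than a power loss can be demonstrated: data a process has already handed to the kernel via write() survives SIGKILL, whereas a power loss can drop writes the disk had not yet persisted. A minimal POSIX-only sketch (the file name is invented for illustration):

```python
import os
import signal
import tempfile

# A child process writes a row through the kernel (os.write, no userspace
# buffering), then is hard-killed mid-run. Because the kernel already holds
# the data, the file still contains it afterwards.
path = os.path.join(tempfile.mkdtemp(), "ibdata_toy")

pid = os.fork()
if pid == 0:  # child
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    os.write(fd, b"committed row\n")      # handed to the kernel
    os.kill(os.getpid(), signal.SIGKILL)  # simulate the watchdog's kill -9

os.waitpid(pid, 0)
with open(path, "rb") as f:
    print(f.read())   # the write survived the SIGKILL
```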
[14 Apr 2009 23:53]
Alex Rickabaugh
Mikhail, you may have a point here. I am not sure which model of SSD is used in our servers (they are dedicated hosting), but I agree that it's possible that the disks or RAID controller (or the combination) are causing different corruption on both systems. As far as trying to crash goes, we don't really have to try; it just happens. I'm actually having to catch up a backup copy of our database that was imported from scratch again, because both the master and slave databases will crash the server within 15 seconds when loaded. I agree that this needs to be tested, though. We're going to create an Amazon EC2 instance server and stream replication logs to it, and see if it experiences the same issues as our local slave server. If it works, we'll both know what the problem is and have a constant, clean backup. :) I'll let you know when the results are in. -Alex
[22 Nov 2009 16:55]
Valeriy Kravchuk
Do you have any further comments on this bug and possible reasons?
[23 Dec 2009 0:00]
Bugs System
No feedback was provided for this bug for over a month, so it is being suspended automatically. If you are able to provide the information that was originally requested, please do so and change the status of the bug back to "Open".