MySQL Bugs: #10132: Crashing the server on corrupt InnoDB page is unhelpful

Bug #10132	Crashing the server on corrupt InnoDB page is unhelpful
Submitted:	25 Apr 2005 3:00	Modified:	23 Jan 2014 12:37
Reporter:	James Day	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: InnoDB storage engine	Severity:	S1 (Critical)
Version:	5.1, 6.0	OS:	Linux (Linux)
Assigned to:	Assigned Account	CPU Architecture:	Any

Description:
When InnoDB encounters a corrupt page it crashes the database server. In doing so, InnoDB massively magnifies the effect of the problem, changing what is probably a single table corruption problem to an issue which completely disables every database and table held by the server. Net result: instead of disabling one table and affecting from under 1% to 55% of traffic to the site, the chosen course completely removes service from one of the top 100 sites on the net and has similar negative consequences for every server which is not dedicated to a single database but could instead continue to provide useful service while the problem is fixed.

How to repeat:
Do anything which results in a checksum failure for an InnoDB page. Operate a place with "make news headlines if completely unavailable for more than a short time" as its uptime requirement.

Suggested fix:
Clearly, a corrupt page is a critical error requiring prompt attention from the DBA. The same applies to MyISAM corruption issues, though marking the table crashed is far less disruptive. Equally clearly, acting in a way which magnifies the effect of the problem is highly undesirable for any production environment with any sort of uptime requirement.

To avoid the approach of crashing the server, I suggest providing some way to alert the major client applications so they can display "Please tell your database administrator that there is a damaged table which requires urgent repair work" or similar on startup. Then all engines can use that approach to alert DBAs to critical issues and InnoDB could mark the table as crashed and/or return a failure response code for the operation instead of crashing the server and causing a complete loss of service.

Since some InnoDB corruption issues appear transient it could also be helpful for InnoDB to wait one second and retry the read, then use the normal (not critical) error log to report a possible hardware or operating system issue.

Not reporting which table is or may be involved is also unhelpful, leaving check table and another crash as apparently the only way to identify where the problem is. Check table for 120-180GB is best avoided. Even a half-guess to help to narrow down the problem would be helpful. The practical effect of this is that immediately commencing a copy from an intact slave or a restore from backup is far better than trying to identify and fix the problem area - because complete loss of service while identifying it, without some assurance of a restored service time, isn't tolerable compared to the certainty of a restore or copy solution.

Please try very hard to avoid completely killing the database server when there is any chance of leaving partial service and an opportunity for a desperate DBA to keep some semblance of service going while the problem is dealt with.

Failing to start is OK until the recovery levels have been tried. At that point, there is already a service loss. Even in this case, there should be an option to continue to start and try to survive with as many crashed tables as necessary, so some service can be retained until a more permanent fix can be arranged. Fail to serve half of all requests if necessary. That's still better than failing to serve them all.

Severity 2 for this because all workarounds involve unnecessary major loss of service, particularly when there is a large amount of data to be checked, copied or restored.

James,

nice meeting you at the UC 2005!

Also an undo log page, or even worse, an allocation bitmap page can become corrupt. Then there is no associated table.

I think InnoDB currently does not know for sure in which index of which table a page is, unless you use the innodb_file_per_table option.

Marking a corrupt table 'crashed' like in MyISAM, is something we could possibly do.

There is no builtin error return mechanism in InnoDB's B-tree operations. It is a fatal error if it encounters a corrupt page. But we could mark the table as 'crashed', and reboot mysqld, so that further queries would be blocked until the DBA marks the table as ok again.

A couple of years ago, typical Linux file corruption was so severe that InnoDB would probably seg fault if it accessed a corrupt page at all. And often file corruption happens on many pages, so that crashes are inevitable.

It may be that corruption is less severe nowadays. As a first step, we could change the InnoDB behavior and let it use a corrupt page. Hmm... then there is a risk of further corruption. That is bad :(.

Note also that sometimes rebooting the computer fixes the file corruption, if it is only present in the file cache of the OS. We should then remove the 'crashed' mark from the table.

As you see, changing the behavior from the current one has also some downsides. I think the MyISAM approach is the best. But we must let the DBA to remove the 'crashed' mark also without running CHECK TABLE, because that can take hours.

The next question is where we can store that 'crashed' bit. The .frm file would be an easy place. Another possibility is the InnoDB internal data dictionary. If the bit were set, InnoDB would refuse to open the table at all, unless innodb_force_recovery is used.

The plan:

1. Find out if we can always recognize what table a corrupt page belongs to. If innodb_file_per_table is used, at least then we know it for sure.

2. Put a bit in the InnoDB internal data dictionary, to mark the corruption of a table.

3. Refuse to open a table where the bit is set.

4. Give the DBA some SQL command to erase that bit.

This is something for 5.1 or 5.2.

Regards,

Heikki

Was good to meet you and others there as well - lots of productive discussions! Was interesting to note how many people were recommending InnoDB as the engine of choice. That's what I did as well, for anyone with reliability and availability needs, when asked during the Wikipedia presentation. The handling of corrupt pages is the biggest problem with that, which is why you're reading this report.:)

I like your use of the word "currently" for not knowing which table is affected.:) My first reaction was telling the B-tree engine the type of page, database and table it is being asked for. Then it can report and may have ways to attempt a partial repair or crash-proof but "empty" and marked don't save page return, sufficient to allow data to be dumped. Sadly, that is sure to be desirable for some users, who can be expected not to have a backup or binlog. 

I wonder if it is possible to read enough data to give a good chance of purging the OS cache before trying another read of the page? Still need to tell the DBA but it might work tolerably well in a bad situation. Of course, people would inevitably start reporting "InnoDB is slow" bugs because of this, without fixing the real cause.:(

The plan seems good. Very big improvement over current version, when bad things are happening.

One useful option here would be to skip damaged pages that are in tables. Is that viable? Then we could export all readable pages in a table during disaster recovery. That's what we try to do today anyway, we just have to jump through hoops (ORDER BY PK DESC or secondary keys) to do it instead of using simple mysqldump.

Or perhaps a save and erase page command, saves page to disk, replaces all contents with an empty page so no error is produced.

Bug #56373 is a duplicate (special case) of this one.

see bug #59156

I believe that this bug has to be fixed before Bug #59385 (mysqld crashes in various ways when starting to run into OS open files limit) can be fixed.

As Bug #59385 was marked a duplicate of this one, please make sure that when this bug (#10132) is fixed, any issues reported in Bug #59385 are fixed also. If not, please re-open that bug.

Bug #65485 was marked as duplicate of this one.

Which bug is this a duplicate of? Will it ever be fixed?

Sorry, I think that this bug was incorrectly closed as a duplicate.

Jimmy Yang fixed an internally filed bug that implemented some infrastructure that would allow us to flag a table or index corrupted.
That was a prerequisite for fixing this bug, but it did not fix this bug.

When we read a corrupted page, we currently simply crash the server. Instead, we should propagate the error all way up to the actual client connection. If it was an index page, we should mark the index tree corrupted. If it was a page in the clustered index, we should mark the entire table corrupted. If it was some other page, we should maybe return some warning and mark the whole tablespace file read-only. It could also make sense to make the file (or the whole database instance) read-only. This choice (go read-only, abort, or try to carry on read-write) could be controlled by some configuration parameter.

Likewise, when writing to a file, we have different options. If a redo log write fails, we probably should not crash, but instead try to flush the dirty pages from the buffer pool, so that the database can be recovered without the corrupted redo log.

On top of this there are some "internal" issues discussed in Bug#59385: what we should do when the InnoDB internal bookkeeping says that we have run out of space in a fixed-size tablespace, or if the internal bookkeeping is corrupted.

As you can see, a complete fix will be a huge amount of work. Unfortunately, I cannot promise a complete fix any time soon.

Marko, great ideas. This rocks. Looking forward to a day when all this is added. This sort of feature would definitely be in my "Top 3 most wanted features" vote. I expect many others to think likewise. Can this be blueprinted (maybe even publicly), community feedback be sought, and then triaged for implementation?

A duplicate bug was reported:
Bug#72645 INNODB SHOULD ABORT WHEN A PAGE IN SYSTEM, UNDO OR TEMP TABLESPACE IS CORRUPTTED

There's really no one perfect way to respond to this because the desired response depends on the environment:

1: highly redundant systems. These places will tend to want an immediate total failure because their strategy tends to be to replace or reclone a slave.

2: less highly redundant or shared hosting. These places will tend to want to continue limited service to the maximum extent possible.

The first set of places may have code to handle a propagated failure but probably won't care. The second set probably won't have very good error handling code, in general.

The kill the server immediately and use innodb_force_recovery modes is the way to go for set 1, for set 2, not. So a flag to pick between the two ways is desirable if we're to handle the range of user needs.

James Day, MySQL Senior Principal Support Engineer, Oracle