Bug #10132 Crashing the server on corrupt InnoDB page is unhelpful
Submitted: 25 Apr 2005 5:00 Modified: 25 May 2005 21:25
Reporter: James Day
Status: Verified
Category:Server: InnoDB Severity:S4 (Feature request)
Version:5.1, 6.0 OS:Linux (Linux)
Assigned to: Heikki Tuuri Target Version:TBD
Triage: D5 (Feature request)

[25 Apr 2005 5:00] James Day
Description:
When InnoDB encounters a corrupt page it crashes the database server. In doing so, InnoDB
massively magnifies the effect of the problem, changing what is probably a single table
corruption problem to an issue which completely disables every database and table held by
the server. Net result: instead of disabling one table and affecting from under 1% to 55%
of traffic to the site, the chosen course completely removes service from one of the top
100 sites on the net and has similar negative consequences for every server which is not
dedicated to a single database but could instead continue to provide useful service while
the problem is fixed.

How to repeat:
Do anything which results in a checksum failure for an InnoDB page. Operate a place with
"make news headlines if completely unavailable for more than a short time" as its uptime
requirement.

Suggested fix:
Clearly, a corrupt page is a critical error requiring prompt attention from the DBA. The
same applies to MyISAM corruption issues, though marking the table crashed is far less
disruptive. Equally clearly, acting in a way which magnifies the effect of the problem is
highly undesirable for any production environment with any sort of uptime requirement.

To avoid the approach of crashing the server, I suggest providing some way to alert the
major client applications so they can display "Please tell your database administrator
that there is a damaged table which requires urgent repair work" or similar on startup.
Then all engines can use that approach to alert DBAs to critical issues and InnoDB could
mark the table as crashed and/or return a failure response code for the operation instead
of crashing the server and causing a complete loss of service.

Since some InnoDB corruption issues appear transient it could also be helpful for InnoDB
to wait one second and retry the read, then use the normal (not critical) error log to
report a possible hardware or operating system issue.

Not reporting which table is or may be involved is also unhelpful, leaving check table
and another crash as apparently the only way to identify where the problem is. Check
table for 120-180GB is best avoided. Even a half-guess to help to narrow down the problem
would be helpful. The practical effect of this is that immediately commencing a copy from
an intact slave or a restore from backup is far better than trying to identify and fix
the problem area - because complete loss of service while identifying it, without some
assurance of a restored service time, isn't tolerable compared to the certainty of a
restore or copy solution.

Please try very hard to avoid completely killing the database server when there is any
chance of leaving partial service and an opportunity for a desperate DBA to keep some
semblance of service going while the problem is dealt with.

Failing to start is OK until the recovery levels have been tried. At that point, there is
already a service loss. Even in this case, there should be an option to continue to start
and try to survive with as many crashed tables as necessary, so some service can be
retained until a more permanent fix can be arranged. Fail to serve half of all requests
if necessary. That's still better than failing to serve them all.

Severity 2 for this because all workarounds involve unnecessary major loss of service,
particularly when there is a large amount of data to be checked, copied or restored.
[25 Apr 2005 11:21] Heikki Tuuri
James,

nice meeting you at the UC 2005!

Also an undo log page, or even worse, an allocation bitmap page can become corrupt. Then
there is no associated table.

I think InnoDB currently does not know for sure in which index of which table a page is,
unless you use the innodb_file_per_table option.

Marking a corrupt table 'crashed' like in MyISAM, is something we could possibly do.

There is no builtin error return mechanism in InnoDB's B-tree operations. It is a fatal
error if it encounters a corrupt page. But we could mark the table as 'crashed', and
reboot mysqld, so that further queries would be blocked until the DBA marks the table as
ok again.

A couple of years ago, typical Linux file corruption was so severe that InnoDB would
probably seg fault if it accessed a corrupt page at all. And often file corruption
happens on many pages, so that crashes are inevitable.

It may be that corruption is less severe nowadays. As a first step, we could change the
InnoDB behavior and let it use a corrupt page. Hmm... then there is a risk of further
corruption. That is bad :(.

Note also that sometimes rebooting the computer fixes the file corruption, if it is only
present in the file cache of the OS. We should then remove the 'crashed' mark from the
table.

As you see, changing the behavior from the current one has also some downsides. I think
the MyISAM approach is the best. But we must let the DBA to remove the 'crashed' mark
also without running CHECK TABLE, because that can take hours.

The next question is where we can store that 'crashed' bit. The .frm file would be an
easy place. Another possibility is the InnoDB internal data dictionary. If the bit were
set, InnoDB would refuse to open the table at all, unless innodb_force_recovery is used.

The plan:

1. Find out if we can always recognize what table a corrupt page belongs to. If
innodb_file_per_table is used, at least then we know it for sure.

2. Put a bit in the InnoDB internal data dictionary, to mark the corruption of a table.

3. Refuse to open a table where the bit is set.

4. Give the DBA some SQL command to erase that bit.

This is something for 5.1 or 5.2.

Regards,

Heikki
[26 Apr 2005 0:24] James Day
Was good to meet you and others there as well - lots of productive discussions! Was
interesting to note how many people were recommending InnoDB as the engine of choice.
That's what I did as well, for anyone with reliability and availability needs, when asked
during the Wikipedia presentation. The handling of corrupt pages is the biggest problem
with that, which is why you're reading this report.:)

I like your use of the word "currently" for not knowing which table is affected.:) My
first reaction was telling the B-tree engine the type of page, database and table it is
being asked for. Then it can report and may have ways to attempt a partial repair or
crash-proof but "empty" and marked don't save page return, sufficient to allow data to be
dumped. Sadly, that is sure to be desirable for some users, who can be expected not to
have a backup or binlog. 

I wonder if it is possible to read enough data to give a good chance of purging the OS
cache before trying another read of the page? Still need to tell the DBA but it might
work tolerably well in a bad situation. Of course, people would inevitably start
reporting "InnoDB is slow" bugs because of this, without fixing the real cause.:(

The plan seems good. Very big improvement over current version, when bad things are
happening.