Bug #28351 | Falcon: Crash recovery failure if the crash is during an alter table. | ||
---|---|---|---|
Submitted: | 10 May 2007 13:53 | Modified: | 31 Aug 2007 12:29 |
Reporter: | Peter Gulutzan | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Server: Falcon storage engine | Severity: | S3 (Non-critical) |
Version: | 6.0.1-alpha-debug | OS: | Linux (SUSE 10 64-bit) |
Assigned to: | Kevin Lewis | CPU Architecture: | Any |
[10 May 2007 13:53]
Peter Gulutzan
[11 May 2007 18:38]
MySQL Verification Team
Thank you for the bug report. Verified on FC 6.0 32-bit. (gdb) bt full #0 0x006c4402 in __kernel_vsyscall () No symbol table info available. #1 0x00bb40d1 in raise () from /lib/libpthread.so.0 No symbol table info available. #2 0x08512008 in Error::debugBreak () at Error.cpp:92 No locals. #3 0x08512069 in Error::error (string=0x890050c "read error on page %d of \"%s\": %s (%d)") at Error.cpp:69 buffer = "read error on page 8913152 of \"/home/miguel/dbs/5.1f/var/test.fts\": No such file or directory (2)\000\000\000````\002```\000\000\000\000````ô¿¤\000\002\000\000\000\b¥Å\n\224Àx´#a\223\000\021$£\000Ô%£", '\0' <repeats 13 times>, "\005\000\000\000ô¿¤\000ÔÀx´9ô\227\000\021$£\000Ô%£\000\005\000\000\000àÀx´à:\220\000\200\b\221\000HÎý·\001\000\000"... args = 0xb478c104 "" #4 0x0852abbf in IO::readPage (this=0xb7c062dc, bdb=0xb7f3aabc) at IO.cpp:168 <cut>
[16 May 2007 17:36]
Kevin Lewis
I can reproduce this easily on Windows. debugging...
[17 May 2007 3:40]
Kevin Lewis
I have reproduced a crash during recovery every time I have killed the engine while running this stored procedure. The procedure creates a file with 20 records. Then in a loop, it updates each record with random data, and does two alter tables. Most of the time is spent in the alter table code, so that every time I crash it, the call stack is somewhere in the alter table code. The error 1025 does not happen in my tests and has nothing to do with the failure to recover. The recovery crash happens in various places; Sometimes the page read is the wrong page type, sometimes the serial log record is invalid, yeilding wrong page numbers which cause asserts later. I am renaming the synopsis to reflect the true situation and raising the priority since this seems to indicate that a Falcon database is unrecoverable if the crash occurs during an alter table.
[24 May 2007 0:18]
Kevin Lewis
After several days of improving debug logging of the serial log so that I could see what was in it, I finally have the information I need to understand why this problem occurs. I have tracked it down to a difference between Section::insertStub() and Section::reInsertStub(). Section::insertStub takes into account two levels of section pages, the lower level is filled with RecordIndex entries and has 510 slots in a 4k page. The upper level is finnled with four-byte page numbers and contains 1020 slots in a 4k page. Section::reInsertStub() however seems to think it is inserting directly into a lower level SectionPage. It does not handle the upper level. Section::reInsertStub() is only called during recovery and happens after a section page is 'bumped' to a new incarnation during a drop table, which always happens during an AlterTable. Whenever I kill this test, it is always during an alter table since the 2 AlterTables take much longer than the updates in the test. So I need to make sure Section::reInsertStub() works on the same SectionPage structure as Section::insertStub()
[24 May 2007 4:17]
Kevin Lewis
Well, the last synopsys of the problem was not correct! Section::reInsertStub is working correctly(). It turns out that the upper level section page slot which contains the page number of the lower level section page (the RecordLocatorPage) was set to zero after the first record in the transaction was redone. That is because Section::storeRecord() decided that; /* Record doesn't fit on last used data page, make another */ So during recovery, the records are put onto different pages than originally done, and the section pages can have different contents. This bug may have something to do with the test using very large varchar fields. Still looking for the smoking gun, but I smell gunpowder...
[1 Jun 2007 6:45]
Kevin Lewis
Pushed a set of changes that prevent this crash from occurring. The crash was due to a section page that was redone during recovery, and then allocated as another data page. The reason it was re-allocated is because the page was marked as a free page in the PageInventoryPage. The problem was that Section::redoSectionPage() is called in pass 2 of the recovery, but SerialLog::redoFreePage() is called in pass 3 (the redo pass). The the section page was redone before it was freed. This changeset calls SerialLog::redoFreePage() in pass 2 instead of pass 3 and adds PageInventoryPage::markPageInUse() to Section::redoSectionPage() so that the page is no longer marked free. There are still problems encountered when Falcon is crashed during an alter table. I have seen both missing tables and missing indexes after the recovery. But these are mentioned in a similar bug (Bug#25017). Let's let this bug track the crash that occurs during recovery of a crashed alter table, and let Bug#25017 track missing indexes and tables. Another part of this changeset allows prevents a crash from occurring when the index is missing after recovery. The missing index is still a problem, which will be tracked by Bug#25017 If not, another alloc make use that page thinking it is free.
[29 Aug 2007 18:21]
Peter Gulutzan
I confirm that I can no longer reproduce the bug behaviour.
[31 Aug 2007 12:29]
MC Brown
A note has been added to the 6.0.2 changelog: Under certain situations the Falcon tables and log could become corrupt and prevent recovery from a crashed version of the files.