MySQL Bugs: #45301: Recovery crash due to fetched page marked free in PIP

Bug #45301	Recovery crash due to fetched page marked free in PIP
Submitted:	3 Jun 2009 12:00	Modified:	26 May 2010 17:47
Reporter:	Olav Sandstå	Email Updates:
Status:	Unsupported	Impact on me:	None
Category:	MySQL Server: Falcon storage engine	Severity:	S2 (Serious)
Version:	6.0.12-alpha	OS:	Linux
Assigned to:	Olav Sandstå	CPU Architecture:	Any
Tags:	F_RECOVERY

Description:
When running the RQG test falcon_pagesize2K recovery fails with the following error:

During recovery, fetched page 5909 tablespace 1 type 9 marked free in PIP

The call stack looks like:

/export/home/pb2/test/sb_1-561306-1243988868.72/mysql-6.0.12-alpha-linux-i686-test/libexec/mysqld(Error::debugBreak()+0x12) [0x8569fa8]
/export/home/pb2/test/sb_1-561306-1243988868.72/mysql-6.0.12-alpha-linux-i686-test/libexec/mysqld(Error::error(char const*, ...)+0x7a) [0x856a024]
/export/home/pb2/test/sb_1-561306-1243988868.72/mysql-6.0.12-alpha-linux-i686-test/libexec/mysqld(IO::readPage(Bdb*)+0x1ec) [0x85d566e]
/export/home/pb2/test/sb_1-561306-1243988868.72/mysql-6.0.12-alpha-linux-i686-test/libexec/mysqld(Cache::fetchPage(Dbb*, int, PageType, LockType)+0x36b) [0x8656e83]
/export/home/pb2/test/sb_1-561306-1243988868.72/mysql-6.0.12-alpha-linux-i686-test/libexec/mysqld(Dbb::fetchPage(int, PageType, LockType)+0x48) [0x85c7a84]
/export/home/pb2/test/sb_1-561306-1243988868.72/mysql-6.0.12-alpha-linux-i686-test/libexec/mysqld(DataPage::deleteOverflowPages(Dbb*, int, unsigned int)+0x8c) [0x865a87c]
/export/home/pb2/test/sb_1-561306-1243988868.72/mysql-6.0.12-alpha-linux-i686-test/libexec/mysqld(DataPage::updateRecord(Section*, int, Stream*, unsigned int, bool)+0x85) [0x865b0c1]
/export/home/pb2/test/sb_1-561306-1243988868.72/mysql-6.0.12-alpha-linux-i686-test/libexec/mysqld(Section::updateRecord(int, Stream*, unsigned int, bool)+0x3eb) [0x8625abd]
/export/home/pb2/test/sb_1-561306-1243988868.72/mysql-6.0.12-alpha-linux-i686-test/libexec/mysqld(Dbb::updateRecord(int, int, Stream*, unsigned int, bool)+0x49) [0x85c9fbd]
/export/home/pb2/test/sb_1-561306-1243988868.72/mysql-6.0.12-alpha-linux-i686-test/libexec/mysqld(SRLUpdateRecords::redo()+0x2a8) [0x861d5cc]
/export/home/pb2/test/sb_1-561306-1243988868.72/mysql-6.0.12-alpha-linux-i686-test/libexec/mysqld(SerialLog::recover()+0xd79) [0x862dea5]
/export/home/pb2/test/sb_1-561306-1243988868.72/mysql-6.0.12-alpha-linux-i686-test/libexec/mysqld(Database::openDatabase(char const*)+0x341) [0x85c033b]
/export/home/pb2/test/sb_1-561306-1243988868.72/mysql-6.0.12-alpha-linux-i686-test/libexec/mysqld(Connection::getDatabase(char const*, char const*, Threads*)+0x149) [0x85b3239]
/export/home/pb2/test/sb_1-561306-1243988868.72/mysql-6.0.12-alpha-linux-i686-test/libexec/mysqld(Connection::openDatabase(char const*, char const*, char const*, char const*, char const*, Threads*)+0x198) [0x85b4756]
/export/home/pb2/test/sb_1-561306-1243988868.72/mysql-6.0.12-alpha-linux-i686-test/libexec/mysqld(StorageDatabase::getOpenConnection()+0x86) [0x857850c]
/export/home/pb2/test/sb_1-561306-1243988868.72/mysql-6.0.12-alpha-linux-i686-test/libexec/mysqld(StorageHandler::initialize()+0x9f) [0x857abf5]
/export/home/pb2/test/sb_1-561306-1243988868.72/mysql-6.0.12-alpha-linux-i686-test/libexec/mysqld(StorageInterface::falcon_init(void*)+0x256) [0x85613d4]
/export/home/pb2/test/sb_1-561306-1243988868.72/mysql-6.0.12-alpha-linux-i686-test/libexec/mysqld(ha_initialize_handlerton(st_plugin_int*)+0xa3) [0x840cadf]
/export/home/pb2/test/sb_1-561306-1243988868.72/mysql-6.0.12-alpha-linux-i686-test/libexec/mysqld [0x84c479e]
/export/home/pb2/test/sb_1-561306-1243988868.72/mysql-6.0.12-alpha-linux-i686-test/libexec/mysqld(plugin_init(int*, char**, int)+0x695) [0x84c5d05]
/export/home/pb2/test/sb_1-561306-1243988868.72/mysql-6.0.12-alpha-linux-i686-test/libexec/mysqld [0x82c4f31]
/export/home/pb2/test/sb_1-561306-1243988868.72/mysql-6.0.12-alpha-linux-i686-test/libexec/mysqld(main+0x1de) [0x82c5eea]
/lib/libc.so.6(__libc_start_main+0xdc) [0x736dec]

How to repeat:
Seen once when running falcon_pagesize2K

Parts of the call stack is similar to bug#44744 and bug#45297. Setting this to analyzing until having checked if this is the same as bug#44744 or bug#45297.

Having re-run the recovery I see that the actual crash happens when we try to read the same page that was "marked as free in PIP". This happens since this block is after EOF. The output from the end of the recovery process is:

During recovery, fetched page 5909 tablespace 1 type 9 marked free in PIP
[Falcon] Error: pread on file /home/os136802/mysql/develop/bug/bug45301/a1/vardirs/master-data/falcon_user.fts from  page 5909 (offset 24203264) returned 0 bytes instead of 4096 (possible read behind EOF)

The size of the falcon user table space is 23982080 bytes which corresponds to 5855 blocks. So the block we try to read in about 50 blocks beyond the EOF.

This seems to be a different bug than Bug#44744. In bug#44744 we seems to read invalid data from a data page while in this bug the page number for the overflow page we read from the data page is from within the valid area of the data page and it seems to be a reasonable page number (about 50 beyond EOF).

Setting this bug to verified.

Here is a summary of what happens during recovery when this crash happens:

1. Recovery is processing an UpdateRecords log record (record id 1602). It finds the record in the Record Locator Page which points to DataPage 461.
2. DataPage 461 looks fine (ie. no evidence of curruption). In this page the record's slot says that the record is stored in an overflowPage (with page number 5909). So the first thing the code tires to do is to delete this overflow page (and then the problems starts)
3. The code tries to fetch this page from disk by calling Cache::fetchPage()
4. First indication of something wrong: Cache::fetchPage() checks that this page is marked as used in the PageInventoryPage. This results in the error written:

      "During recovery, fetched page 5909 tablespace 1 type 9 marked free in PIP"

5. The code "corrects" this by calling PageInventoryPage::markPageInUse()
6. Next indication of something wrong: When IO::readPage() tries to read the page it get an EOF error.

Conclusion: The overflow page does not exits in the database file. 

I have also done a scan of the database and found that there are also three other DataPages that refers to OverflowPage that are behind EOF (data pages refer to overflow pages 5902, 5909, 5913, 5929). The serial log (including log going back about three checkpoints) only contains SRLOverflowPages log records where the max overflow page size if 5854 (ie. all are within the existing database file).

As this crash has not been possible to reproduce with the falcon_pagesize2K test here is the theory on what happened leading to the crash:

Just before the crash happened one of the Gophers executed the UpdateRecords log post. This did the following changes to the database buffer:

  1. Updated the main data page for the record.
  2. Allocated a brand new overflow page that at this time did not exists in the database file.
  3. Updated the page inventory page (due to allocating the new overflow page)

The gopher thread also wrote a SRLOverflowPages log record to the serial log which does not get flushed.

Immediately after a checkpoint starts. During the checkpoint MySQL get "killed". At the time it get killed it has written the datapage to disk but not the new overflow page nor the PIP page.

During recovery the SRLOverflowPages log record is not found in the serial log due to not having flushed it, so the PIP is not updated and the overflow page is not allocated during phase 2 of recovery. During phase three when we redo the UpdateRecords log record the crash occures due to we try to delete the overflow page that never got it to the disk before the crash.

By instrumenting the Falcon code I have manged to make a "reproduction" that creates a almost identical crash. Here is what I do (slightly simplified):

1. Run a single transaction that updates 1000 records in a table. Each record consist of a key and a 4100 byte varchar field (initial size of the varchar field is 20 bytes) - so this increases the size of every record by about 4K.

2. When this transaction is executed it produces one UpdateRecords log record and one Commit log record. Serial log is flushed.

3. After the transaction has committed the Gopher executes the UpdateRecords log record. For each record that is updated the Gopher will do:

a) allocate a new OverflowPage
b) write a new SRLOverflowPages log record
c) update the original data page

Note: no flushing of the serial log take place here and no writing of the database buffer takes place. But the size of the database have grown a lot due to allocating a bunch of new OverflowPages.

4. Checkpoint starts: and get KILLED before it has written out all pages.

Note: no flushing has occurred on the SerialLog since the 1000 SRLOverflowPages records got written - these are likely lost.

5. Recovery starts.

6. Stage 2 recovery: no SRLOverflowPages are in the log

7. Stage 3 recovery (REDO): We redo the UpdateRecords log by doing the following:

Section::updateRecord()
DataPage::updateRecord(): if this DataPage was one of the pages that got written to the disk before MySQL got killed it will find that there is an overflow page for this record, so we try to delete that:
DataPage::deleteOverflowPage(): we ask the Cache to fetch the page from disk:
Cache::fetchPage(): ask the IO to read the page from disk
IO::readPage(): if this page was newly allocated (which is likely) and was not written to the disk before MySQL got killed this now leads to a "fatal" read of a block that does not exits on disk - and recovery fails.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

http://lists.mysql.com/commits/78392

2753 Olav Sandstaa 2009-07-10
Fix for Bug#45301 Recovery crash due to fetched page marked free in PIP

This crash occured during recovery due to at the time of the crash the main data page for a record had
written to the data base file while the PIP and the newly allocatedoverflow page for the record had not
been written. The corresponding SRLOverflowPages log record was likely not flushed to the serial log.
This situation happened likely due to MySQL being killed during a Falcon check point.

This is now handled by extending the IO::readPage() so that an exception is thrown in the cases there
is reads beyond end-of-file during recovery. This exception is caught in DataPage::deleteOverflowPages.
If it happens during recovery, DataPage::deleteOverflowPages will handle the exception by marking the
overflow page as free in the Page Inventory Page.

Note that if there are more than one overflow page for this record we might have permanent loss of a few
blocks in the database.

The patch also fixes an missing handling of exceptions in Cache::fetchPage() where an IO excecption would
lead to a BDB having a use count that never got released and which resulted in a crash during shutdown.
@ storage/falcon/Cache.cpp
In Cache::fetchPage: Add missing handling of exceptions that might occur during reading of pages. If an exception was thrown the old code would not do proper clean-up and the allocated BDB would still be allocated and have a use count. This would never be released, take a buffer slot forever and lead to a crash during normal shutdown due to the use count on the BDB. The changed code will handle the exception, release the allocated BDB and rethrow the exception.
@ storage/falcon/DataPage.cpp
Add handling of read errors that occures when trying the access an overflow page that should be deleted during recovery. If an IO error occurs we handle this by ignoring this overflow page and just mark it as free in the Page Inventory Page.
@ storage/falcon/Dbb.cpp
Add a method to be used for checking if we currently is doing recovery on the DBB.
@ storage/falcon/Dbb.h
Add a method to be used for checking if we currently is doing recovery on the DBB.
@ storage/falcon/IO.cpp
Change how a read error that occurs during recovery is handled:
1. Do not set the "fatal error" status of the IO class
2. Throw an exception also if it is read beyond end-of-file (instead of crashing)