Bug #45301 Recovery crash due to fetched page marked free in PIP
Submitted: 3 Jun 2009 12:00 Modified: 26 May 2010 17:47
Reporter: Olav Sandstå Email Updates:
Status: Unsupported Impact on me:
None 
Category:MySQL Server: Falcon storage engine Severity:S2 (Serious)
Version:6.0.12-alpha OS:Linux
Assigned to: Olav Sandstå CPU Architecture:Any
Tags: F_RECOVERY
Triage: Triaged: D1 (Critical)

[3 Jun 2009 12:00] Olav Sandstå
Description:
When running the RQG test falcon_pagesize2K recovery fails with the following error:

During recovery, fetched page 5909 tablespace 1 type 9 marked free in PIP

The call stack looks like:

/export/home/pb2/test/sb_1-561306-1243988868.72/mysql-6.0.12-alpha-linux-i686-test/libexec/mysqld(Error::debugBreak()+0x12) [0x8569fa8]
/export/home/pb2/test/sb_1-561306-1243988868.72/mysql-6.0.12-alpha-linux-i686-test/libexec/mysqld(Error::error(char const*, ...)+0x7a) [0x856a024]
/export/home/pb2/test/sb_1-561306-1243988868.72/mysql-6.0.12-alpha-linux-i686-test/libexec/mysqld(IO::readPage(Bdb*)+0x1ec) [0x85d566e]
/export/home/pb2/test/sb_1-561306-1243988868.72/mysql-6.0.12-alpha-linux-i686-test/libexec/mysqld(Cache::fetchPage(Dbb*, int, PageType, LockType)+0x36b) [0x8656e83]
/export/home/pb2/test/sb_1-561306-1243988868.72/mysql-6.0.12-alpha-linux-i686-test/libexec/mysqld(Dbb::fetchPage(int, PageType, LockType)+0x48) [0x85c7a84]
/export/home/pb2/test/sb_1-561306-1243988868.72/mysql-6.0.12-alpha-linux-i686-test/libexec/mysqld(DataPage::deleteOverflowPages(Dbb*, int, unsigned int)+0x8c) [0x865a87c]
/export/home/pb2/test/sb_1-561306-1243988868.72/mysql-6.0.12-alpha-linux-i686-test/libexec/mysqld(DataPage::updateRecord(Section*, int, Stream*, unsigned int, bool)+0x85) [0x865b0c1]
/export/home/pb2/test/sb_1-561306-1243988868.72/mysql-6.0.12-alpha-linux-i686-test/libexec/mysqld(Section::updateRecord(int, Stream*, unsigned int, bool)+0x3eb) [0x8625abd]
/export/home/pb2/test/sb_1-561306-1243988868.72/mysql-6.0.12-alpha-linux-i686-test/libexec/mysqld(Dbb::updateRecord(int, int, Stream*, unsigned int, bool)+0x49) [0x85c9fbd]
/export/home/pb2/test/sb_1-561306-1243988868.72/mysql-6.0.12-alpha-linux-i686-test/libexec/mysqld(SRLUpdateRecords::redo()+0x2a8) [0x861d5cc]
/export/home/pb2/test/sb_1-561306-1243988868.72/mysql-6.0.12-alpha-linux-i686-test/libexec/mysqld(SerialLog::recover()+0xd79) [0x862dea5]
/export/home/pb2/test/sb_1-561306-1243988868.72/mysql-6.0.12-alpha-linux-i686-test/libexec/mysqld(Database::openDatabase(char const*)+0x341) [0x85c033b]
/export/home/pb2/test/sb_1-561306-1243988868.72/mysql-6.0.12-alpha-linux-i686-test/libexec/mysqld(Connection::getDatabase(char const*, char const*, Threads*)+0x149) [0x85b3239]
/export/home/pb2/test/sb_1-561306-1243988868.72/mysql-6.0.12-alpha-linux-i686-test/libexec/mysqld(Connection::openDatabase(char const*, char const*, char const*, char const*, char const*, Threads*)+0x198) [0x85b4756]
/export/home/pb2/test/sb_1-561306-1243988868.72/mysql-6.0.12-alpha-linux-i686-test/libexec/mysqld(StorageDatabase::getOpenConnection()+0x86) [0x857850c]
/export/home/pb2/test/sb_1-561306-1243988868.72/mysql-6.0.12-alpha-linux-i686-test/libexec/mysqld(StorageHandler::initialize()+0x9f) [0x857abf5]
/export/home/pb2/test/sb_1-561306-1243988868.72/mysql-6.0.12-alpha-linux-i686-test/libexec/mysqld(StorageInterface::falcon_init(void*)+0x256) [0x85613d4]
/export/home/pb2/test/sb_1-561306-1243988868.72/mysql-6.0.12-alpha-linux-i686-test/libexec/mysqld(ha_initialize_handlerton(st_plugin_int*)+0xa3) [0x840cadf]
/export/home/pb2/test/sb_1-561306-1243988868.72/mysql-6.0.12-alpha-linux-i686-test/libexec/mysqld [0x84c479e]
/export/home/pb2/test/sb_1-561306-1243988868.72/mysql-6.0.12-alpha-linux-i686-test/libexec/mysqld(plugin_init(int*, char**, int)+0x695) [0x84c5d05]
/export/home/pb2/test/sb_1-561306-1243988868.72/mysql-6.0.12-alpha-linux-i686-test/libexec/mysqld [0x82c4f31]
/export/home/pb2/test/sb_1-561306-1243988868.72/mysql-6.0.12-alpha-linux-i686-test/libexec/mysqld(main+0x1de) [0x82c5eea]
/lib/libc.so.6(__libc_start_main+0xdc) [0x736dec]

How to repeat:
Seen once when running falcon_pagesize2K
[3 Jun 2009 12:04] Olav Sandstå
Parts of the call stack is similar to bug#44744 and bug#45297. Setting this to analyzing until having checked if this is the same as bug#44744 or bug#45297.
[5 Jun 2009 13:45] Olav Sandstå
Having re-run the recovery I see that the actual crash happens when we try to read the same page that was "marked as free in PIP". This happens since this block is after EOF. The output from the end of the recovery process is:

During recovery, fetched page 5909 tablespace 1 type 9 marked free in PIP
[Falcon] Error: pread on file /home/os136802/mysql/develop/bug/bug45301/a1/vardirs/master-data/falcon_user.fts from  page 5909 (offset 24203264) returned 0 bytes instead of 4096 (possible read behind EOF)

The size of the falcon user table space is 23982080 bytes which corresponds to 5855 blocks. So the block we try to read in about 50 blocks beyond the EOF.
[5 Jun 2009 14:06] Olav Sandstå
This seems to be a different bug than Bug#44744. In bug#44744 we seems to read invalid data from a data page while in this bug the page number for the overflow page we read from the data page is from within the valid area of the data page and it seems to be a reasonable page number (about 50 beyond EOF).

Setting this bug to verified.
[10 Jul 2009 12:07] Olav Sandstå
Here is a summary of what happens during recovery when this crash happens:

1. Recovery is processing an UpdateRecords log record (record id 1602). It finds the record in the Record Locator Page which points to DataPage 461.
2. DataPage 461 looks fine (ie. no evidence of curruption). In this page the record's slot says that the record is stored in an overflowPage (with page number 5909). So the first thing the code tires to do is to delete this overflow page (and then the problems starts)
3. The code tries to fetch this page from disk by calling Cache::fetchPage()
4. First indication of something wrong: Cache::fetchPage() checks that this page is marked as used in the PageInventoryPage. This results in the error written:

      "During recovery, fetched page 5909 tablespace 1 type 9 marked free in PIP"

5. The code "corrects" this by calling PageInventoryPage::markPageInUse()
6. Next indication of something wrong: When IO::readPage() tries to read the page it get an EOF error.

Conclusion: The overflow page does not exits in the database file. 

I have also done a scan of the database and found that there are also three other DataPages that refers to OverflowPage that are behind EOF (data pages refer to overflow pages 5902, 5909, 5913, 5929). The serial log (including log going back about three checkpoints) only contains SRLOverflowPages log records where the max overflow page size if 5854 (ie. all are within the existing database file).
[10 Jul 2009 12:16] Olav Sandstå
As this crash has not been possible to reproduce with the falcon_pagesize2K test here is the theory on what happened leading to the crash:

Just before the crash happened one of the Gophers executed the UpdateRecords log post. This did the following changes to the database buffer:

  1. Updated the main data page for the record.
  2. Allocated a brand new overflow page that at this time did not exists in the database file.
  3. Updated the page inventory page (due to allocating the new overflow page)

The gopher thread also wrote a SRLOverflowPages log record to the serial log which does not get flushed.

Immediately after a checkpoint starts. During the checkpoint MySQL get "killed". At the time it get killed it has written the datapage to disk but not the new overflow page nor the PIP page.

During recovery the SRLOverflowPages log record is not found in the serial log due to not having flushed it, so the PIP is not updated and the overflow page is not allocated during phase 2 of recovery. During phase three when we redo the UpdateRecords log record the crash occures due to we try to delete the overflow page that never got it to the disk before the crash.
[10 Jul 2009 12:19] Olav Sandstå
By instrumenting the Falcon code I have manged to make a "reproduction" that creates a almost identical crash. Here is what I do (slightly simplified):

1. Run a single transaction that updates 1000 records in a table. Each record consist of a key and a 4100 byte varchar field (initial size of the varchar field is 20 bytes) - so this increases the size of every record by about 4K.

2. When this transaction is executed it produces one UpdateRecords log record and one Commit log record. Serial log is flushed.

3. After the transaction has committed the Gopher executes the UpdateRecords log record. For each record that is updated the Gopher will do:

    a) allocate a new OverflowPage
    b) write a new SRLOverflowPages log record
    c) update the original data page

   Note: no flushing of the serial log take place here and no writing of the database buffer  takes place. But the size of the database have grown a lot due to allocating  a bunch of new OverflowPages.

4. Checkpoint  starts:  and  get KILLED before it has written out all  pages.

   Note: no flushing has occurred on the SerialLog since the 1000 SRLOverflowPages records got written - these are likely lost.

5. Recovery starts.

6.  Stage 2 recovery: no SRLOverflowPages are in the log

7. Stage 3 recovery (REDO): We redo the UpdateRecords log by doing the following:

      Section::updateRecord()
      DataPage::updateRecord(): if this DataPage was one of the pages that got written to the disk before MySQL got killed it will find that there is an overflow page for this record, so we try to delete that:
      DataPage::deleteOverflowPage(): we ask the Cache to fetch the page from disk:
      Cache::fetchPage(): ask the IO to read the page from disk
      IO::readPage(): if this page was newly allocated (which is likely) and was not written to the disk before MySQL got killed this now leads to a "fatal" read of a block that does not exits on disk - and recovery fails.
[10 Jul 2009 12:52] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/78392

2753 Olav Sandstaa	2009-07-10
      Fix for Bug#45301 Recovery crash due to fetched page marked free in PIP
      
      This crash occured during recovery due to at the time of the crash the main data page for a record had
      written to the data base file while the PIP and the newly allocatedoverflow page for the record had not 
      been written. The corresponding SRLOverflowPages log record was likely not flushed to the serial log.
      This situation happened likely due to MySQL being killed during a Falcon check point.
      
      This is now handled by extending the IO::readPage() so that an exception is thrown in the cases there
      is reads beyond end-of-file during recovery. This exception is caught in DataPage::deleteOverflowPages.
      If it happens during recovery, DataPage::deleteOverflowPages will handle the exception by marking the
      overflow page as free in the Page Inventory Page.
      
      Note that if there are more than one overflow page for this record we might have permanent loss of a few
      blocks in the database.
      
      The patch also fixes an missing handling of exceptions in Cache::fetchPage() where an IO excecption would
      lead to a BDB having a use count that never got released and which resulted in a crash during shutdown.
     @ storage/falcon/Cache.cpp
        In Cache::fetchPage: Add missing handling of exceptions that might occur during reading of pages. If an exception was thrown the old code would not do proper clean-up and the allocated BDB would still be allocated and have a use count. This would never be released, take a buffer slot forever and lead to a crash during normal shutdown due to the use count on the BDB. The changed code will handle the exception, release the allocated BDB and rethrow the exception.
     @ storage/falcon/DataPage.cpp
        Add handling of read errors that occures when trying the access an overflow page that should be deleted during recovery. If an IO error occurs we handle this by ignoring this overflow page and just mark it as free in the Page Inventory Page.
     @ storage/falcon/Dbb.cpp
        Add a method to be used for checking if we currently is doing recovery on the DBB.
     @ storage/falcon/Dbb.h
        Add a method to be used for checking if we currently is doing recovery on the DBB.
     @ storage/falcon/IO.cpp
        Change how a read error that occurs during recovery is handled:
        1. Do not set the "fatal error" status of the IO class
        2. Throw an exception also if it is read beyond end-of-file (instead of crashing)