Bug #45301 Recovery crash due to fetched page marked free in PIP
Submitted: 3 Jun 2009 12:00 Modified: 26 May 2010 17:47
Reporter: Olav Sandstaa Email Updates:
Status: Unsupported
Category:Server: Falcon Severity:S2 (Serious)
Version:6.0.12-alpha OS:Linux
Assigned to: Olav Sandstaa Target Version:
Tags: F_RECOVERY
Triage: Triaged: D1 (Critical)

[3 Jun 2009 12:00] Olav Sandstaa
Description:
When running the RQG test falcon_pagesize2K recovery fails with the following error:

During recovery, fetched page 5909 tablespace 1 type 9 marked free in PIP

The call stack looks like:

/export/home/pb2/test/sb_1-561306-1243988868.72/mysql-6.0.12-alpha-linux-i686-test/libexec/mysqld(Error::debugBreak()+0x12)
[0x8569fa8]
/export/home/pb2/test/sb_1-561306-1243988868.72/mysql-6.0.12-alpha-linux-i686-test/libexec/mysqld(Error::error(char
const*, ...)+0x7a) [0x856a024]
/export/home/pb2/test/sb_1-561306-1243988868.72/mysql-6.0.12-alpha-linux-i686-test/libexec/mysqld(IO::readPage(Bdb*)+0x1ec)
[0x85d566e]
/export/home/pb2/test/sb_1-561306-1243988868.72/mysql-6.0.12-alpha-linux-i686-test/libexec/mysqld(Cache::fetchPage(Dbb*,
int, PageType, LockType)+0x36b) [0x8656e83]
/export/home/pb2/test/sb_1-561306-1243988868.72/mysql-6.0.12-alpha-linux-i686-test/libexec/mysqld(Dbb::fetchPage(int,
PageType, LockType)+0x48) [0x85c7a84]
/export/home/pb2/test/sb_1-561306-1243988868.72/mysql-6.0.12-alpha-linux-i686-test/libexec/mysqld(DataPage::deleteOverflowPages(Dbb*,
int, unsigned int)+0x8c) [0x865a87c]
/export/home/pb2/test/sb_1-561306-1243988868.72/mysql-6.0.12-alpha-linux-i686-test/libexec/mysqld(DataPage::updateRecord(Section*,
int, Stream*, unsigned int, bool)+0x85) [0x865b0c1]
/export/home/pb2/test/sb_1-561306-1243988868.72/mysql-6.0.12-alpha-linux-i686-test/libexec/mysqld(Section::updateRecord(int,
Stream*, unsigned int, bool)+0x3eb) [0x8625abd]
/export/home/pb2/test/sb_1-561306-1243988868.72/mysql-6.0.12-alpha-linux-i686-test/libexec/mysqld(Dbb::updateRecord(int,
int, Stream*, unsigned int, bool)+0x49) [0x85c9fbd]
/export/home/pb2/test/sb_1-561306-1243988868.72/mysql-6.0.12-alpha-linux-i686-test/libexec/mysqld(SRLUpdateRecords::redo()+0x2a8)
[0x861d5cc]
/export/home/pb2/test/sb_1-561306-1243988868.72/mysql-6.0.12-alpha-linux-i686-test/libexec/mysqld(SerialLog::recover()+0xd79)
[0x862dea5]
/export/home/pb2/test/sb_1-561306-1243988868.72/mysql-6.0.12-alpha-linux-i686-test/libexec/mysqld(Database::openDatabase(char
const*)+0x341) [0x85c033b]
/export/home/pb2/test/sb_1-561306-1243988868.72/mysql-6.0.12-alpha-linux-i686-test/libexec/mysqld(Connection::getDatabase(char
const*, char const*, Threads*)+0x149) [0x85b3239]
/export/home/pb2/test/sb_1-561306-1243988868.72/mysql-6.0.12-alpha-linux-i686-test/libexec/mysqld(Connection::openDatabase(char
const*, char const*, char const*, char const*, char const*, Threads*)+0x198) [0x85b4756]
/export/home/pb2/test/sb_1-561306-1243988868.72/mysql-6.0.12-alpha-linux-i686-test/libexec/mysqld(StorageDatabase::getOpenConnection()+0x86)
[0x857850c]
/export/home/pb2/test/sb_1-561306-1243988868.72/mysql-6.0.12-alpha-linux-i686-test/libexec/mysqld(StorageHandler::initialize()+0x9f)
[0x857abf5]
/export/home/pb2/test/sb_1-561306-1243988868.72/mysql-6.0.12-alpha-linux-i686-test/libexec/mysqld(StorageInterface::falcon_init(void*)+0x256)
[0x85613d4]
/export/home/pb2/test/sb_1-561306-1243988868.72/mysql-6.0.12-alpha-linux-i686-test/libexec/mysqld(ha_initialize_handlerton(st_plugin_int*)+0xa3)
[0x840cadf]
/export/home/pb2/test/sb_1-561306-1243988868.72/mysql-6.0.12-alpha-linux-i686-test/libexec/mysqld
[0x84c479e]
/export/home/pb2/test/sb_1-561306-1243988868.72/mysql-6.0.12-alpha-linux-i686-test/libexec/mysqld(plugin_init(int*,
char**, int)+0x695) [0x84c5d05]
/export/home/pb2/test/sb_1-561306-1243988868.72/mysql-6.0.12-alpha-linux-i686-test/libexec/mysqld
[0x82c4f31]
/export/home/pb2/test/sb_1-561306-1243988868.72/mysql-6.0.12-alpha-linux-i686-test/libexec/mysqld(main+0x1de)
[0x82c5eea]
/lib/libc.so.6(__libc_start_main+0xdc) [0x736dec]

How to repeat:
Seen once when running falcon_pagesize2K
[3 Jun 2009 12:04] Olav Sandstaa
Parts of the call stack is similar to bug#44744 and bug#45297. Setting this to analyzing
until having checked if this is the same as bug#44744 or bug#45297.
[5 Jun 2009 13:45] Olav Sandstaa
Having re-run the recovery I see that the actual crash happens when we try to read the
same page that was "marked as free in PIP". This happens since this block is after EOF.
The output from the end of the recovery process is:

During recovery, fetched page 5909 tablespace 1 type 9 marked free in PIP
[Falcon] Error: pread on file
/home/os136802/mysql/develop/bug/bug45301/a1/vardirs/master-data/falcon_user.fts from 
page 5909 (offset 24203264) returned 0 bytes instead of 4096 (possible read behind EOF)

The size of the falcon user table space is 23982080 bytes which corresponds to 5855
blocks. So the block we try to read in about 50 blocks beyond the EOF.
[5 Jun 2009 14:06] Olav Sandstaa
This seems to be a different bug than Bug#44744. In bug#44744 we seems to read invalid
data from a data page while in this bug the page number for the overflow page we read
from the data page is from within the valid area of the data page and it seems to be a
reasonable page number (about 50 beyond EOF).

Setting this bug to verified.
[10 Jul 2009 12:07] Olav Sandstaa
Here is a summary of what happens during recovery when this crash happens:

1. Recovery is processing an UpdateRecords log record (record id 1602). It finds the
record in the Record Locator Page which points to DataPage 461.
2. DataPage 461 looks fine (ie. no evidence of curruption). In this page the record's
slot says that the record is stored in an overflowPage (with page number 5909). So the
first thing the code tires to do is to delete this overflow page (and then the problems
starts)
3. The code tries to fetch this page from disk by calling Cache::fetchPage()
4. First indication of something wrong: Cache::fetchPage() checks that this page is
marked as used in the PageInventoryPage. This results in the error written:

      "During recovery, fetched page 5909 tablespace 1 type 9 marked free in PIP"

5. The code "corrects" this by calling PageInventoryPage::markPageInUse()
6. Next indication of something wrong: When IO::readPage() tries to read the page it get
an EOF error.

Conclusion: The overflow page does not exits in the database file. 

I have also done a scan of the database and found that there are also three other
DataPages that refers to OverflowPage that are behind EOF (data pages refer to overflow
pages 5902, 5909, 5913, 5929). The serial log (including log going back about three
checkpoints) only contains SRLOverflowPages log records where the max overflow page size
if 5854 (ie. all are within the existing database file).
[10 Jul 2009 12:16] Olav Sandstaa
As this crash has not been possible to reproduce with the falcon_pagesize2K test here is
the theory on what happened leading to the crash:

Just before the crash happened one of the Gophers executed the UpdateRecords log post.
This did the following changes to the database buffer:

  1. Updated the main data page for the record.
  2. Allocated a brand new overflow page that at this time did not exists in the database
file.
  3. Updated the page inventory page (due to allocating the new overflow page)

The gopher thread also wrote a SRLOverflowPages log record to the serial log which does
not get flushed.

Immediately after a checkpoint starts. During the checkpoint MySQL get "killed". At the
time it get killed it has written the datapage to disk but not the new overflow page nor
the PIP page.

During recovery the SRLOverflowPages log record is not found in the serial log due to not
having flushed it, so the PIP is not updated and the overflow page is not allocated during
phase 2 of recovery. During phase three when we redo the UpdateRecords log record the
crash occures due to we try to delete the overflow page that never got it to the disk
before the crash.
[10 Jul 2009 12:19] Olav Sandstaa
By instrumenting the Falcon code I have manged to make a "reproduction" that creates a
almost identical crash. Here is what I do (slightly simplified):

1. Run a single transaction that updates 1000 records in a table. Each record consist of
a key and a 4100 byte varchar field (initial size of the varchar field is 20 bytes) - so
this increases the size of every record by about 4K.

2. When this transaction is executed it produces one UpdateRecords log record and one
Commit log record. Serial log is flushed.

3. After the transaction has committed the Gopher executes the UpdateRecords log record.
For each record that is updated the Gopher will do:

    a) allocate a new OverflowPage
    b) write a new SRLOverflowPages log record
    c) update the original data page

   Note: no flushing of the serial log take place here and no writing of the database
buffer  takes place. But the size of the database have grown a lot due to allocating  a
bunch of new OverflowPages.

4. Checkpoint  starts:  and  get KILLED before it has written out all  pages.

   Note: no flushing has occurred on the SerialLog since the 1000 SRLOverflowPages
records got written - these are likely lost.

5. Recovery starts.

6.  Stage 2 recovery: no SRLOverflowPages are in the log

7. Stage 3 recovery (REDO): We redo the UpdateRecords log by doing the following:

      Section::updateRecord()
      DataPage::updateRecord(): if this DataPage was one of the pages that got written to
the disk before MySQL got killed it will find that there is an overflow page for this
record, so we try to delete that:
      DataPage::deleteOverflowPage(): we ask the Cache to fetch the page from disk:
      Cache::fetchPage(): ask the IO to read the page from disk
      IO::readPage(): if this page was newly allocated (which is likely) and was not
written to the disk before MySQL got killed this now leads to a "fatal" read of a block
that does not exits on disk - and recovery fails.
[10 Jul 2009 12:52] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/78392

2753 Olav Sandstaa	2009-07-10
      Fix for Bug#45301 Recovery crash due to fetched page marked free in PIP
      
      This crash occured during recovery due to at the time of the crash the main data
page for a record had
      written to the data base file while the PIP and the newly allocatedoverflow page
for the record had not 
      been written. The corresponding SRLOverflowPages log record was likely not flushed
to the serial log.
      This situation happened likely due to MySQL being killed during a Falcon check
point.
      
      This is now handled by extending the IO::readPage() so that an exception is thrown
in the cases there
      is reads beyond end-of-file during recovery. This exception is caught in
DataPage::deleteOverflowPages.
      If it happens during recovery, DataPage::deleteOverflowPages will handle the
exception by marking the
      overflow page as free in the Page Inventory Page.
      
      Note that if there are more than one overflow page for this record we might have
permanent loss of a few
      blocks in the database.
      
      The patch also fixes an missing handling of exceptions in Cache::fetchPage() where
an IO excecption would
      lead to a BDB having a use count that never got released and which resulted in a
crash during shutdown.
     @ storage/falcon/Cache.cpp
        In Cache::fetchPage: Add missing handling of exceptions that might occur during
reading of pages. If an exception was thrown the old code would not do proper clean-up
and the allocated BDB would still be allocated and have a use count. This would never be
released, take a buffer slot forever and lead to a crash during normal shutdown due to
the use count on the BDB. The changed code will handle the exception, release the
allocated BDB and rethrow the exception.
     @ storage/falcon/DataPage.cpp
        Add handling of read errors that occures when trying the access an overflow page
that should be deleted during recovery. If an IO error occurs we handle this by ignoring
this overflow page and just mark it as free in the Page Inventory Page.
     @ storage/falcon/Dbb.cpp
        Add a method to be used for checking if we currently is doing recovery on the
DBB.
     @ storage/falcon/Dbb.h
        Add a method to be used for checking if we currently is doing recovery on the
DBB.
     @ storage/falcon/IO.cpp
        Change how a read error that occurs during recovery is handled:
        1. Do not set the "fatal error" status of the IO class
        2. Throw an exception also if it is read beyond end-of-file (instead of crashing)