Bug #44096 Exception: Recovery failed: corrupted serial log
Submitted: 5 Apr 2009 13:10 Modified: 15 May 2009 14:49
Reporter: Philip Stoev Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Falcon storage engine Severity:S3 (Non-critical)
Version:6.0-falcon-team OS:Any
Assigned to: Ann Harrison CPU Architecture:Any
Tags: F_RECOVERY
Triage: Triaged: D1 (Critical)

[5 Apr 2009 13:10] Philip Stoev
Description:
After a very simple RQG kill -9 test, Falcon recovery failed as follows:

Recovering database /ide/april05/simple/vardir1/master-data-copy/falcon_master.fts ...

First recovery block is 84343
Last recovery block is 92605
Exception: corrupted serial log
Exception: Recovery failed: corrupted serial log
090405 16:04:29 [ERROR] Falcon: Recovery failed: corrupted serial log
090405 16:04:29 [ERROR] Plugin 'Falcon' init function returned error.
090405 16:04:29 [ERROR] Plugin 'Falcon' registration as a STORAGE ENGINE failed.

This is the first and only occurrence of this exception so far.

Before the kill -9, the following was printed in the error log:

------------------------------------
Stalled threads
  Thread 0x7f2fe13bb9e0 (-792102576) sleep=0, grant=0, locks=1, who 0, parent=(nil)
    Pending StorageHandler::getStorageConnection state 0 (1) syncObject 0x7f2fe6775380
  Thread 0x7f2fe674e980 (-789706416) sleep=1, grant=0, locks=1, who 0, parent=(nil)
    Pending StorageHandler::getStorageConnection state 0 (1) syncObject 0x7f2fe6775380
  Thread 0x7f2fe128d468 (-789972656) sleep=1, grant=0, locks=1, who 0, parent=(nil)
    Pending StorageHandler::getStorageConnection state 0 (1) syncObject 0x7f2fe6775380
  Thread 0x7f2fe12cf860 (-790238896) sleep=1, grant=0, locks=1, who 0, parent=(nil)
    Pending StorageHandler::getStorageConnection state 0 (1) syncObject 0x7f2fe6775380
  Thread 0x7f2fe131def8 (-790505136) sleep=1, grant=0, locks=1, who 0, parent=(nil)
    Pending StorageHandler::getStorageConnection state 0 (1) syncObject 0x7f2fe6775380
  Thread 0x7f2fe136fcd8 (-790771376) sleep=1, grant=0, locks=1, who 0, parent=(nil)
    Pending StorageHandler::getStorageConnection state 0 (1) syncObject 0x7f2fe6775380
  Thread 0x7f2fe138eab0 (-791037616) sleep=1, grant=0, locks=1, who 0, parent=(nil)
    Pending StorageHandler::getStorageConnection state 0 (1) syncObject 0x7f2fe6775380
  Thread 0x7f2fe13a1e08 (-791303856) sleep=1, grant=0, locks=1, who 0, parent=(nil)
    Pending StorageHandler::getStorageConnection state 0 (1) syncObject 0x7f2fe6775380
  Thread 0x7f2fe13a55d0 (-791570096) sleep=1, grant=0, locks=1, who 0, parent=(nil)
    Pending StorageHandler::getStorageConnection state 0 (1) syncObject 0x7f2fe6775380
  Thread 0x7f2fe13b7480 (-791836336) sleep=1, grant=0, locks=1, who 0, parent=(nil)
    Pending StorageHandler::getStorageConnection state 0 (1) syncObject 0x7f2fe6775380
Stalled synchronization objects:
  SyncObject 7f2fe6775380: state 16, readers 0, monitor 0, waiters 10
    Waiting thread e674e980 (-789706416), type 1; StorageHandler::getStorageConnection
    Waiting thread e128d468 (-789972656), type 1; StorageHandler::getStorageConnection
    Waiting thread e12cf860 (-790238896), type 1; StorageHandler::getStorageConnection
    Waiting thread e131def8 (-790505136), type 1; StorageHandler::getStorageConnection
    Waiting thread e136fcd8 (-790771376), type 1; StorageHandler::getStorageConnection
    Waiting thread e138eab0 (-791037616), type 1; StorageHandler::getStorageConnection
    Waiting thread e13a1e08 (-791303856), type 1; StorageHandler::getStorageConnection
    Waiting thread e13a55d0 (-791570096), type 1; StorageHandler::getStorageConnection
    Waiting thread e13b7480 (-791836336), type 1; StorageHandler::getStorageConnection
    Waiting thread e13bb9e0 (-792102576), type 1; StorageHandler::getStorageConnection
------------------------------------

How to repeat:
The RQG command the produced this failure is 

perl runall.pl --engine=Falcon  --reporters=Deadlock,ErrorLog,Backtrace,Recovery,Shutdown   --mysqld=--loose-falcon-lock-wait-timeout=1   --mysqld=--loose-innodb-lock-wait-timeout=1   --mysqld=--log-output=none   --mysqld=--skip-safemalloc   --mysqld=--transaction-isolation=REPEATABLE-READ --mysqld=--falcon-page-size=16K  --rows=1000 --threads=64 --mysqld=--falcon-page-cache-size=1K   --mask=55525   --queries=100000000   --duration=900   --basedir=/build/bzr/6.0-falcon-team  --gendata=conf/combinations.zz  --grammar=conf/combinations.yy

The only special thing is the very small ammount of page cache (which the engine adjusted upwards on startup) combined with a non-default page-size.
[5 Apr 2009 13:19] Philip Stoev
Tablespace before recovery:

http://mysql-systemqa.s3.amazonaws.com/bug44096.zip
[5 Apr 2009 15:31] Vladislav Vaintroub
changed severity to non-critical based on "all non-default" parameter values and extremely hard to reproduce scenario
[8 Apr 2009 15:53] Ann Harrison
Underscoring Vlad's comment.  The page cache size is set to 1K, which is
1/16th of the page size.  Falcon will round this to a whole page, but a
page cache that holds only one page is really small.
[8 Apr 2009 16:05] Philip Stoev
Falcon adjusts the page cache size to 2Mb:

090408 19:04:14 [Warning] option 'falcon-page-cache-size': unsigned value 1024 adjusted to 2097152

so the page cache size is actually fairly reasonable.
[9 Apr 2009 20:10] Ann Harrison
Had nothing to do with the cache size - the crash happened after
an old log file had been truncated.  Recovery was entirely from
the new log file, but still checked that the two files were
contiguous, which they weren't because the old one had been
truncated.
[10 Apr 2009 9:04] Philip Stoev
I am afraid that before we close this bug I would need that non-record serial log operations are also tagged with Vlad's crash hooks (http://lists.mysql.com/commits/68748).

This way a crash can be produced after the operation in question. Such operations would be anything that operates on the log, including:

* writing any headers, footers, EOF records, checksums, etc. for the serial log;
* writing any headers, footers, EOF records, checksums, etc. for the individual serial log windows;
* truncating the log
* rotating from one log to the other (if this involves a disk write)
[13 Apr 2009 16:21] Kevin Lewis
=== modified file 'storage/falcon/SerialLog.cpp'
--- storage/falcon/SerialLog.cpp    2009-04-09 20:04:02 +0000
+++ storage/falcon/SerialLog.cpp    2009-04-10 15:52:23 +0000
@@ -274,6 +274,10 @@
         }

     // Pick a window to start the search for the last block
+    // If there are two files, we want to find the last block
+    // in the second file.  Given the way the windows were
+    // allocated, the fl2 window is linked in before the file1
+    // window.  If they're in the other order swap them around.

     SerialLogWindow *recoveryWindow = NULL;
     SerialLogWindow *otherWindow = NULL;
@@ -310,38 +314,63 @@
         return;
         }

-    // Look through windows looking for the very last block
+    // recoveryWindow is the first window in the second file.
+    // Look through windows looking for the very last block,
+    // starting by remember the first block in the recoverWindow
+    // as recoveryBlock and it's number as recoveryBlockNumber.
+    // These are used only to determine whether recovery is all
+    // from one file, or whether it needes to start in the earlier
+    // file.

     SerialLogBlock *recoveryBlock = recoveryWindow->firstBlock();
     recoveryBlockNumber = recoveryBlock->blockNumber;
     SerialLogBlock *lastBlock = findLastBlock(recoveryWindow);
-    Log::log("\nFirst recovery block is " I64FORMAT "\n",
+
+    // The last block's "readBlock" number is the first block where we
+    // actually perform recovery - it the first block that has something
+    // to do with a transaction that's not write complete.
+
+    // Debugging hint.  If there are unidentified objects in recovery -
+    // unwritten pages or something, try starting recovery at the first
+    // block in the first file - if and only if the two are contigous.
+
+    uint64 readBlockNumber = lastBlock->readBlockNumber;
+   
+    if (readBlockNumber == 0)
+        readBlockNumber = lastBlock->blockNumber;
+
+    Log::log("\nFirst block in the serial log is " I64FORMAT "\n",
             (otherWindow) ? otherWindow->firstBlock()->blockNumber : recoveryBlockNumber);
-    Log::log("Last recovery block is " I64FORMAT "\n", lastBlock->blockNumber);
+    Log::log("Last block in the serial log is " I64FORMAT "\n", lastBlock->blockNumber);
+    Log::log("First block used in recovery is " I64FORMAT "\n", readBlockNumber);
+
+    // the nextBlockNumber is the first block number to use after
+    // recovery starts
+
+    nextBlockNumber = lastBlock->blockNumber + 1;
+
+    // If we're using both files, they'd better be contiguous.  Check and
+    // when done, release any window allocated in the process. A side
+    // effect of the check is that it sets up the first window's last
+    // block number - that's used later.

     if (otherWindow)
         {
         SerialLogBlock *block = findLastBlock(otherWindow);

-        // If we have to use both files, be sure they're contiguous
-
-        if (block && lastBlock->readBlockNumber < recoveryBlock->blockNumber)
-            if (block->blockNumber != (recoveryBlockNumber - 1))
+        if (block && block->blockNumber != (recoveryBlockNumber - 1))
+            if (readBlockNumber < recoveryBlock->blockNumber)
                 throw SQLError(LOG_ERROR, "corrupted serial log");
        
         SerialLogWindow *window = findWindowGivenBlock(block->blockNumber);
         window->deactivateWindow();
         }

-    // Find read block
-
-    uint64 readBlockNumber = lastBlock->readBlockNumber;
-   
-    if (readBlockNumber == 0)
-        readBlockNumber = lastBlock->blockNumber;
-       
-    Log::log("Recovery read block is " I64FORMAT "\n", readBlockNumber);
-    nextBlockNumber = lastBlock->blockNumber + 1;
+    // OK.  Everything looks good.  Release the window we used
+    // probing the log files, open a window on the first block
+    // we're going to use for recovery.  That should work, but
+    // if it doesn't, complain.
+
     lastWindow->deactivateWindow();
     SerialLogWindow *window = findWindowGivenBlock(readBlockNumber);
    
@@ -356,12 +385,19 @@
     writeBlock = NULL;
     control.setWindow(window, block, 0);
     SerialLogRecord *record;
+
+    // OK, we're good to go.  Start recovering.
+
     pass1 = true;
     recoveryPages = new RecoveryObjects(this);
     recoverySections = new RecoveryObjects(this);
     recoveryIndexes = new RecoveryObjects(this);
-    recoveryPhase = 1;    // Take Inventory (serialLogTransactions, recoveryObject states, last checkpoint)
+    recoveryPhase = 1;   
    
+    // Phase 1 - read from the start to end of the part of the
+    // log that's involved in recovery.  Take Inventory of serialLogTransactions,
+    // recoveryObject states, last checkpoint)
+
     Log::log("Recovery phase 1...\n");

     unsigned long int recordCount = 0;
[15 Apr 2009 16:59] Bugs System
Pushed into 6.0.11-alpha (revid:hky@sun.com-20090415164923-9arx29ye5pzxd4zf) (version source revid:hky@sun.com-20090414213212-ibcjs12ic1v2h0e5) (merge vers: 6.0.11-alpha) (pib:6)
[15 Apr 2009 22:12] Kevin Lewis
The key change here, besides the very helpful comments, is 
+            if (readBlockNumber < recoveryBlock->blockNumber)

This prevents the SQLError from being thrown when the old serial log file has been truncated.
[15 May 2009 14:49] MC Brown
Internal/test fix. No changelog entry required.