Bug #39678 Assertion (bdb) fails in IndexRootPage::indexMerge during recovery
Submitted: 26 Sep 2008 18:11 Modified: 15 May 2009 12:45
Reporter: John Embretsen Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Falcon storage engine Severity:S1 (Critical)
Version:6.0-falcon-team OS:Any
Assigned to: Christopher Powers CPU Architecture:Any
Tags: F_ONLINE ALTER, pb2, pushbuild, test failure
Triage: Triaged: D1 (Critical)

[26 Sep 2008 18:11] John Embretsen
Description:
Recovery attempts after running the generated falcon_online_alter test (as defined in "Pushbuild 2") fail with the following error on Linux and Solaris platforms:

[Falcon] Error: assertion (bdb) failed at line 951 in file IndexRootPage.cpp

Stack trace (linux):

# 14:03:04 stack_bottom = (nil) thread_stack 0x30c00
# 14:03:04 /export/home/pb2/test/sb-40647-1222429271.72/mysql-6.0.8-alpha-linux-i686-test/libexec/mysqld(my_print_stacktrace+0x26) [0x87e9197]
# 14:03:04 /export/home/pb2/test/sb-40647-1222429271.72/mysql-6.0.8-alpha-linux-i686-test/libexec/mysqld(handle_segfault+0x2cd) [0x82ae5a5]
# 14:03:04 [0x7c6420]
# 14:03:04 /export/home/pb2/test/sb-40647-1222429271.72/mysql-6.0.8-alpha-linux-i686-test/libexec/mysqld(Error::debugBreak()+0x12) [0x852ad44]
# 14:03:04 /export/home/pb2/test/sb-40647-1222429271.72/mysql-6.0.8-alpha-linux-i686-test/libexec/mysqld(Error::error(char const*, ...)+0x7a) [0x852adc0]
# 14:03:04 /export/home/pb2/test/sb-40647-1222429271.72/mysql-6.0.8-alpha-linux-i686-test/libexec/mysqld(Error::assertionFailed(char const*, char const*, int)+0x27) [0x852ae65]
# 14:03:04 /export/home/pb2/test/sb-40647-1222429271.72/mysql-6.0.8-alpha-linux-i686-test/libexec/mysqld(IndexRootPage::indexMerge(Dbb*, int, SRLUpdateIndex*, unsigned int)+0x17f) [0x85a17ff]
# 14:03:04 /export/home/pb2/test/sb-40647-1222429271.72/mysql-6.0.8-alpha-linux-i686-test/libexec/mysqld(SRLUpdateIndex::execute()+0xe7) [0x85dfc01]
# 14:03:04 /export/home/pb2/test/sb-40647-1222429271.72/mysql-6.0.8-alpha-linux-i686-test/libexec/mysqld(SRLUpdateIndex::redo()+0x11) [0x85dfc35]
# 14:03:04 /export/home/pb2/test/sb-40647-1222429271.72/mysql-6.0.8-alpha-linux-i686-test/libexec/mysqld(SerialLog::recover()+0x97d) [0x85f0cb9]
# 14:03:04 /export/home/pb2/test/sb-40647-1222429271.72/mysql-6.0.8-alpha-linux-i686-test/libexec/mysqld(Database::openDatabase(char const*)+0x376) [0x857bcc4]
# 14:03:04 /export/home/pb2/test/sb-40647-1222429271.72/mysql-6.0.8-alpha-linux-i686-test/libexec/mysqld(Connection::getDatabase(char const*, char const*, Threads*)+0x149) [0x856f8bb]
# 14:03:04 /export/home/pb2/test/sb-40647-1222429271.72/mysql-6.0.8-alpha-linux-i686-test/libexec/mysqld(Connection::openDatabase(char const*, char const*, char const*, char const*, char const*, Threads*)+0x198) [0x8570dd6]
# 14:03:04 /export/home/pb2/test/sb-40647-1222429271.72/mysql-6.0.8-alpha-linux-i686-test/libexec/mysqld(StorageDatabase::getOpenConnection()+0x86) [0x8534eba]
# 14:03:04 /export/home/pb2/test/sb-40647-1222429271.72/mysql-6.0.8-alpha-linux-i686-test/libexec/mysqld(StorageHandler::initialize()+0x9f) [0x8537565]
# 14:03:04 /export/home/pb2/test/sb-40647-1222429271.72/mysql-6.0.8-alpha-linux-i686-test/libexec/mysqld(StorageInterface::falcon_init(void*)+0x215) [0x85287cf]
# 14:03:04 /export/home/pb2/test/sb-40647-1222429271.72/mysql-6.0.8-alpha-linux-i686-test/libexec/mysqld(ha_initialize_handlerton(st_plugin_int*)+0x99) [0x83dfd01]
# 14:03:04 /export/home/pb2/test/sb-40647-1222429271.72/mysql-6.0.8-alpha-linux-i686-test/libexec/mysqld [0x84914c4]
# 14:03:04 /export/home/pb2/test/sb-40647-1222429271.72/mysql-6.0.8-alpha-linux-i686-test/libexec/mysqld(plugin_init(int*, char**, int)+0x5ec) [0x84928c8]
# 14:03:04 /export/home/pb2/test/sb-40647-1222429271.72/mysql-6.0.8-alpha-linux-i686-test/libexec/mysqld [0x82b0fdc]
# 14:03:04 /export/home/pb2/test/sb-40647-1222429271.72/mysql-6.0.8-alpha-linux-i686-test/libexec/mysqld(main+0x228) [0x82b1f16]
# 14:03:04 /lib/libc.so.6(__libc_start_main+0xdc) [0x125dec]
# 14:03:04 /export/home/pb2/test/sb-40647-1222429271.72/mysql-6.0.8-alpha-linux-i686-test/libexec/mysqld(__gxx_personality_v0+0x3cd) [0x81cc111]

How to repeat:
Branch or checkout the mysql-test-extra-6.0 bazaar tree and run:

$ cd mysql-test/gentest
$ ./runall.pl \
 --basedir=/path/to/mysql-6.0-falcon-team \
 --reporters=Deadlock,ErrorLog,Backtrace,Recovery \
 --mysqld=--loose-falcon-lock-wait-timeout=1 \
 --engine=Falcon \
 --grammar=conf/falcon_online_alter.yy \
 --queries=10000 \
 --duration=1200
[21 Jan 2009 21:04] Christopher Powers
Did some debugging on this...

The assertion occurs in IndexRootPage::indexMerge(), line 951, during
the processing of an SRLUpdateIndex record for indexId == 1288:

Error::debugBreak()
Error::error(char const*, ...)
Error::assertionFailed(char const*, char const*, int)
IndexRootPage::indexMerge(Dbb*, int, SRLUpdateIndex*, unsigned int) <<<<
SRLUpdateIndex::execute()
SRLUpdateIndex::redo()
SerialLog::recover()
Database::openDatabase(char const*)
Connection::getDatabase(char const*, char const*, Threads*)
Connection::openDatabase(char const*, char const*, char const*, char const*, char const*, Threads*)
StorageDatabase::getOpenConnection()
StorageHandler::initialize()
StorageInterface::falcon_init(void*)
ha_initialize_handlerton(st_plugin_int*)

void IndexRootPage::indexMerge(Dbb *dbb, int indexId,
                               SRLUpdateIndex *logRecord, TransId transId)
{
  IndexKey key;
  int recordNumber = logRecord->nextKey(&key);
[...]  
  for (; recordNumber != -1;)
    {
    // Position to insert first key (clone of addIndexEntry)
    IndexKey searchKey(key);
    searchKey.appendRecordNumber(recordNumber);
[..]      
    Bdb *bdb = findInsertionLeaf(dbb, indexId, &searchKey, ...);
    ASSERT(bdb);  <<<<

The sequence of events recorded in the serial log indicates the following:

1. Index 1288 was created
2. 10 records were inserted into the index
3. Index 1288 was deleted
4. 1 more record, id = 756, was inserted into index 1288

Recovery fails because by the time the SRLUpdateIndex entry for record 756 is 
processed, the state of the recovery object associated with index 1288 is
'active', and SRLUpdateIndex::execute() attempts to insert the record
into a non-existent index.

The recovery completes successfully if the serial log entry for record 756 is skipped.

The falcon_online_alter stress test issues thousands of multiple concurrent
online add/drop index commands--obviously an extreme scenario--so it is
likely that an attempted index update would coincide or trail the deletion
of the same index.

The MySQL online alter API has a known weakness such that concurrent operations
can sneak in between the second and third phase of an online DDL operation. This
may be a contributing factor, however, Falcon should still guard against such
occurrences.

Clearly, the serial log should not contain an SRLUpdateIndex event for a deleted index--that's the core issue here--but should recovery have skipped the SRLUpdateIndex record anyway?

Would that make recovery more robust?
[21 Jan 2009 21:17] Christopher Powers
I can provide the corrupted database files upon request (128mb tarball). This is can be easily debugged in Windows or Linux: Simply copy the files into ./mysql-test/var/master-data and fire up Falcon.

Useful breakpoints (set condition 'indexId == 1288')

IndexRootPage.cpp:  952 (no condition)
IndexRootPage.cpp:  929 ('indexId == 1288') indexMerge()
SRLCreateIndex.cpp:  62 ('indexId == 1288') indexMerge()
SRLCreateIndex.cpp:  45 ('indexId == 1288') read()
SRLDeleteIndex.cpp:  69 ('indexId == 1288') append()
SRLUpdateIndex.cpp: 149 ('indexId == 1288') read()
SRLUpdateIndex.cpp: 194 ('indexId == 1288') read()
SRLDeleteIndex.cpp:  79 ('indexId == 1288') execute()
[21 Jan 2009 22:07] Christopher Powers
For Falcon recovery enthusiasts, I've uploaded the corrupted database (118mb):

bug_39678.tar
[21 Jan 2009 22:13] Ann Harrison
In theory, the first phase of recovery should have marked that
incarnation of the index as having been deleted.  Operations on
dead incarnations of objects (tablespaces, tables, indexes) should
be skipped.
[22 Jan 2009 7:35] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/63790

2969 Christopher Powers	2009-01-22
      Bug #39678 Assertion (bdb) fails in IndexRootPage::indexMerge during recovery
      
      Improved synchronization between concurrent DROP INDEX and INSERT/UPDATE operations
      on the same table.
[22 Jan 2009 18:09] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/63827

2970 Christopher Powers	2009-01-22 [merge]
      Bug #39678 Assertion (bdb) fails in IndexRootPage::indexMerge during recovery
      
      Fixed bug in SRLUpdateIndex.
[22 Jan 2009 18:43] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/63834

2970 Christopher Powers	2009-01-22
      Bug #39678 Assertion (bdb) fails in IndexRootPage::indexMerge during recovery
            
      Improved synchronization between concurrent DROP INDEX and INSERT/UPDATE operations
      on the same table.
[13 Feb 2009 7:24] Bugs System
Pushed into 6.0.10-alpha (revid:alik@sun.com-20090211182317-uagkyj01fk30p1f8) (version source revid:christopher.powers@sun.com-20090122184002-co7njgsqxlu3zqd4) (merge vers: 6.0.10-alpha) (pib:6)
[15 May 2009 12:45] MC Brown
An entry has been added to the 6.0.10 changelog: 

When performing concurrent DROP INDEX and INSERT or UPDATE operations on a Falcon table, an assertion could occur when recovering from a crashed instance.