MySQL Bugs: #41665: Falcon crash in Transaction::waitForTransaction

Bug #41665	Falcon crash in Transaction::waitForTransaction
Submitted:	21 Dec 2008 17:58	Modified:	15 May 2009 16:17
Reporter:	Philip Stoev	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Falcon storage engine	Severity:	S1 (Critical)
Version:	6.0-falcon-team	OS:	Any
Assigned to:	Olav Sandstå	CPU Architecture:	Any
Tags:	F_TRANSACTION

Description:
When executing a transactional RQG workload in 250 threads (500 connected clients total), mysqld crashed as follows (backtrace may be incomplete):

#0  0x00000000009898d1 in Transaction::waitForTransaction (this=0x7fc9347d60b8, transaction=0x7fc93c413938, transId=0, deadlock=0x7fc943d48e9f)
    at Transaction.cpp:982
#1  0x0000000000989b28 in Transaction::getRelativeState (this=0x7fc9347d60b8, transaction=0x7fc93c413938, transId=1976189, flags=1) at Transaction.cpp:849
#2  0x0000000000989c12 in Transaction::getRelativeState (this=0x7fc9347d60b8, record=0x7fc934c65e40, flags=1) at Transaction.cpp:795
#3  0x00000000009799f0 in Table::fetchForUpdate (this=0x7fc998b67800, transaction=0x7fc9347d60b8, source=0x7fc934c65e40, usingIndex=true) at Table.cpp:3525
#4  0x0000000000960da0 in StorageDatabase::nextIndexed (this=0x7fc998824210, storageTable=0x7fc93472f038, recordBitmap=0x7fc99883bb00, recordNumber=4,
    lockForUpdate=true) at StorageDatabase.cpp:430
#5  0x00000000009683eb in StorageTable::nextIndexed (this=0x7fc93472f038, recordNumber=0, lockForUpdate=true) at StorageTable.cpp:171
#6  0x000000000095815d in StorageInterface::index_next (this=0x7fc9340fe0c0, buf=0x7fc9340fe380 "Ъ") at ha_falcon.cpp:1687
#7  0x000000000095767f in StorageInterface::read_range_first (this=0x7fc9340fe0c0, start_key=0x7fc9340fe190, end_key=0x7fc9340fe1b0, eq_range_arg=true,
    sorted=false) at ha_falcon.cpp:1655
#8  0x00000000007fef26 in handler::multi_range_read_next (this=0x7fc9340fe0c0, range_info=0x7fc943d49230) at handler.cc:4270
#9  0x00000000007dc9c3 in QUICK_RANGE_SELECT::get_next (this=0x7fc934613500) at opt_range.cc:8596
#10 0x00000000007f888a in rr_quick (info=0x7fc943d494d0) at records.cc:322
#11 0x00000000007824bb in mysql_update (thd=0x3ce69f8, table_list=0x3d70c40, fields=@0x3ce8ae8, values=@0x3ce8ef0, conds=0x3d71de0, order_num=0, order=0x0,
    limit=18446744073709551615, handle_duplicates=DUP_ERROR, ignore=false) at sql_update.cc:582
#12 0x00000000006cdc7a in mysql_execute_command (thd=0x3ce69f8) at sql_parse.cc:3002
#13 0x00000000006d37d1 in mysql_parse (thd=0x3ce69f8,
    inBuf=0x3d70a10 "UPDATE `table10_falcon_int_autoinc` SET `int_key` = `int_key` + IF(`pk` % 2 = 1 , 30, -30) WHERE `pk` IN ( 4 , 7 )", length=114,
    found_semicolon=0x7fc943d4af00) at sql_parse.cc:5732
#14 0x00000000006d43bc in dispatch_command (command=COM_QUERY, thd=0x3ce69f8,
    packet=0x3cdc929 " UPDATE `table10_falcon_int_autoinc` SET `int_key` = `int_key` + IF(`pk` % 2 = 1 , 30, -30) WHERE `pk` IN ( 4 , 7 ) ",
    packet_length=116) at sql_parse.cc:1007
#15 0x00000000006d58e5 in do_command (thd=0x3ce69f8) at sql_parse.cc:690
#16 0x00000000006c3a69 in handle_one_connection (arg=0x3ce69f8) at sql_connect.cc:1154
#17 0x000000315b0073da in start_thread () from /lib64/libpthread.so.0
#18 0x000000315a4e627d in clone () from /lib64/libc.so.6

(gdb) list
977
978             if (!COMPARE_EXCHANGE_POINTER(&waitingFor, NULL, transaction))
979                     FATAL("waitingFor was not NULL");
980
981             volatile Transaction *trans;
982             for (trans = transaction->waitingFor; trans; trans = trans->waitingFor) <<<<<<<<<<<<<<<<< HERE
983                     if (trans == this)
984                             {
985                             *deadlock = true;
986                             break;

(gdb) print trans
$1 = (volatile Transaction *) 0xeeeeeeeeeeeeeeee

The output is confusing because the MySQL segfault handler does not appear to have been called properly. At the same time, the error log contains just the following:

/build/bzr/6.0-falcon-team/sql/mysqld(my_print_stacktrace+0x32)[0xc2bea8]
/build/bzr/6.0-falcon-team/sql/mysqld(handle_segfault+0x2a6)[0x6ba563]
/lib64/libpthread.so.0[0x315b00f0f0]
/build/bzr/6.0-falcon-team/sql/mysqld(_ZN11Transaction18waitForTransactionEPS_jPb+0x179)[0x9898d1]
/build/bzr/6.0-falcon-team/sql/mysqld(_ZN11Transaction16getRelativeStateEPS_jj+0x9e)[0x989b28]

Maybe a bug in the OS thread implementation?

How to repeat:
perl runall.pl \
--basedir=/build/bzr/6.0-falcon-team \
--grammar=conf/falcon_transactions.yy \
--gendata=conf/transactions.zz \
--engine=Falcon \
--queries=100000 \
--validator=DatabaseConsistency \
--mysqld=--falcon-lock-wait-timeout=1 \
--mysqld=--log-output=file \
--mysqld=--falcon-record-memory-max=1G \
--mysqld=--falcon-page-cache-size=1G \
--mysqld=--skip-safemalloc \
--mem \
--duration=3600 \
--threads=250 \
--mysqld=--max-connections=2048

Grammar file for bug 41665

Attachment: bug41665.yy (application/octet-stream, text), 1.44 KiB.

To reproduce within 5 min:

$ perl runall.pl \
  --basedir=/build/bzr/6.0-falcon-team \
  --grammar=bug41665.yy \
  --gendata=conf/transactions.zz \
  --engine=Falcon \
  --queries=100000 \
  --mysqld=--falcon-lock-wait-timeout=1 \
  --mysqld=--log-output=file \
  --mysqld=--falcon-record-memory-max=1G \
  --mysqld=--falcon-page-cache-size=1G \
  --mysqld=--skip-safemalloc \
  --mem \
  --duration=360 \
  --mysqld=--max-connections=2048 \
  --threads=50

The queries involved are simple UPDATEs that update records from a 10-row table over and over again.

Olav,  I was running the following RQG script today, trying to reproduce 42340, and I hit this crash. (on windows, dual core)
perl runall.pl \
  --mysqld=--falcon-page-size=16K \
  --rows=1000 \
  --threads=4 \
  --mask=1487 \
  --queries=1000000 \
  --duration=300 \
  --basedir=C:\Work\bzr\Chg-09\mysql-6.0-falcon-team \
  --engine=Falcon \
  --grammar=conf/combinations.yy \
  --gendata=conf/combinations.zz \
  --reporter=ErrorLog,Backtrace \
  --mysqld=--log-output=none

Partial call stack;

	Transaction::waitForTransaction(Transaction * transaction=0x0420cc40, unsigned int transId=0, bool * deadlock=0x0739b3a3)  Line 983 + 0xe bytes	C++
	Transaction::getRelativeState(Transaction * transaction=0x0420cc40, unsigned int transId=21002, unsigned int flags=1)  Line 853	C++
	Transaction::getRelativeState(Record * record=0x069b4e40, unsigned int flags=1)  Line 808 + 0x39 bytes	C++
	Table::checkUniqueRecordVersion(int recordNumber=2915, Index * index=0x04177eb0, Transaction * transaction=0x04211238, RecordVersion * record=0x069b4f90, Sync * syncUnique=0x0739df5c)  Line 2625 + 0xe bytes	C++
 	Table::checkUniqueIndex(Index * index=0x04177eb0, Transaction * transaction=0x04211238, RecordVersion * record=0x069b4f90, Sync * sync=0x0739df5c)  Line 2510 + 0x1f bytes	C++
 	Table::insertIndexes(Transaction * transaction=0x04211238, RecordVersion * record=0x069b4f90)  Line 1268 + 0x18 bytes	C++
 	Table::insert(Transaction * transaction=0x04211238, Stream * stream=0x040e1800)  Line 3056	C++
 	StorageDatabase::insert(Connection * connection=0x0419d0f8, Table * table=0x040bf878, Stream * stream=0x040e1800)  Line 267	C++
 	StorageTable::insert()  Line 109 + 0x28 bytes	C++

So when this call was made, the record had a transaction attached.
storage\falcon\Transaction.cpp(808):	
 	State state = getRelativeState(record->getTransaction()
But by the time it got to the crash site, the record->transaction was null and the transaction memory location was freed.

That makes this pretty much a duplicate of Bug#41357.  A transaction can get purged just after its address was read from the RecorVersion object.
The email chain 'Problems with record visibility and how it is computed' on falcon@lists.mysql.com adresses this problem.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/70735

3087 Olav Sandstaa	2009-03-27
      Fix for Bug #41665 Falcon crash in Transaction::waitForTransaction 
      
      This crash occurred in the deadlock detector code. The current thread
      A is traversing the waitingFor list while another thread C has just
      aborted (or possibly committed) and is releasing its reference count
      on the transaction state object. A third thread, B, that is just in
      front of C in the waitingFor list is waken up, and also releases its
      reference count on C. If A is inspecting object C as this happens,
      this object is deleted - and can result in thread A crashing.
      
      The fix to Transaction::waitForTransaction(): Before releasing the
      reference count on the transaction state object we aquire an exclusive
      lock on the active transaction list.  This will avoid that the
      transaction state object is released and deleted while another thread
      is transversing the waitingFor list.
      
      Also added a missing call to transState->release() in the catch block.
     @ storage/falcon/Transaction.cpp
        Fix to Transaction::waitForTransaction(): Before releasing the
        reference count on the transaction state object we aquire an exclusive
        lock on the active transaction list.  The cause for this is that
        during the deadlock detektion code another thread might have used this
        transaction state object's waitingFor pointer to navigate to the next
        transaction state object - and this thread might be the only one that
        has a reference count on that transaction state object.  This will
        avoid that the transaction state object is released and deleted while
        another thread is transversing the waitingFor list.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/70762

3084 Olav Sandstaa	2009-03-28
      Fix for Bug #41665 Falcon crash in Transaction::waitForTransaction 
      
      This crash occurred in the deadlock detector code. The current thread
      A is traversing the waitingFor list while another thread C has just
      aborted (or possibly committed) and is releasing its reference count
      on the transaction state object. A third thread, B, that is just in
      front of C in the waitingFor list is waken up, and also releases its
      reference count on C. If A is inspecting object C as this happens,
      this object is deleted - and can result in thread A crashing.
      
      The fix to Transaction::waitForTransaction(): Before releasing the
      reference count on the transaction state object we aquire an exclusive
      lock on the active transaction list.  This will avoid that the
      transaction state object is released and deleted while another thread
      is transversing the waitingFor list.
      
      Also added a missing call to transState->release() in the catch block.
     @ storage/falcon/Transaction.cpp
        Fix to Transaction::waitForTransaction(): Before releasing the
        reference count on the transaction state object we aquire an exclusive
        lock on the active transaction list.  The cause for this is that
        during the deadlock detektion code another thread might have used this
        transaction state object's waitingFor pointer to navigate to the next
        transaction state object - and this thread might be the only one that
        has a reference count on that transaction state object.  This will
        avoid that the transaction state object is released and deleted while
        another thread is transversing the waitingFor list.

Pushed into 6.0.11-alpha (revid:hky@sun.com-20090402144811-yc5kp8g0rjnhz7vy) (version source revid:olav@sun.com-20090328084403-ojb0712tyxlemko4) (merge vers: 6.0.11-alpha) (pib:6)

Internal/test fix. No changelog entry required.