MySQL Bugs: #36526: Falcon deadlock when running sysbench

Bug #36526	Falcon deadlock when running sysbench
Submitted:	6 May 2008 10:57	Modified:	16 Oct 2008 21:12
Reporter:	Philip Stoev	Email Updates:
Status:	Can't repeat	Impact on me:	None
Category:	MySQL Server: Falcon storage engine	Severity:	S1 (Critical)
Version:	6.0.5	OS:	Any
Assigned to:	Kevin Lewis	CPU Architecture:	Any
Tags:	F_RECORD TREE

Description:
When running sysbench with 2000 concurrent users, the so called "high concurrency system test", Falcon deadlocked with all threads blocked.

+-----------------------------------------+-----------+-----------+--------+--------------+
| CALLER                                  | SHARED    | EXCLUSIVE | WAITS  | QUEUE_LENGTH |
+-----------------------------------------+-----------+-----------+--------+--------------+
| StorageConnection::getStorageConnection | 152881228 |      2002 |      6 |            0 |
| Database::retireRecords                 |         0 |      1333 |    548 |          273 |
| Cache::fetchPage                        | 455389368 |    856942 |  20169 |            1 |
| Cache::ioThread                         |         0 |     39164 |   1305 |            0 |
| Cache::markDirty                        |       394 |    609427 |     34 |            0 |
| SerialLogWindow::deactivateWindow       |    976720 |   5938169 | 260393 |            1 |
| Gopher::gopherThread pending            |   6431371 |   3185441 |   7649 |            0 |
| Index::detachDeferredIndex              |        10 |        14 |      1 |            0 |
| Transaction::rollback                   |  29069200 |    754546 | 659042 |            1 |
| Transaction::commit                     |       786 |   1366658 |     26 |            0 |
| StorageTableShare::setSequenceValue     |         0 |    868811 |     42 |            0 |
| Table::fetch                            | 671846999 |      7280 |   6188 |            0 |
| Table::validateUpdate                   |    193350 |       637 |      3 |            0 |
| Table::validateUpdate                   |    185776 |       653 |      1 |            0 |
| Table::validateUpdate                   |    208054 |       678 |      3 |            0 |
| Table::validateUpdate                   |    224208 |       632 |      2 |            0 |
| Table::validateUpdate                   |    225605 |       712 |      1 |            0 |
| Table::validateUpdate                   |    192182 |       659 |      2 |            0 |
| Table::validateUpdate                   |    208775 |       634 |      1 |            0 |
| Table::validateUpdate                   |    233713 |       663 |      2 |            0 |
| Table::validateUpdate                   |    204471 |       638 |      1 |            0 |
| Table::validateUpdate                   |    202926 |       658 |      3 |            0 |
| Table::validateUpdate                   |    218230 |       656 |      1 |            0 |
| Index::detachDeferredIndex              |   7870505 |     13400 |    112 |            0 |
| Index::detachDeferredIndex              |    114467 |   1380511 |    494 |            0 |
| RecordLeaf::fetch                       |   2438021 |         1 |      1 |            0 |
| RecordLeaf::fetch                       |   3217169 |         1 |      1 |            0 |
| RecordLeaf::fetch                       |   1990704 |         1 |      4 |            1 |
| RecordLeaf::fetch                       |   1561497 |         1 |     52 |           25 |
+-----------------------------------------+-----------+-----------+--------+--------------+

How to repeat:
Test case will hopefully follow shortly. I am also trying to obtain debug output from Falcon.

It appears that there is a deadlock between RecordLeaf::retireRecords2 and RecordLeaf::fetch.

I am attaching debug output on thread stalls and backtraces.

Thread stalls for bug 36526

Attachment: bug36526_stalls.txt (text/plain), 11.66 KiB.

thread apply all bt output for bug 36526

Attachment: bug36526_threads.zip (application/x-zip-compressed, text), 143.76 KiB.

This is not a traditional deadlock in which syncObjects are gained in opposite order.  

The Scavenger thread is holding up 642 other threads.  It has this call stack.
Thread 653 (process 9901):
#2  SyncObject::wait (..., type=Exclusive, ...) at SyncObject.cpp:413
#3  SyncObject::lock (..., type=Exclusive, timeout=0) at SyncObject.cpp:265
#4  Sync::lock (..., type=Exclusive) at Sync.cpp:58
#5  RecordLeaf::retireRecords() at RecordLeaf.cpp:171
#6  RecordGroup::retireRecords() at RecordGroup.cpp:124
#7  RecordGroup::retireRecords() at RecordGroup.cpp:124
#8  Table::retireRecords() at Table.cpp:1805
#9  Database::retireRecords() at Database.cpp:1807
#10 Database::scavenge() at Database.cpp:1722
#11 Scavenger::scavenge() at Scavenger.cpp:58

And it holds these SyncObjects;

#5  RecordLeaf::retireRecords()   waiting, Exclusive, syncPrior
#5  RecordLeaf::retireRecords()   locked,  Exclusive, RecordLeaf::syncObject
      There are 52 threads waiting for a shared lock on this 
      in RecordLeaf::fetch()
#8  Table::retireRecords()        locked,  Exclusive, Table::syncObject
      There are 42 threads waiting for an exclusive lock on this 
      in Table::validateAndInsert
#9  Database::retireRecords()     locked,  Exclusive, Database::syncScavenge
      There are 548 threads waiting for an exclusive lock on this
      with this call stack;
      #2  SyncObject::wait
      #3  SyncObject::lock
      #4  Sync::lock 
      #5  Database::retireRecords
      #6  Database::forceRecordScavenge
      #7  Table::allocRecordVersion
      #8  Table::fetchForUpdate 

I cannot find the thread 

662 total threads
643 waiting threads
19 non-waiting threads
  2 IO threads
  1 Ticker
  1 Pagewriter
  5 Gopher
  1 Scheduler
  4 io_handler_thread
  4 start_thread
  1 main

The scavenger thread which is waiting on a syncPrior is holding up everything.  I could not find a lock on a syncPrior in any of the waiting call stacks. There must be some kind if code path in which a syncPrior is left locked.

See bug #37395 for a situation with a simpliar stalled threads output.

This bug was fixed in 6.0.6 by a re-organization of the syncPrior locks.

Going back to the code for 6.0.5 I can see the deadlock;
The scavenger thread has an exclusive lock on Table::syncObject and is waiting on an exclusive lock on Table::syncPrior.

There are 42 threads stuck in Table::validateAndUpdate.  41 of these call it from Table::deleteRecord() where the 6.0.5 code got shared locks on Table::syncPrior before calling validateAndUpdate where they are waiting for an Exclusive lock on Table::syncObject.

The code as of version 6.0.6 does not lock Table::syncPrior in Table::deleteRecord() which avoids this deadlock.  In fact, it does not even lock syncPrior in validateAndUpdate.  It has this comment;
   // Do not need syncPrior here since this is a new record.
   // No other thread can see this record's priorVersion pointer.