MySQL Bugs: #39321: Falcon deadlock between Table::retireRecords and Database::retireRecords

Bug #39321	Falcon deadlock between Table::retireRecords and Database::retireRecords
Submitted:	8 Sep 2008 15:58	Modified:	9 Jan 2009 14:13
Reporter:	Philip Stoev
Status:	Closed
Category:	MySQL Server: Falcon storage engine	Severity:	S1 (Critical)
Version:	6.0-falcon-team	OS:	Any
Assigned to:	Kevin Lewis	CPU Architecture:	Any

Description:
When executing the falcon_recovery.yy test, Falcon deadlocked as follows:

Stalled threads
  Thread 0xb708cb58 (-1489790064) sleep=0, grant=0, locks=1, who 0, parent=(nil)
    Pending Table::findField state 0 (2) syncObject 0xb3afc728
  Thread 0xb70bcc00 (-1476863088) sleep=1, grant=0, locks=1, who 0, parent=(nil)
    Pending Database::retireRecords(1) state 0 (1) syncObject 0xb7282f2c
  Thread 0xb70e7b70 (-1477665904) sleep=1, grant=0, locks=1, who 0, parent=(nil)
    Pending Table::retireRecords state 0 (2) syncObject 0xb3afc728
  Thread 0xb70e85c8 (-1478268016) sleep=1, grant=0, locks=1, who 0, parent=(nil)
    Pending Database::retireRecords(1) state 0 (1) syncObject 0xb7282f2c
  Thread 0xb70b9c00 (-1477063792) sleep=1, grant=0, locks=1, who 0, parent=(nil)
    Pending Database::retireRecords(1) state 0 (1) syncObject 0xb7282f2c
  Thread 0xb70ef9f8 (-1477465200) sleep=1, grant=0, locks=1, who 0, parent=(nil)
    Pending Database::retireRecords(1) state 0 (1) syncObject 0xb7282f2c
  Thread 0xb70e8058 (-1478468720) sleep=1, grant=0, locks=2, who 0, parent=(nil)
    Pending Database::retireRecords(1) state 0 (1) syncObject 0xb7282f2c
  Thread 0xb70ef118 (-1478067312) sleep=1, grant=0, locks=1, who 0, parent=(nil)
    Pending Database::retireRecords(1) state 0 (1) syncObject 0xb7282f2c
  Thread 0xb70dc760 (-1477264496) sleep=1, grant=0, locks=3, who 0, parent=(nil)
    Pending Database::retireRecords(1) state 0 (1) syncObject 0xb7282f2c
  Thread 0xb70dc840 (-1477866608) sleep=1, grant=0, locks=2, who 0, parent=(nil)
    Pending Database::retireRecords(1) state 0 (1) syncObject 0xb7282f2c
Stalled synchronization objects:
  SyncObject b3afc728: state -1, readers 0, monitor 0, waiters 2
    Exclusive thread b70bcc00 (-1476863088), type 1; Database::retireRecords(1)
    Waiting thread b70e7b70 (-1477665904), type 2; Table::retireRecords
    Waiting thread b708cb58 (-1489790064), type 2; Table::findField
  SyncObject b7282f2c: state -1, readers 0, monitor 0, waiters 8
    Exclusive thread b70e7b70 (-1477665904), type 2; Table::retireRecords
    Waiting thread b70e85c8 (-1478268016), type 1; Database::retireRecords(1)
    Waiting thread b70b9c00 (-1477063792), type 1; Database::retireRecords(1)
    Waiting thread b70ef9f8 (-1477465200), type 1; Database::retireRecords(1)
    Waiting thread b70e8058 (-1478468720), type 1; Database::retireRecords(1)
    Waiting thread b70ef118 (-1478067312), type 1; Database::retireRecords(1)
    Waiting thread b70bcc00 (-1476863088), type 1; Database::retireRecords(1)
    Waiting thread b70dc760 (-1477264496), type 1; Database::retireRecords(1)
    Waiting thread b70dc840 (-1477866608), type 1; Database::retireRecords(1)

Thread b70e7b70 holds SyncObject b7282f2c and waits on b3afc728, which is held by b70bcc00; b70bcc00 in turn waits on b7282f2c, so each thread waits on the other.
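
The cycle can be checked mechanically from the "Stalled synchronization objects" section. The following is a minimal sketch, not Falcon code: the struct and function names are hypothetical, and it assumes each stalled thread is pending on exactly one sync object, as in the dump above.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical model of one entry in the "Stalled synchronization objects" dump.
struct SyncObjectState {
    std::string holder;                // thread listed as "Exclusive thread"
    std::vector<std::string> waiters;  // threads listed as "Waiting thread"
};

// The two sync objects from this report, transcribed from the dump.
std::vector<SyncObjectState> reportedDump() {
    return {
        // SyncObject b3afc728
        {"b70bcc00", {"b70e7b70", "b708cb58"}},
        // SyncObject b7282f2c
        {"b70e7b70", {"b70e85c8", "b70b9c00", "b70ef9f8", "b70e8058",
                      "b70ef118", "b70bcc00", "b70dc760", "b70dc840"}},
    };
}

// Builds wait-for edges (waiter -> holder) and follows them from `start`.
// Returns true if the chain leads back to `start`, i.e. the thread is
// part of a deadlock cycle rather than merely queued behind one.
bool inWaitCycle(const std::vector<SyncObjectState>& objects,
                 const std::string& start) {
    std::map<std::string, std::string> waitsFor;
    for (const auto& obj : objects)
        for (const auto& waiter : obj.waiters)
            waitsFor[waiter] = obj.holder;

    std::string current = start;
    // A cycle through `start` cannot be longer than the number of edges.
    for (std::size_t steps = 0; steps <= waitsFor.size(); ++steps) {
        auto it = waitsFor.find(current);
        if (it == waitsFor.end()) return false;  // chain ends at a running thread
        current = it->second;
        if (current == start) return true;
    }
    return false;
}
```

Under this model, `inWaitCycle(reportedDump(), "b70e7b70")` and `inWaitCycle(reportedDump(), "b70bcc00")` both hold, while a thread such as b708cb58 is only blocked behind the cycle.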

How to repeat:
This has only happened once after numerous test runs. Please debug this from the stalled threads output and the thread backtraces that I will upload shortly.

Stacks for bug 39321

Attachment: bug39321.stacks.txt (text/plain), 50.50 KiB.

This deadlock can happen when a TRUNCATE command runs out of memory and has to call Database::forceRecordScavenge().  Any other thread that calls it at the same time can deadlock with the truncating thread, because the truncate path locks Table::syncObject before Database::syncScavenge, whereas most other threads get Database::syncScavenge before Table::syncObject.

Thread 13 
Database::truncateTable(4) (Table::syncObject) ->
...  Table::allocRecord -> Database::forceRecordScavenge ->
Database::retireRecords (Database::syncScavenge)

Thread 9
...  Record::allocRecordData -> Database::forceRecordScavenge ->
Database::retireRecords (Database::syncScavenge)
Table::retireRecords (Table::syncObject)

I think the solution is for Database::truncateTable to also lock Database::syncScavenge before it gets started.  It is already locking these:
  Database::truncateTable(1)      Database::syncSysDDL
  Database::truncateTable(2)      Database::syncTables
  Database::truncateTable(3)      SerialLog::syncSections
  Database::truncateTable(4)      Table::syncObject
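
The proposed ordering can be sketched as follows. This is an illustration only, not the Falcon implementation: the structs are stand-ins, std::recursive_mutex is used as an assumption to model a lock that the owning thread may take again, and the scavenge is triggered unconditionally rather than on memory pressure.

```cpp
#include <mutex>
#include <thread>

// Stand-ins for the Falcon objects; only the two locks in question are modeled.
struct Database {
    std::recursive_mutex syncScavenge;
    int retired = 0;  // guarded by syncScavenge
};

struct Table {
    std::recursive_mutex syncObject;
    int truncated = 0;  // guarded by syncObject
};

// Scavenger path: syncScavenge first, then the table lock.
void retireRecords(Database& db, Table& table) {
    std::lock_guard<std::recursive_mutex> scavenge(db.syncScavenge);
    std::lock_guard<std::recursive_mutex> tableLock(table.syncObject);
    ++db.retired;
}

// Truncate path with the proposed change: take syncScavenge up front so the
// order matches the scavenger, then the table lock.
void truncateTable(Database& db, Table& table) {
    std::lock_guard<std::recursive_mutex> scavenge(db.syncScavenge);
    std::lock_guard<std::recursive_mutex> tableLock(table.syncObject);
    ++table.truncated;
    // Simulates running out of memory and forcing a record scavenge
    // while both locks are already held by this thread.
    retireRecords(db, table);
}

// Runs both paths concurrently; returns the number of scavenge passes.
int runBoth(int iterations) {
    Database db;
    Table table;
    std::thread truncater([&] {
        for (int i = 0; i < iterations; ++i) truncateTable(db, table);
    });
    std::thread scavenger([&] {
        for (int i = 0; i < iterations; ++i) retireRecords(db, table);
    });
    truncater.join();
    scavenger.join();
    return db.retired;
}
```

Because both paths take the locks in the same order, neither thread can hold one lock while waiting for the other to release the second. Swapping the two lock_guard lines in truncateTable would reintroduce the inversion described above.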

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/53743

2824 Kevin Lewis	2008-09-10
      Bug#39321 Add an exclusive lock on Database::syncScavenge in 
      Database::truncateTable before the lock of Table::syncObject
      just in case the truncateTable process has to call 
      Database::forceRecordScavenge.  syncScavenge must be locked 
      before Table::syncObject because the scavenger does it that way.
      
      According to the Deadlock Detector, syncScavenge must also be 
      locked before Database::syncTables.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/53990

2819 Vladislav Vaintroub	2008-09-12
      Bug#39321 - messages in recovery about exceptions from ReadFile.
      Ignore ERROR_HANDLE_EOF coming from ReadFile(). It indicates end of
      file, and the read should just return 0 as it does in the POSIX case.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/54804

2843 Kevin Lewis	2008-09-30
      Bug#39321 Add an exclusive lock on Database::syncScavenge in
      Database::truncateTable before the lock of Table::syncObject
      just in case the truncateTable process has to call
      Database::forceRecordScavenge.  syncScavenge must be locked
      before Table::syncObject because the scavenger does it that way.
      
      According to the Deadlock Predictor (SyncHandler.cpp), 
      syncScavenge must also be locked before Database::syncTables.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/54805

2843 Kevin Lewis	2008-09-30
      Bug#39321 Add an exclusive lock on Database::syncScavenge in
      Database::truncateTable before the lock of Table::syncObject
      just in case the truncateTable process has to call
      Database::forceRecordScavenge.  syncScavenge must be locked
      before Table::syncObject because the scavenger does it that way.
      
      According to the Deadlock Predictor (SyncHandler.cpp), 
      syncScavenge must also be locked before Database::syncTables.

A note has been added to the 6.0.8 changelog: 

When running TRUNCATE on a Falcon table while other threads were also trying to access the same table, a deadlock could occur between the two executing threads.