Bug #38947 UPDATE threads in endless Table::fetchForUpdate loop = livelock
Submitted: 21 Aug 2008 19:14 Modified: 4 Oct 2008 15:06
Reporter: Philip Stoev Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Falcon storage engine Severity:S1 (Critical)
Version:6.0-falcon-team OS:Any
Assigned to: Vladislav Vaintroub CPU Architecture:Any

[21 Aug 2008 19:14] Philip Stoev
Description:
When executing a random query generator test containing insert/update/delete/ddl, Falcon deadlocked as follows:

# 2008-08-21 20:31:23 [24106] Stalled threads
# 2008-08-21 20:31:23 [24106]   Thread 0xb70af898 (-1368036464) sleep=0, grant=0, locks=1, who 0, parent=0xb7086bd8
# 2008-08-21 20:31:23 [24106]     Pending Transaction::commit(2) state 0 (1) syncObject 0xb708df58
# 2008-08-21 20:31:23 [24106]   Thread 0xb721d128 (-1489564784) sleep=1, grant=0, locks=1, who 0, parent=(nil)
# 2008-08-21 20:31:23 [24106]     Pending Transaction::commit(3) state 0 (1) syncObject 0xb708deec
# 2008-08-21 20:31:23 [24106]   Thread 0xb7209f78 (-1477244016) sleep=1, grant=0, locks=1, who 0, parent=(nil)
# 2008-08-21 20:31:23 [24106]     Pending TransactionManager::removeCommittedTransaction state 0 (1) syncObject 0xb708df58
# 2008-08-21 20:31:23 [24106]   Thread 0xb721cee8 (-1477645424) sleep=1, grant=0, locks=1, who 0, parent=(nil)
# 2008-08-21 20:31:23 [24106]     Pending Transaction::commit(2) state 0 (1) syncObject 0xb708df58
# 2008-08-21 20:31:23 [24106]   Thread 0xb721ca18 (-1478247536) sleep=1, grant=0, locks=1, who 0, parent=(nil)
# 2008-08-21 20:31:23 [24106]     Pending TransactionManager::removeCommittedTransaction state 0 (1) syncObject 0xb708df58
# 2008-08-21 20:31:23 [24106]   Thread 0xb7209080 (-1477043312) sleep=1, grant=0, locks=1, who 0, parent=(nil)
# 2008-08-21 20:31:23 [24106]     Pending Transaction::commit(2) state 0 (1) syncObject 0xb708df58
# 2008-08-21 20:31:23 [24106]   Thread 0xb70b9778 (-1478046832) sleep=1, grant=0, locks=1, who 0, parent=(nil)
# 2008-08-21 20:31:23 [24106]     Pending Transaction::commit(2) state 0 (1) syncObject 0xb708df58
# 2008-08-21 20:31:23 [24106]   Thread 0xb7222bd8 (-1478448240) sleep=1, grant=0, locks=1, who 0, parent=(nil)
# 2008-08-21 20:31:23 [24106]     Pending TransactionManager::waitForWriteComplete state 0 (2) syncObject 0xb708df58
# 2008-08-21 20:31:23 [24106]   Thread 0xb7205bf0 (-1477846128) sleep=1, grant=0, locks=1, who 0, parent=(nil)
# 2008-08-21 20:31:23 [24106]     Pending TransactionManager::waitForWriteComplete state 0 (2) syncObject 0xb708df58
# 2008-08-21 20:31:23 [24106] Stalled synchronization objects:
# 2008-08-21 20:31:23 [24106]   SyncObject b708df58: state -1, readers 0, monitor 0, waiters 8
# 2008-08-21 20:31:23 [24106]     Exclusive thread b721d128 (-1489564784), type 1; Transaction::commit(3)
# 2008-08-21 20:31:23 [24106]     Waiting thread b7209f78 (-1477244016), type 1; TransactionManager::removeCommittedTransaction
# 2008-08-21 20:31:23 [24106]     Waiting thread b721cee8 (-1477645424), type 1; Transaction::commit(2)
# 2008-08-21 20:31:23 [24106]     Waiting thread b721ca18 (-1478247536), type 1; TransactionManager::removeCommittedTransaction
# 2008-08-21 20:31:23 [24106]     Waiting thread b7209080 (-1477043312), type 1; Transaction::commit(2)
# 2008-08-21 20:31:23 [24106]     Waiting thread b70b9778 (-1478046832), type 1; Transaction::commit(2)
# 2008-08-21 20:31:23 [24106]     Waiting thread b7222bd8 (-1478448240), type 2; TransactionManager::waitForWriteComplete
# 2008-08-21 20:31:23 [24106]     Waiting thread b7205bf0 (-1477846128), type 2; TransactionManager::waitForWriteComplete
# 2008-08-21 20:31:23 [24106]     Waiting thread b70af898 (-1368036464), type 1; Transaction::commit(2)
# 2008-08-21 20:31:23 [24106]   SyncObject b708deec: state 1, readers 0, monitor 0, waiters 1
# 2008-08-21 20:31:23 [24106]     Waiting thread b721d128 (-1489564784), type 1; Transaction::commit(3)
# 2008-08-21 20:31:23 [24106] ------------------------------------

How to repeat:
This bug was triggered by the falcon_online_alter random query generator test. It may be possible to reproduce it by running the same test manually, with the following command line:

 runall.pl \ 
 --basedir=/export/home/pb2/test/sourcebuilder-build-4307-1219343390.03/mysql-6.0.7-alpha-linux-i686-test \ 
 --vardir=/export/home/pb2/test/sourcebuilder-build-4307-1219343390.03/mysql-6.0.7-alpha-linux-i686-test/vardirs \ 
 --engine=Falcon \ 
 --grammar=conf/falcon_online_alter.yy \ 
 --threads=10 \ 
 --queries=10000
[21 Aug 2008 19:15] Vladislav Vaintroub
as always, full stack trace is much appreciated
[22 Aug 2008 10:25] Vladislav Vaintroub
Are you really sure it is a deadlock? How much CPU the process is using when you see that?  waitForWriteComplete _can_ take long, as it goes in worst case through all committed records on each interation, but it always gives other threads chance to run by unlocking the committedTransactions.syncObject lock between iterations.
[22 Aug 2008 11:58] Philip Stoev
For some reason gdb refuses to dump stack traces on this core file.

The test suite declared a deadlock because no threads moved forward for more than one minute. The tables and the transactions that are used are not that big to justify this behavoir.

If waitForWriteComplete() periodically allowed other threads to proceed, wouldn't that mean that the stalled threads printout would not print anything?
[22 Aug 2008 20:17] Vladislav Vaintroub
>If waitForWriteComplete() periodically allowed other threads to proceed, >wouldn't that mean that the stalled threads printout would not print anything?

I'm not very familiar with the printout. Looking at the code it seems like stalled is invoked after wait for 10 seconds. Which is not exactly a deadlock, but a wait.

My suspicion that it is not a deadlock rather stall is because both
removeCommittedTransaction(), commit() do very little in the under the committedRTransaction lock. waitForWriteComplete does more and even uses another lock syncIndexes. I searched for any case where the order of the committedTransaction/syncIndexes could be reversed and did not find it. It is certainly not commit and not removeCommittedTransaction. But lets wait until some usable callstack/core is there.
[23 Aug 2008 16:22] Philip Stoev
Thread stacks for bug 38947

Attachment: bug38947.stacks.txt (text/plain), 42.94 KiB.

[23 Aug 2008 16:29] Philip Stoev
Please find attached the thread stacks for this bug. Thread 9 and another thread were using up to 1.5 CPU cores.

When logging to table is enabled, this will case a widespread server hang. New connections are not accepted because the loging table is locked.

See also Thread 6 - It appears COM_FIELD_LIST can not be executed because StorageHandler::getStorageConnection is waiting on something. This test does not contain DDL statements, so why was this query blocked?
[23 Aug 2008 17:06] Vladislav Vaintroub
StorageHandler::getStorageConnection()  waits for StorageHandler::rollback
it looks like rollback waits for one of the "hot" lists (activeTransactions, committedTransactions).  I do not see any waitForWriteComplete here and callstacks do not seem to match the bug description.
[23 Aug 2008 17:17] Vladislav Vaintroub
Also note, syncActiveTransactions list is being unlocked in Thread 11
[23 Aug 2008 17:34] Philip Stoev
SHOW PROCESSLIST output. Note that all queries have lifetimes greater than all deadlock timeouts. In addition, the biggest table in this database contains 1000 rows, so those execution times are not normal.

mysql> show processlist\G
*************************** 1. row ***************************
     Id: 1
   User: root
   Host: localhost:53518
     db: test
Command: Sleep
   Time: 201
  State: NULL
   Info: NULL
*************************** 2. row ***************************
     Id: 7
   User: root
   Host: localhost:53524
     db: test
Command: Query
   Time: 147
  State: Searching rows for update
   Info: UPDATE C AS X SET int_key = '62' WHERE X . int_key > '242' LIMIT 3
*************************** 3. row ***************************
     Id: 8
   User: root
   Host: localhost:53525
     db: test
Command: Query
   Time: 147
  State: NULL
   Info: START TRANSACTION
*************************** 4. row ***************************
     Id: 9
   User: root
   Host: localhost:53526
     db: test
Command: Query
   Time: 147
  State: Searching rows for update
   Info: UPDATE B AS X SET int_key = '74' WHERE X . int_key < '145' LIMIT 3
*************************** 5. row ***************************
     Id: 10
   User: root
   Host: localhost:53527
     db: test
Command: Query
   Time: 147
  State: Searching rows for update
   Info: UPDATE B AS X SET int_key = '131' WHERE X . int_key < '49' LIMIT 8
*************************** 6. row ***************************
     Id: 11
   User: root
   Host: localhost:53528
     db: test
Command: Query
   Time: 147
  State: NULL
   Info: ROLLBACK
*************************** 7. row ***************************
     Id: 12
   User: root
   Host: localhost:53529
     db: test
Command: Query
   Time: 147
  State: NULL
   Info: START TRANSACTION
*************************** 8. row ***************************
     Id: 13
   User: root
   Host: localhost:53530
     db: test
Command: Query
   Time: 97
  State: Searching rows for update
   Info: UPDATE B AS X SET int_key = '138' WHERE X . int_key > '210' LIMIT 7
*************************** 9. row ***************************
     Id: 14
   User: root
   Host: localhost:53531
     db: test
Command: Query
   Time: 147
  State: NULL
   Info: ROLLBACK
*************************** 10. row ***************************
     Id: 15
   User: root
   Host: localhost:53532
     db: test
Command: Query
   Time: 147
  State: Searching rows for update
   Info: UPDATE B AS X SET int_key = '180' WHERE X . int_key > '10' LIMIT 8
*************************** 11. row ***************************
     Id: 16
   User: root
   Host: localhost:53533
     db: test
Command: Query
   Time: 97
  State: Searching rows for update
   Info: UPDATE C AS X SET int_key = '176' WHERE X . int_key > '208' LIMIT 7
*************************** 12. row ***************************
     Id: 58
   User: root
   Host: localhost
     db: NULL
Command: Query
   Time: 0
  State: NULL
   Info: show processlist
12 rows in set (0.00 sec)
[23 Aug 2008 18:03] Philip Stoev
Grammar file for bug 38947

Attachment: bug38947.yy (application/octet-stream, text), 950 bytes.

[23 Aug 2008 18:11] Philip Stoev
To reproduce this bug, please pull a fresh copy of mysql-test-extra-6.0 and then run:

$ cd mysql-test-extra-6.0/mysql-test/gentest
$ perl runall.pl \
  --basedir=/build/bzr/6.0-falcon \
  --grammar=conf/falcon_stall.yy \
  --engine=falcon \
  --queries=100000 \
  --threads=10

This should deadlock as described above within 10 minutes. The stalled threads output may be different but will always contain TransactionManager::removeCommittedTransaction .
[23 Aug 2008 18:30] Philip Stoev
What happens is that all UPDATE threads are stuck in the endless loop in Table::fetchForUpdate at Table.cpp line 3472 .

3476                    if (!transaction->needToLock(record))
(gdb) print transaction->needToLock(record)
$13 = true
3485                    State state = transaction->getRelativeState(record, WAIT_IF_ACTIVE);
(gdb) print state
$14 = WasActive
3487                    switch (state) # WasActive causes break
(gdb)
3548                    record->release();
(gdb)
3549                    record = fetch(recordNumber);
(gdb) print recordNumber
$3 = 0
(gdb) print record
$4 = (class Record *) 0xaf2d6448

and the loop repeats.
[23 Aug 2008 18:35] Vladislav Vaintroub
Philip, since there are already so many diferent causes you suspect , could you please also change the bug description accordingly, or create a bug for each of your suspicions.

- removeCommittedTransaction a cause for a deadlock (take a brief look at it)
- threads are doing something so it is not a deadlock
- you seen waitForWritecomplete once , but not anymore.
[23 Aug 2008 18:47] Philip Stoev
Vlad, it is one and the same thing in all my backtraces - endless loop in Table::fetchForUpdate. I have updated the bug title accordingly. For the other things:

* Falcon is unable to detect the livelock, so it reports a stall in removeCommittedTransaction, commit() and rollback() - those are the functions where the lock waits are stable and long enough to cause the stalled threads output to be printed. In waitForTransaction no individual lock is held long enough for the detector to report it.

* waitForWritecomplete() shows up in the first stack trace because there is an ALTER in it. This ALTER causes this function to be called. The function then stalls and is reported by the stall detector. However no ALTERs are required to reproduce this bug -- they are not present in the test case that was uploaded.
[23 Aug 2008 19:04] Vladislav Vaintroub
A comment in Transaction::getRelativeState says about WasActive
return WasActive;			// caller will need to re-fetch

This is what the code is doing as I see it. The lack of timeout in fetchForUpdate() is however highly-suspect.

Kevin , I give this to you for clarification.  It may be so by design, although the design is in this case does not seem to be very user-friendly. Please feel free to assign back to me, but then please explain how this is designed to work.
[25 Aug 2008 2:35] Kevin Lewis
The WasActive flag is returned from Transaction::getRelativeState() after it was necessary to wait for another active transaction to either commit or rollback.  In theory, a transaction with only commit or rollback once, not multiple times.  More than theory, in fact!  

That is why this is in a loop.  Fetch for update is tying to get the most recent record version with a lock record on it.  But another active transaction has a record version in front.  So getRelativeState calls the waitForTransaction.  Once the wait is over, he reads again.  If that active transaction rolled back, then it would succeed.  But if the latest committed record is newer than this transaction, the fetchForUpdate will fail.  But if another transaction beats it to this record again, it will wait again.  Can this poor fetchForUpdate be beeting to the record over and over again?  I doupt it.  

There must be a code path inside getRelativeState or waitForTransaction that it hitting an unhandled condition and misinterpreting the results.  I recomment debugging into those functions when this endless loop occurs.
[25 Aug 2008 15:39] Vladislav Vaintroub
crash1

Attachment: master.err (application/octet-stream, text), 192.08 KiB.

[25 Aug 2008 15:41] Vladislav Vaintroub
crash2

Attachment: master.err (application/octet-stream, text), 4.86 KiB.

[25 Aug 2008 18:33] Vladislav Vaintroub
The test crashes on all my machines (x64 Windows 4 CPUs, x86 Linux 1 CPU) with the stacktraces attached. After I remove "limit" from the grammar, it does not crash but it does not hang on either machine
[26 Aug 2008 12:15] Philip Stoev
New simpler grammar configuration for this bug:

query:
        UPDATE table_name SET `int_key` = digit WHERE where_cond LIMIT digit |
        START TRANSACTION |
        COMMIT ;

where_cond:
        `int_key` < digit | `int_key` > digit ;

table_name:
        B | C | D;

The tables used have 1, 20 and 100 records. With 10 threads, the livelock occurs in less than 1 minute. With 20 threads - in less than 10 seconds.
[26 Aug 2008 13:13] Philip Stoev
New backtraces for bug 38947

Attachment: bug38947-2.backtraces.txt (text/plain), 23.84 KiB.

[26 Aug 2008 13:14] Philip Stoev
Stalled output that goes with the latest backtraces:

Stalled threads
  Thread 0xb71b26f8 (-1366975600) sleep=0, grant=0, locks=1, who 0, parent=0xb7189bc8
    Pending Transaction::purgeTransactions state 0 (1) syncObject 0xb7190e48
  Thread 0xb7317310 (-1475781744) sleep=1, grant=0, locks=1, who 0, parent=(nil)
    Pending Transaction::commit(3) state 0 (1) syncObject 0xb7190ddc
  Thread 0xb7314e48 (-1476183152) sleep=1, grant=0, locks=1, who 0, parent=(nil)
    Pending TransactionManager::removeCommittedTransaction state 0 (1) syncObject 0xb7190e48
  Thread 0xb7316e90 (-1488508016) sleep=1, grant=0, locks=1, who 0, parent=(nil)
    Pending TransactionManager::removeCommittedTransaction state 0 (1) syncObject 0xb7190e48
  Thread 0xb71bcb78 (-1475982448) sleep=1, grant=0, locks=1, who 0, parent=(nil)
    Pending Transaction::commit(2) state 0 (1) syncObject 0xb7190e48
Stalled synchronization objects:
  SyncObject b7190e48: state -1, readers 0, monitor 0, waiters 4
    Exclusive thread b7317310 (-1475781744), type 1; Transaction::commit(3)
    Waiting thread b7314e48 (-1476183152), type 1; TransactionManager::removeCommittedTransaction
    Waiting thread b7316e90 (-1488508016), type 1; TransactionManager::removeCommittedTransaction
    Waiting thread b71bcb78 (-1475982448), type 1; Transaction::commit(2)
    Waiting thread b71b26f8 (-1366975600), type 1; Transaction::purgeTransactions
  SyncObject b7190ddc: state 1, readers 0, monitor 0, waiters 1
    Waiting thread b7317310 (-1475781744), type 1; Transaction::commit(3)
[26 Aug 2008 15:36] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/52579

2799 Vladislav Vaintroub	2008-08-26
      Bug #38947 UPDATE threads in endless Table::fetchForUpdate loop = livelock 
      
      Problem: in Table::fetchForUpdate() there is a small possibility for a 
      race condition -  if record belongs to transaction that is being committed
      currently and the state of this transaction is still Active, but syncActive
      is already unlocked. This causes re-fetch() in the fetch thread without 
      any wait, instead of waiting for falcon_lock_wait_timeout seconds.
      
      This is fixed by moving signaling  waiters via syncActive.unlock(), 
      after transaction state has changed from active to committed.
[28 Aug 2008 4:27] Bugs System
Pushed into 6.0.7-alpha  (revid:vvaintroub@mysql.com-20080826153602-n97hc033a0j0fs98) (version source revid:vvaintroub@mysql.com-20080827144354-lptt2zlg8con9d05) (pib:3)
[8 Sep 2008 21:56] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/53551

2814 Vladislav Vaintroub	2008-09-08
      Bug#38947 -don't signal waiting thread until the very end of rollback(), to avoid races.
[8 Sep 2008 21:58] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/53552

2814 Vladislav Vaintroub	2008-09-08
      Bug#38947 -don't signal waiting thread until the very end of rollback(), to avoid races.
[9 Sep 2008 11:46] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/53596

2815 John H. Embretsen	2008-09-09 [merge]
      Merging local changes into falcon-team branch.
      
      2008-09-09 Removed redundant test case falcon_select_excerpt.
      http://lists.mysql.com/commits/53589
      2008-09-08 Changes to the test falcon_online_index, based on review comments.
      http://lists.mysql.com/commits/53512
      2008-09-04 Tests for WL#4048 - 'Falcon: On-line add attribute, Falcon handler part' (add/drop index).
      http://lists.mysql.com/commits/53284
[9 Sep 2008 13:36] Kevin Lewis
I added an explanation for this latest patch by Vlad into Bug#22165, which is the predecessor to this bug.
[10 Sep 2008 21:42] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/53743

2824 Kevin Lewis	2008-09-10
      Bug# Add an exclusive lock on Database::syncScavenge in 
      Database::truncateTable before the lock of Table::syncObject
      just in case the truncateTable process has to call 
      Database::forceRecordScavenge.  syncScavenge must be locked 
      before Table::syncObject because the scavenger does it that way.
      
      According to the Deadlock Detector, syncScavenge must also be 
      locked before Database::syncTables.
[13 Sep 2008 21:20] Bugs System
Pushed into 6.0.7-alpha  (revid:vvaintroub@mysql.com-20080826153602-n97hc033a0j0fs98) (version source revid:hakan@mysql.com-20080725175322-8wgujj5xuzrjz3ke) (pib:3)
[4 Oct 2008 15:06] Jon Stephens
Documented in the 6.0.7 changelog as follows:

        Falcon could hang trying to perform an UPDATE in one transaction while
        waiting for another transaction to be committed or rolled back.