MySQL Bugs: #38947: UPDATE threads in endless Table::fetchForUpdate loop = livelock

Bug #38947	UPDATE threads in endless Table::fetchForUpdate loop = livelock
Submitted:	21 Aug 2008 19:14	Modified:	4 Oct 2008 15:06
Reporter:	Philip Stoev
Status:	Closed
Category:	MySQL Server: Falcon storage engine	Severity:	S1 (Critical)
Version:	6.0-falcon-team	OS:	Any
Assigned to:	Vladislav Vaintroub	CPU Architecture:	Any

Description:
When executing a random query generator test containing insert/update/delete/ddl, Falcon deadlocked as follows:

# 2008-08-21 20:31:23 [24106] Stalled threads
# 2008-08-21 20:31:23 [24106]   Thread 0xb70af898 (-1368036464) sleep=0, grant=0, locks=1, who 0, parent=0xb7086bd8
# 2008-08-21 20:31:23 [24106]     Pending Transaction::commit(2) state 0 (1) syncObject 0xb708df58
# 2008-08-21 20:31:23 [24106]   Thread 0xb721d128 (-1489564784) sleep=1, grant=0, locks=1, who 0, parent=(nil)
# 2008-08-21 20:31:23 [24106]     Pending Transaction::commit(3) state 0 (1) syncObject 0xb708deec
# 2008-08-21 20:31:23 [24106]   Thread 0xb7209f78 (-1477244016) sleep=1, grant=0, locks=1, who 0, parent=(nil)
# 2008-08-21 20:31:23 [24106]     Pending TransactionManager::removeCommittedTransaction state 0 (1) syncObject 0xb708df58
# 2008-08-21 20:31:23 [24106]   Thread 0xb721cee8 (-1477645424) sleep=1, grant=0, locks=1, who 0, parent=(nil)
# 2008-08-21 20:31:23 [24106]     Pending Transaction::commit(2) state 0 (1) syncObject 0xb708df58
# 2008-08-21 20:31:23 [24106]   Thread 0xb721ca18 (-1478247536) sleep=1, grant=0, locks=1, who 0, parent=(nil)
# 2008-08-21 20:31:23 [24106]     Pending TransactionManager::removeCommittedTransaction state 0 (1) syncObject 0xb708df58
# 2008-08-21 20:31:23 [24106]   Thread 0xb7209080 (-1477043312) sleep=1, grant=0, locks=1, who 0, parent=(nil)
# 2008-08-21 20:31:23 [24106]     Pending Transaction::commit(2) state 0 (1) syncObject 0xb708df58
# 2008-08-21 20:31:23 [24106]   Thread 0xb70b9778 (-1478046832) sleep=1, grant=0, locks=1, who 0, parent=(nil)
# 2008-08-21 20:31:23 [24106]     Pending Transaction::commit(2) state 0 (1) syncObject 0xb708df58
# 2008-08-21 20:31:23 [24106]   Thread 0xb7222bd8 (-1478448240) sleep=1, grant=0, locks=1, who 0, parent=(nil)
# 2008-08-21 20:31:23 [24106]     Pending TransactionManager::waitForWriteComplete state 0 (2) syncObject 0xb708df58
# 2008-08-21 20:31:23 [24106]   Thread 0xb7205bf0 (-1477846128) sleep=1, grant=0, locks=1, who 0, parent=(nil)
# 2008-08-21 20:31:23 [24106]     Pending TransactionManager::waitForWriteComplete state 0 (2) syncObject 0xb708df58
# 2008-08-21 20:31:23 [24106] Stalled synchronization objects:
# 2008-08-21 20:31:23 [24106]   SyncObject b708df58: state -1, readers 0, monitor 0, waiters 8
# 2008-08-21 20:31:23 [24106]     Exclusive thread b721d128 (-1489564784), type 1; Transaction::commit(3)
# 2008-08-21 20:31:23 [24106]     Waiting thread b7209f78 (-1477244016), type 1; TransactionManager::removeCommittedTransaction
# 2008-08-21 20:31:23 [24106]     Waiting thread b721cee8 (-1477645424), type 1; Transaction::commit(2)
# 2008-08-21 20:31:23 [24106]     Waiting thread b721ca18 (-1478247536), type 1; TransactionManager::removeCommittedTransaction
# 2008-08-21 20:31:23 [24106]     Waiting thread b7209080 (-1477043312), type 1; Transaction::commit(2)
# 2008-08-21 20:31:23 [24106]     Waiting thread b70b9778 (-1478046832), type 1; Transaction::commit(2)
# 2008-08-21 20:31:23 [24106]     Waiting thread b7222bd8 (-1478448240), type 2; TransactionManager::waitForWriteComplete
# 2008-08-21 20:31:23 [24106]     Waiting thread b7205bf0 (-1477846128), type 2; TransactionManager::waitForWriteComplete
# 2008-08-21 20:31:23 [24106]     Waiting thread b70af898 (-1368036464), type 1; Transaction::commit(2)
# 2008-08-21 20:31:23 [24106]   SyncObject b708deec: state 1, readers 0, monitor 0, waiters 1
# 2008-08-21 20:31:23 [24106]     Waiting thread b721d128 (-1489564784), type 1; Transaction::commit(3)
# 2008-08-21 20:31:23 [24106] ------------------------------------

How to repeat:
This bug was triggered by the falcon_online_alter random query generator test. It may be possible to reproduce it by running the same test manually, with the following command line:

 runall.pl \
 --basedir=/export/home/pb2/test/sourcebuilder-build-4307-1219343390.03/mysql-6.0.7-alpha-linux-i686-test \
 --vardir=/export/home/pb2/test/sourcebuilder-build-4307-1219343390.03/mysql-6.0.7-alpha-linux-i686-test/vardirs \
 --engine=Falcon \
 --grammar=conf/falcon_online_alter.yy \
 --threads=10 \
 --queries=10000

As always, a full stack trace would be much appreciated.

Are you really sure it is a deadlock? How much CPU is the process using when you see that? waitForWriteComplete _can_ take a long time, as in the worst case it goes through all committed records on each iteration, but it always gives other threads a chance to run by unlocking the committedTransactions.syncObject lock between iterations.

For some reason gdb refuses to dump stack traces on this core file.

The test suite declared a deadlock because no threads moved forward for more than one minute. The tables and transactions involved are not big enough to justify this behavior.

If waitForWriteComplete() periodically allowed other threads to proceed, wouldn't that mean that the stalled threads printout would not print anything?

>If waitForWriteComplete() periodically allowed other threads to proceed,
>wouldn't that mean that the stalled threads printout would not print anything?

I'm not very familiar with the printout. Looking at the code, it seems the stall report is invoked after a wait of 10 seconds, which is not exactly a deadlock, but a wait.

My suspicion that it is a stall rather than a deadlock is because both removeCommittedTransaction() and commit() do very little under the committedTransactions lock. waitForWriteComplete does more and even uses another lock, syncIndexes. I searched for any case where the order of committedTransactions/syncIndexes could be reversed and did not find one. It is certainly not in commit and not in removeCommittedTransaction. But let's wait until some usable callstack/core is there.

Thread stacks for bug 38947

Attachment: bug38947.stacks.txt (text/plain), 42.94 KiB.

Please find attached the thread stacks for this bug. Thread 9 and another thread were using up to 1.5 CPU cores.

When logging to table is enabled, this will cause a widespread server hang. New connections are not accepted because the logging table is locked.

See also Thread 6 - It appears COM_FIELD_LIST can not be executed because StorageHandler::getStorageConnection is waiting on something. This test does not contain DDL statements, so why was this query blocked?

StorageHandler::getStorageConnection() waits for StorageHandler::rollback.
It looks like rollback waits for one of the "hot" lists (activeTransactions, committedTransactions). I do not see any waitForWriteComplete here, and the callstacks do not seem to match the bug description.

Also note, syncActiveTransactions list is being unlocked in Thread 11

SHOW PROCESSLIST output. Note that all queries have lifetimes greater than all deadlock timeouts. In addition, the biggest table in this database contains 1000 rows, so those execution times are not normal.

mysql> show processlist\G
*************************** 1. row ***************************
     Id: 1
   User: root
   Host: localhost:53518
     db: test
Command: Sleep
   Time: 201
  State: NULL
   Info: NULL
*************************** 2. row ***************************
     Id: 7
   User: root
   Host: localhost:53524
     db: test
Command: Query
   Time: 147
  State: Searching rows for update
   Info: UPDATE C AS X SET int_key = '62' WHERE X . int_key > '242' LIMIT 3
*************************** 3. row ***************************
     Id: 8
   User: root
   Host: localhost:53525
     db: test
Command: Query
   Time: 147
  State: NULL
   Info: START TRANSACTION
*************************** 4. row ***************************
     Id: 9
   User: root
   Host: localhost:53526
     db: test
Command: Query
   Time: 147
  State: Searching rows for update
   Info: UPDATE B AS X SET int_key = '74' WHERE X . int_key < '145' LIMIT 3
*************************** 5. row ***************************
     Id: 10
   User: root
   Host: localhost:53527
     db: test
Command: Query
   Time: 147
  State: Searching rows for update
   Info: UPDATE B AS X SET int_key = '131' WHERE X . int_key < '49' LIMIT 8
*************************** 6. row ***************************
     Id: 11
   User: root
   Host: localhost:53528
     db: test
Command: Query
   Time: 147
  State: NULL
   Info: ROLLBACK
*************************** 7. row ***************************
     Id: 12
   User: root
   Host: localhost:53529
     db: test
Command: Query
   Time: 147
  State: NULL
   Info: START TRANSACTION
*************************** 8. row ***************************
     Id: 13
   User: root
   Host: localhost:53530
     db: test
Command: Query
   Time: 97
  State: Searching rows for update
   Info: UPDATE B AS X SET int_key = '138' WHERE X . int_key > '210' LIMIT 7
*************************** 9. row ***************************
     Id: 14
   User: root
   Host: localhost:53531
     db: test
Command: Query
   Time: 147
  State: NULL
   Info: ROLLBACK
*************************** 10. row ***************************
     Id: 15
   User: root
   Host: localhost:53532
     db: test
Command: Query
   Time: 147
  State: Searching rows for update
   Info: UPDATE B AS X SET int_key = '180' WHERE X . int_key > '10' LIMIT 8
*************************** 11. row ***************************
     Id: 16
   User: root
   Host: localhost:53533
     db: test
Command: Query
   Time: 97
  State: Searching rows for update
   Info: UPDATE C AS X SET int_key = '176' WHERE X . int_key > '208' LIMIT 7
*************************** 12. row ***************************
     Id: 58
   User: root
   Host: localhost
     db: NULL
Command: Query
   Time: 0
  State: NULL
   Info: show processlist
12 rows in set (0.00 sec)

Grammar file for bug 38947

Attachment: bug38947.yy (application/octet-stream, text), 950 bytes.

To reproduce this bug, please pull a fresh copy of mysql-test-extra-6.0 and then run:

$ cd mysql-test-extra-6.0/mysql-test/gentest
$ perl runall.pl \
  --basedir=/build/bzr/6.0-falcon \
  --grammar=conf/falcon_stall.yy \
  --engine=falcon \
  --queries=100000 \
  --threads=10

This should deadlock as described above within 10 minutes. The stalled threads output may be different but will always contain TransactionManager::removeCommittedTransaction.

What happens is that all UPDATE threads are stuck in the endless loop in Table::fetchForUpdate at Table.cpp line 3472.

3476                    if (!transaction->needToLock(record))
(gdb) print transaction->needToLock(record)
$13 = true
3485                    State state = transaction->getRelativeState(record, WAIT_IF_ACTIVE);
(gdb) print state
$14 = WasActive
3487                    switch (state) # WasActive causes break
(gdb)
3548                    record->release();
(gdb)
3549                    record = fetch(recordNumber);
(gdb) print recordNumber
$3 = 0
(gdb) print record
$4 = (class Record *) 0xaf2d6448

and the loop repeats.

Philip, since there are already so many different causes you suspect, could you please also change the bug description accordingly, or create a bug for each of your suspicions?

- removeCommittedTransaction is a cause of a deadlock (take a brief look at it)
- threads are doing something, so it is not a deadlock
- you have seen waitForWriteComplete once, but not anymore.

Vlad, it is one and the same thing in all my backtraces - endless loop in Table::fetchForUpdate. I have updated the bug title accordingly. For the other things:

* Falcon is unable to detect the livelock, so it reports a stall in removeCommittedTransaction, commit() and rollback() - those are the functions where the lock waits are stable and long enough to cause the stalled threads output to be printed. In waitForTransaction no individual lock is held long enough for the detector to report it.

* waitForWriteComplete() shows up in the first stack trace because there is an ALTER in it. This ALTER causes this function to be called. The function then stalls and is reported by the stall detector. However, no ALTERs are required to reproduce this bug -- they are not present in the test case that was uploaded.

A comment in Transaction::getRelativeState says of WasActive:
return WasActive;			// caller will need to re-fetch

This is what the code is doing as I see it. The lack of a timeout in fetchForUpdate() is, however, highly suspect.

Kevin, I give this to you for clarification. It may be so by design, although the design in this case does not seem to be very user-friendly. Please feel free to assign it back to me, but then please explain how this is designed to work.

The WasActive flag is returned from Transaction::getRelativeState() after it was necessary to wait for another active transaction to either commit or roll back. In theory, a transaction will only commit or roll back once, not multiple times. More than theory, in fact!

That is why this is in a loop. Fetch for update is trying to get the most recent record version with a lock record on it. But another active transaction has a record version in front. So getRelativeState calls waitForTransaction. Once the wait is over, it reads again. If that active transaction rolled back, then it would succeed. But if the latest committed record is newer than this transaction, the fetchForUpdate will fail. But if another transaction beats it to this record again, it will wait again. Can this poor fetchForUpdate be beaten to the record over and over again? I doubt it.

There must be a code path inside getRelativeState or waitForTransaction that is hitting an unhandled condition and misinterpreting the results. I recommend debugging into those functions when this endless loop occurs.
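
The retry behaviour described above can be modelled in a few lines. This is only a sketch under stated assumptions, not Falcon's code: PassOutcome, FetchResult, fetchForUpdateModel, probe and maxRetries are hypothetical names, and the retry cap in particular is an illustrative guard that the loop discussed in this report does not have.

```cpp
#include <cassert>
#include <functional>

// Hypothetical outcome of one pass through the loop (not Falcon's real enum).
enum class PassOutcome {
    BlockerRolledBack,  // the blocking transaction rolled back: lock succeeds
    NewerCommitted,     // a newer committed version exists: the update fails
    WasActive           // had to wait on another transaction: re-fetch
};

enum class FetchResult { Locked, UpdateConflict, GaveUp };

// 'probe' stands in for "fetch the record, then ask for its relative state".
// 'maxRetries' is an assumption added for illustration only.
FetchResult fetchForUpdateModel(const std::function<PassOutcome()>& probe,
                                int maxRetries)
{
    assert(maxRetries > 0);
    for (int attempt = 0; attempt < maxRetries; ++attempt) {
        switch (probe()) {
        case PassOutcome::BlockerRolledBack:
            return FetchResult::Locked;
        case PassOutcome::NewerCommitted:
            return FetchResult::UpdateConflict;
        case PassOutcome::WasActive:
            break;  // leaves the switch only; the for loop re-fetches
        }
    }
    return FetchResult::GaveUp;  // every pass reported WasActive
}
```

With a probe that always reports WasActive, the model returns GaveUp after maxRetries passes. Without such a cap, and without a real wait inside the probe, the same loop would spin as in the gdb session above.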

crash1

Attachment: master.err (application/octet-stream, text), 192.08 KiB.

crash2

Attachment: master.err (application/octet-stream, text), 4.86 KiB.

The test crashes on all my machines (x64 Windows 4 CPUs, x86 Linux 1 CPU) with the stacktraces attached. After I remove "limit" from the grammar, it neither crashes nor hangs on either machine.

New simpler grammar configuration for this bug:

query:
        UPDATE table_name SET `int_key` = digit WHERE where_cond LIMIT digit |
        START TRANSACTION |
        COMMIT ;

where_cond:
        `int_key` < digit | `int_key` > digit ;

table_name:
        B | C | D;

The tables used have 1, 20 and 100 records. With 10 threads, the livelock occurs in less than 1 minute. With 20 threads - in less than 10 seconds.

New backtraces for bug 38947

Attachment: bug38947-2.backtraces.txt (text/plain), 23.84 KiB.

Stalled output that goes with the latest backtraces:

Stalled threads
  Thread 0xb71b26f8 (-1366975600) sleep=0, grant=0, locks=1, who 0, parent=0xb7189bc8
    Pending Transaction::purgeTransactions state 0 (1) syncObject 0xb7190e48
  Thread 0xb7317310 (-1475781744) sleep=1, grant=0, locks=1, who 0, parent=(nil)
    Pending Transaction::commit(3) state 0 (1) syncObject 0xb7190ddc
  Thread 0xb7314e48 (-1476183152) sleep=1, grant=0, locks=1, who 0, parent=(nil)
    Pending TransactionManager::removeCommittedTransaction state 0 (1) syncObject 0xb7190e48
  Thread 0xb7316e90 (-1488508016) sleep=1, grant=0, locks=1, who 0, parent=(nil)
    Pending TransactionManager::removeCommittedTransaction state 0 (1) syncObject 0xb7190e48
  Thread 0xb71bcb78 (-1475982448) sleep=1, grant=0, locks=1, who 0, parent=(nil)
    Pending Transaction::commit(2) state 0 (1) syncObject 0xb7190e48
Stalled synchronization objects:
  SyncObject b7190e48: state -1, readers 0, monitor 0, waiters 4
    Exclusive thread b7317310 (-1475781744), type 1; Transaction::commit(3)
    Waiting thread b7314e48 (-1476183152), type 1; TransactionManager::removeCommittedTransaction
    Waiting thread b7316e90 (-1488508016), type 1; TransactionManager::removeCommittedTransaction
    Waiting thread b71bcb78 (-1475982448), type 1; Transaction::commit(2)
    Waiting thread b71b26f8 (-1366975600), type 1; Transaction::purgeTransactions
  SyncObject b7190ddc: state 1, readers 0, monitor 0, waiters 1
    Waiting thread b7317310 (-1475781744), type 1; Transaction::commit(3)

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/52579

2799 Vladislav Vaintroub	2008-08-26
      Bug #38947 UPDATE threads in endless Table::fetchForUpdate loop = livelock 
      
      Problem: in Table::fetchForUpdate() there is a small possibility for a 
      race condition -  if record belongs to transaction that is being committed
      currently and the state of this transaction is still Active, but syncActive
      is already unlocked. This causes re-fetch() in the fetch thread without 
      any wait, instead of waiting for falcon_lock_wait_timeout seconds.
      
      This is fixed by moving signaling  waiters via syncActive.unlock(), 
      after transaction state has changed from active to committed.
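
      The ordering problem in this commit message can be illustrated with a
      small single-threaded model. It is a sketch under assumptions, not the
      patch itself: TxModel, onUnlock, commitSignalFirst, commitStateFirst and
      observedOnWake are hypothetical names, and the callback merely replays
      what a woken waiter would see instead of running a real second thread.

```cpp
#include <cassert>
#include <functional>

enum class TxState { Active, Committed };

// Hypothetical stand-in for a transaction with waiters on "syncActive".
struct TxModel {
    TxState state = TxState::Active;
    std::function<void()> onUnlock;  // what a woken waiter does first
    void unlockSyncActive() { if (onUnlock) onUnlock(); }
};

// Order described as the problem: waiters are released while the
// transaction still looks Active.
void commitSignalFirst(TxModel& tx)
{
    tx.unlockSyncActive();
    tx.state = TxState::Committed;
}

// Order described as the fix: change the state, then release waiters.
void commitStateFirst(TxModel& tx)
{
    tx.state = TxState::Committed;
    tx.unlockSyncActive();
}

// Returns the state a waiter observes at the moment it is released.
TxState observedOnWake(void (*commit)(TxModel&))
{
    assert(commit != nullptr);
    TxModel tx;
    TxState seen = TxState::Active;
    tx.onUnlock = [&tx, &seen] { seen = tx.state; };
    commit(tx);
    return seen;
}
```

      In this model a waiter released by commitSignalFirst still sees Active,
      so it would go around the fetch loop again without waiting, while a
      waiter released by commitStateFirst sees Committed.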

Pushed into 6.0.7-alpha  (revid:vvaintroub@mysql.com-20080826153602-n97hc033a0j0fs98) (version source revid:vvaintroub@mysql.com-20080827144354-lptt2zlg8con9d05) (pib:3)

  http://lists.mysql.com/commits/53551

2814 Vladislav Vaintroub	2008-09-08
      Bug#38947 -don't signal waiting thread until the very end of rollback(), to avoid races.

  http://lists.mysql.com/commits/53552

2814 Vladislav Vaintroub	2008-09-08
      Bug#38947 -don't signal waiting thread until the very end of rollback(), to avoid races.

  http://lists.mysql.com/commits/53596

2815 John H. Embretsen	2008-09-09 [merge]
      Merging local changes into falcon-team branch.
      
      2008-09-09 Removed redundant test case falcon_select_excerpt.
      http://lists.mysql.com/commits/53589
      2008-09-08 Changes to the test falcon_online_index, based on review comments.
      http://lists.mysql.com/commits/53512
      2008-09-04 Tests for WL#4048 - 'Falcon: On-line add attribute, Falcon handler part' (add/drop index).
      http://lists.mysql.com/commits/53284

I added an explanation for this latest patch by Vlad into Bug#22165, which is the predecessor to this bug.

  http://lists.mysql.com/commits/53743

2824 Kevin Lewis	2008-09-10
      Bug# Add an exclusive lock on Database::syncScavenge in 
      Database::truncateTable before the lock of Table::syncObject
      just in case the truncateTable process has to call 
      Database::forceRecordScavenge.  syncScavenge must be locked 
      before Table::syncObject because the scavenger does it that way.
      
      According to the Deadlock Detector, syncScavenge must also be 
      locked before Database::syncTables.
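
      The ordering rule in this commit message can be expressed as a small
      check. This is a sketch under assumptions: the Lock enum, respectsOrder
      and requireOrder are hypothetical names, not Falcon's Deadlock Detector,
      and only the two constraints stated above are encoded (no order between
      syncTables and Table::syncObject is assumed).

```cpp
#include <cassert>
#include <vector>

// Hypothetical labels for the three locks named in the commit message.
enum class Lock { SyncScavenge, SyncTables, TableSyncObject };

// Returns true if SyncScavenge is never taken after SyncTables or
// TableSyncObject in the given acquisition sequence.
bool respectsOrder(const std::vector<Lock>& acquired)
{
    bool laterLockHeld = false;
    for (Lock lock : acquired) {
        if (lock == Lock::SyncScavenge) {
            if (laterLockHeld)
                return false;  // SyncScavenge taken too late
        } else {
            laterLockHeld = true;
        }
    }
    return true;
}

// Debug-build guard a caller could place before taking the locks.
void requireOrder(const std::vector<Lock>& acquired)
{
    assert(respectsOrder(acquired));
}
```

      A sequence that takes SyncScavenge first passes the check; one that
      takes Table::syncObject and then SyncScavenge is rejected.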

Pushed into 6.0.7-alpha  (revid:vvaintroub@mysql.com-20080826153602-n97hc033a0j0fs98) (version source revid:hakan@mysql.com-20080725175322-8wgujj5xuzrjz3ke) (pib:3)

Documented in the 6.0.7 changelog as follows:

        Falcon could hang trying to perform an UPDATE in one transaction while
        waiting for another transaction to be committed or rolled back.