Bug #38947 | UPDATE threads in endless Table::fetchForUpdate loop = livelock | ||
---|---|---|---|
Submitted: | 21 Aug 2008 19:14 | Modified: | 4 Oct 2008 15:06 |
Reporter: | Philip Stoev | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Server: Falcon storage engine | Severity: | S1 (Critical) |
Version: | 6.0-falcon-team | OS: | Any |
Assigned to: | Vladislav Vaintroub | CPU Architecture: | Any |
[21 Aug 2008 19:14]
Philip Stoev
[21 Aug 2008 19:15]
Vladislav Vaintroub
as always, full stack trace is much appreciated
[22 Aug 2008 10:25]
Vladislav Vaintroub
Are you really sure it is a deadlock? How much CPU the process is using when you see that? waitForWriteComplete _can_ take long, as it goes in worst case through all committed records on each interation, but it always gives other threads chance to run by unlocking the committedTransactions.syncObject lock between iterations.
[22 Aug 2008 11:58]
Philip Stoev
For some reason gdb refuses to dump stack traces on this core file. The test suite declared a deadlock because no threads moved forward for more than one minute. The tables and the transactions that are used are not that big to justify this behavoir. If waitForWriteComplete() periodically allowed other threads to proceed, wouldn't that mean that the stalled threads printout would not print anything?
[22 Aug 2008 20:17]
Vladislav Vaintroub
>If waitForWriteComplete() periodically allowed other threads to proceed, >wouldn't that mean that the stalled threads printout would not print anything? I'm not very familiar with the printout. Looking at the code it seems like stalled is invoked after wait for 10 seconds. Which is not exactly a deadlock, but a wait. My suspicion that it is not a deadlock rather stall is because both removeCommittedTransaction(), commit() do very little in the under the committedRTransaction lock. waitForWriteComplete does more and even uses another lock syncIndexes. I searched for any case where the order of the committedTransaction/syncIndexes could be reversed and did not find it. It is certainly not commit and not removeCommittedTransaction. But lets wait until some usable callstack/core is there.
[23 Aug 2008 16:22]
Philip Stoev
Thread stacks for bug 38947
Attachment: bug38947.stacks.txt (text/plain), 42.94 KiB.
[23 Aug 2008 16:29]
Philip Stoev
Please find attached the thread stacks for this bug. Thread 9 and another thread were using up to 1.5 CPU cores. When logging to table is enabled, this will case a widespread server hang. New connections are not accepted because the loging table is locked. See also Thread 6 - It appears COM_FIELD_LIST can not be executed because StorageHandler::getStorageConnection is waiting on something. This test does not contain DDL statements, so why was this query blocked?
[23 Aug 2008 17:06]
Vladislav Vaintroub
StorageHandler::getStorageConnection() waits for StorageHandler::rollback it looks like rollback waits for one of the "hot" lists (activeTransactions, committedTransactions). I do not see any waitForWriteComplete here and callstacks do not seem to match the bug description.
[23 Aug 2008 17:17]
Vladislav Vaintroub
Also note, syncActiveTransactions list is being unlocked in Thread 11
[23 Aug 2008 17:34]
Philip Stoev
SHOW PROCESSLIST output. Note that all queries have lifetimes greater than all deadlock timeouts. In addition, the biggest table in this database contains 1000 rows, so those execution times are not normal. mysql> show processlist\G *************************** 1. row *************************** Id: 1 User: root Host: localhost:53518 db: test Command: Sleep Time: 201 State: NULL Info: NULL *************************** 2. row *************************** Id: 7 User: root Host: localhost:53524 db: test Command: Query Time: 147 State: Searching rows for update Info: UPDATE C AS X SET int_key = '62' WHERE X . int_key > '242' LIMIT 3 *************************** 3. row *************************** Id: 8 User: root Host: localhost:53525 db: test Command: Query Time: 147 State: NULL Info: START TRANSACTION *************************** 4. row *************************** Id: 9 User: root Host: localhost:53526 db: test Command: Query Time: 147 State: Searching rows for update Info: UPDATE B AS X SET int_key = '74' WHERE X . int_key < '145' LIMIT 3 *************************** 5. row *************************** Id: 10 User: root Host: localhost:53527 db: test Command: Query Time: 147 State: Searching rows for update Info: UPDATE B AS X SET int_key = '131' WHERE X . int_key < '49' LIMIT 8 *************************** 6. row *************************** Id: 11 User: root Host: localhost:53528 db: test Command: Query Time: 147 State: NULL Info: ROLLBACK *************************** 7. row *************************** Id: 12 User: root Host: localhost:53529 db: test Command: Query Time: 147 State: NULL Info: START TRANSACTION *************************** 8. row *************************** Id: 13 User: root Host: localhost:53530 db: test Command: Query Time: 97 State: Searching rows for update Info: UPDATE B AS X SET int_key = '138' WHERE X . int_key > '210' LIMIT 7 *************************** 9. row *************************** Id: 14 User: root Host: localhost:53531 db: test Command: Query Time: 147 State: NULL Info: ROLLBACK *************************** 10. row *************************** Id: 15 User: root Host: localhost:53532 db: test Command: Query Time: 147 State: Searching rows for update Info: UPDATE B AS X SET int_key = '180' WHERE X . int_key > '10' LIMIT 8 *************************** 11. row *************************** Id: 16 User: root Host: localhost:53533 db: test Command: Query Time: 97 State: Searching rows for update Info: UPDATE C AS X SET int_key = '176' WHERE X . int_key > '208' LIMIT 7 *************************** 12. row *************************** Id: 58 User: root Host: localhost db: NULL Command: Query Time: 0 State: NULL Info: show processlist 12 rows in set (0.00 sec)
[23 Aug 2008 18:03]
Philip Stoev
Grammar file for bug 38947
Attachment: bug38947.yy (application/octet-stream, text), 950 bytes.
[23 Aug 2008 18:11]
Philip Stoev
To reproduce this bug, please pull a fresh copy of mysql-test-extra-6.0 and then run: $ cd mysql-test-extra-6.0/mysql-test/gentest $ perl runall.pl \ --basedir=/build/bzr/6.0-falcon \ --grammar=conf/falcon_stall.yy \ --engine=falcon \ --queries=100000 \ --threads=10 This should deadlock as described above within 10 minutes. The stalled threads output may be different but will always contain TransactionManager::removeCommittedTransaction .
[23 Aug 2008 18:30]
Philip Stoev
What happens is that all UPDATE threads are stuck in the endless loop in Table::fetchForUpdate at Table.cpp line 3472 . 3476 if (!transaction->needToLock(record)) (gdb) print transaction->needToLock(record) $13 = true 3485 State state = transaction->getRelativeState(record, WAIT_IF_ACTIVE); (gdb) print state $14 = WasActive 3487 switch (state) # WasActive causes break (gdb) 3548 record->release(); (gdb) 3549 record = fetch(recordNumber); (gdb) print recordNumber $3 = 0 (gdb) print record $4 = (class Record *) 0xaf2d6448 and the loop repeats.
[23 Aug 2008 18:35]
Vladislav Vaintroub
Philip, since there are already so many diferent causes you suspect , could you please also change the bug description accordingly, or create a bug for each of your suspicions. - removeCommittedTransaction a cause for a deadlock (take a brief look at it) - threads are doing something so it is not a deadlock - you seen waitForWritecomplete once , but not anymore.
[23 Aug 2008 18:47]
Philip Stoev
Vlad, it is one and the same thing in all my backtraces - endless loop in Table::fetchForUpdate. I have updated the bug title accordingly. For the other things: * Falcon is unable to detect the livelock, so it reports a stall in removeCommittedTransaction, commit() and rollback() - those are the functions where the lock waits are stable and long enough to cause the stalled threads output to be printed. In waitForTransaction no individual lock is held long enough for the detector to report it. * waitForWritecomplete() shows up in the first stack trace because there is an ALTER in it. This ALTER causes this function to be called. The function then stalls and is reported by the stall detector. However no ALTERs are required to reproduce this bug -- they are not present in the test case that was uploaded.
[23 Aug 2008 19:04]
Vladislav Vaintroub
A comment in Transaction::getRelativeState says about WasActive return WasActive; // caller will need to re-fetch This is what the code is doing as I see it. The lack of timeout in fetchForUpdate() is however highly-suspect. Kevin , I give this to you for clarification. It may be so by design, although the design is in this case does not seem to be very user-friendly. Please feel free to assign back to me, but then please explain how this is designed to work.
[25 Aug 2008 2:35]
Kevin Lewis
The WasActive flag is returned from Transaction::getRelativeState() after it was necessary to wait for another active transaction to either commit or rollback. In theory, a transaction with only commit or rollback once, not multiple times. More than theory, in fact! That is why this is in a loop. Fetch for update is tying to get the most recent record version with a lock record on it. But another active transaction has a record version in front. So getRelativeState calls the waitForTransaction. Once the wait is over, he reads again. If that active transaction rolled back, then it would succeed. But if the latest committed record is newer than this transaction, the fetchForUpdate will fail. But if another transaction beats it to this record again, it will wait again. Can this poor fetchForUpdate be beeting to the record over and over again? I doupt it. There must be a code path inside getRelativeState or waitForTransaction that it hitting an unhandled condition and misinterpreting the results. I recomment debugging into those functions when this endless loop occurs.
[25 Aug 2008 15:39]
Vladislav Vaintroub
crash1
Attachment: master.err (application/octet-stream, text), 192.08 KiB.
[25 Aug 2008 15:41]
Vladislav Vaintroub
crash2
Attachment: master.err (application/octet-stream, text), 4.86 KiB.
[25 Aug 2008 18:33]
Vladislav Vaintroub
The test crashes on all my machines (x64 Windows 4 CPUs, x86 Linux 1 CPU) with the stacktraces attached. After I remove "limit" from the grammar, it does not crash but it does not hang on either machine
[26 Aug 2008 12:15]
Philip Stoev
New simpler grammar configuration for this bug: query: UPDATE table_name SET `int_key` = digit WHERE where_cond LIMIT digit | START TRANSACTION | COMMIT ; where_cond: `int_key` < digit | `int_key` > digit ; table_name: B | C | D; The tables used have 1, 20 and 100 records. With 10 threads, the livelock occurs in less than 1 minute. With 20 threads - in less than 10 seconds.
[26 Aug 2008 13:13]
Philip Stoev
New backtraces for bug 38947
Attachment: bug38947-2.backtraces.txt (text/plain), 23.84 KiB.
[26 Aug 2008 13:14]
Philip Stoev
Stalled output that goes with the latest backtraces: Stalled threads Thread 0xb71b26f8 (-1366975600) sleep=0, grant=0, locks=1, who 0, parent=0xb7189bc8 Pending Transaction::purgeTransactions state 0 (1) syncObject 0xb7190e48 Thread 0xb7317310 (-1475781744) sleep=1, grant=0, locks=1, who 0, parent=(nil) Pending Transaction::commit(3) state 0 (1) syncObject 0xb7190ddc Thread 0xb7314e48 (-1476183152) sleep=1, grant=0, locks=1, who 0, parent=(nil) Pending TransactionManager::removeCommittedTransaction state 0 (1) syncObject 0xb7190e48 Thread 0xb7316e90 (-1488508016) sleep=1, grant=0, locks=1, who 0, parent=(nil) Pending TransactionManager::removeCommittedTransaction state 0 (1) syncObject 0xb7190e48 Thread 0xb71bcb78 (-1475982448) sleep=1, grant=0, locks=1, who 0, parent=(nil) Pending Transaction::commit(2) state 0 (1) syncObject 0xb7190e48 Stalled synchronization objects: SyncObject b7190e48: state -1, readers 0, monitor 0, waiters 4 Exclusive thread b7317310 (-1475781744), type 1; Transaction::commit(3) Waiting thread b7314e48 (-1476183152), type 1; TransactionManager::removeCommittedTransaction Waiting thread b7316e90 (-1488508016), type 1; TransactionManager::removeCommittedTransaction Waiting thread b71bcb78 (-1475982448), type 1; Transaction::commit(2) Waiting thread b71b26f8 (-1366975600), type 1; Transaction::purgeTransactions SyncObject b7190ddc: state 1, readers 0, monitor 0, waiters 1 Waiting thread b7317310 (-1475781744), type 1; Transaction::commit(3)
[26 Aug 2008 15:36]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/52579 2799 Vladislav Vaintroub 2008-08-26 Bug #38947 UPDATE threads in endless Table::fetchForUpdate loop = livelock Problem: in Table::fetchForUpdate() there is a small possibility for a race condition - if record belongs to transaction that is being committed currently and the state of this transaction is still Active, but syncActive is already unlocked. This causes re-fetch() in the fetch thread without any wait, instead of waiting for falcon_lock_wait_timeout seconds. This is fixed by moving signaling waiters via syncActive.unlock(), after transaction state has changed from active to committed.
[28 Aug 2008 4:27]
Bugs System
Pushed into 6.0.7-alpha (revid:vvaintroub@mysql.com-20080826153602-n97hc033a0j0fs98) (version source revid:vvaintroub@mysql.com-20080827144354-lptt2zlg8con9d05) (pib:3)
[8 Sep 2008 21:56]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/53551 2814 Vladislav Vaintroub 2008-09-08 Bug#38947 -don't signal waiting thread until the very end of rollback(), to avoid races.
[8 Sep 2008 21:58]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/53552 2814 Vladislav Vaintroub 2008-09-08 Bug#38947 -don't signal waiting thread until the very end of rollback(), to avoid races.
[9 Sep 2008 11:46]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/53596 2815 John H. Embretsen 2008-09-09 [merge] Merging local changes into falcon-team branch. 2008-09-09 Removed redundant test case falcon_select_excerpt. http://lists.mysql.com/commits/53589 2008-09-08 Changes to the test falcon_online_index, based on review comments. http://lists.mysql.com/commits/53512 2008-09-04 Tests for WL#4048 - 'Falcon: On-line add attribute, Falcon handler part' (add/drop index). http://lists.mysql.com/commits/53284
[9 Sep 2008 13:36]
Kevin Lewis
I added an explanation for this latest patch by Vlad into Bug#22165, which is the predecessor to this bug.
[10 Sep 2008 21:42]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/53743 2824 Kevin Lewis 2008-09-10 Bug# Add an exclusive lock on Database::syncScavenge in Database::truncateTable before the lock of Table::syncObject just in case the truncateTable process has to call Database::forceRecordScavenge. syncScavenge must be locked before Table::syncObject because the scavenger does it that way. According to the Deadlock Detector, syncScavenge must also be locked before Database::syncTables.
[13 Sep 2008 21:20]
Bugs System
Pushed into 6.0.7-alpha (revid:vvaintroub@mysql.com-20080826153602-n97hc033a0j0fs98) (version source revid:hakan@mysql.com-20080725175322-8wgujj5xuzrjz3ke) (pib:3)
[4 Oct 2008 15:06]
Jon Stephens
Documented in the 6.0.7 changelog as follows: Falcon could hang trying to perform an UPDATE in one transaction while waiting for another transaction to be committed or rolled back.