Bug #37251 | Livelock between UPDATE and DELETE threads | ||
---|---|---|---|
Submitted: | 6 Jun 2008 19:26 | Modified: | 8 Jan 2009 10:34 |
Reporter: | Philip Stoev | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Server: Falcon storage engine | Severity: | S2 (Serious) |
Version: | 6.0-falcon-team | OS: | Any |
Assigned to: | Vladislav Vaintroub | CPU Architecture: | Any |
[6 Jun 2008 19:26]
Philip Stoev
[6 Jun 2008 19:27]
Philip Stoev
* The backtraces are: #0 0x08419d25 in Transaction::getRelativeState (this=0xa7da4b78, record=0xa903a0f8, flags=<value optimized out>) at Transaction.cpp:870 #1 0x0840e95e in Table::fetchForUpdate (this=0xb74049a8, transaction=0xa7da4b78, source=0xa903a0f8, usingIndex=true) at Table.cpp:3474 #2 0x083fbdf6 in StorageDatabase::nextIndexed (this=0xb709a118, storageTable=0xb62057b8, recordBitmap=0xa7e0a860, recordNumber=7042, lockForUpdate=true) at StorageDatabase.cpp:426 #3 0x08400cd4 in StorageTable::nextIndexed (this=0x1, recordNumber=7041, lockForUpdate=<value optimized out>) at StorageTable.cpp:182 #4 0x083f6616 in StorageInterface::index_next (this=0xa834da8, buf=0xa834f70 "Щ") at ha_falcon.cpp:1581 #5 0x083f37bf in StorageInterface::index_next_same (this=0xa834da8, buf=0xa834f70 "Щ", key=0xa5c16a8 "", key_len=5) at ha_falcon.cpp:1634 #6 0x08349d18 in handler::read_range_next (this=0xa834da8) at handler.cc:4943 #7 0x083463f4 in handler::multi_range_read_next (this=0xa834da8, range_info=0xa9839abc) at handler.cc:4241 #8 0x08329cc9 in QUICK_RANGE_SELECT::get_next (this=0xa5b76d0) at opt_range.cc:8508 #9 0x083424bd in rr_quick (info=0xa9839d44) at records.cc:298 #10 0x08344872 in filesort (thd=0xa9b3c9f8, table=0xa647820, sortorder=0xa5b41e8, s_length=1, select=0xa5b4108, max_rows=18446744073709551615, sort_positions=true, examined_rows=0xa983a190) at filesort.cc:520 #11 0x082eed83 in mysql_delete (thd=0xa9b3c9f8, table_list=0xa5b3bc8, conds=0xa5b3ec8, order=0xa9b3de68, limit=2, options=<value optimized out>, reset_auto_increment=false) at sql_delete.cc:222 #12 0x08267b8a in mysql_execute_command (thd=0xa9b3c9f8) at sql_parse.cc:3082 #13 0x0826d087 in mysql_parse (thd=0xa9b3c9f8, inBuf=0xa5b3ae0 "DELETE FROM B WHERE int_key = 52 ORDER BY RAND ( ) LIMIT 2", length=58, found_semicolon=0xa983b31c) at sql_parse.cc:5782 #14 0x0826db3f in dispatch_command (command=COM_QUERY, thd=0xa9b3c9f8, packet=0xa9b3e3b9 "DELETE FROM B WHERE int_key = 52 ORDER BY RAND ( ) LIMIT 2", packet_length=58) at sql_parse.cc:1059 #15 0x0826e930 in do_command (thd=0xa9b3c9f8) at sql_parse.cc:732 #16 0x0825f8bd in handle_one_connection (arg=0xa9b3c9f8) at sql_connect.cc:1134 #17 0x0057d32f in start_thread () from /lib/libpthread.so.0 #18 0x0049a27e in clone () from /lib/libc.so.6 #0 0x08419d25 in Transaction::getRelativeState (this=0xa820f448, record=0xa9039eb0, flags=<value optimized out>) at Transaction.cpp:870 #1 0x0840e95e in Table::fetchForUpdate (this=0xb7499268, transaction=0xa820f448, source=0xa9039eb0, usingIndex=false) at Table.cpp:3474 #2 0x083fc16b in StorageDatabase::nextRow (this=0xb709a118, storageTable=0xb61d8bb8, recordNumber=154, lockForUpdate=true) at StorageDatabase.cpp:284 #3 0x08400d0e in StorageTable::next (this=0xb61d8bb8, recordNumber=137, lockForUpdate=<value optimized out>) at StorageTable.cpp:169 #4 0x083f6aa7 in StorageInterface::rnd_next (this=0xa842458, buf=0xa842620 "Ч") at ha_falcon.cpp:578 #5 0x08344795 in filesort (thd=0xa9b592b8, table=0xa589160, sortorder=0xa86e6e8, s_length=1, select=0xa86e608, max_rows=2, sort_positions=true, examined_rows=0xa989c188) at filesort.cc:540 #6 0x082e97f1 in mysql_update (thd=0xa9b592b8, table_list=0xa86d190, fields=@0xa9b5a6d0, values=@0xa9b5a8fc, conds=0xa86e278, order_num=1, order=0xa86e3f0, limit=2, handle_duplicates=DUP_ERROR, ignore=false) at sql_update.cc:417 #7 0x0826790a in mysql_execute_command (thd=0xa9b592b8) at sql_parse.cc:2844 #8 0x0826d087 in mysql_parse (thd=0xa9b592b8, inBuf=0xa86cfb8 "UPDATE C AS X SET int_nokey = ( SELECT pk FROM E AS X WHERE X . int_key = 16 LIMIT 1) , filler = REPEAT ( ' X ' , 81 * 512 ) WHERE X . int_nokey = 42 ORDER BY RAND ( ) LIMIT 2", length=175, found_semicolon=0xa989d31c) at sql_parse.cc:5782 #9 0x0826db3f in dispatch_command (command=COM_QUERY, thd=0xa9b592b8, packet=0xa9b5b329 "UPDATE C AS X SET int_nokey = ( SELECT pk FROM E AS X WHERE X . int_key = 16 LIMIT 1) , filler = REPEAT ( ' X ' , 81 * 512 ) WHERE X . int_nokey = 42 ORDER BY RAND ( ) LIMIT 2", packet_length=175) at sql_parse.cc:1059 #10 0x0826e930 in do_command (thd=0xa9b592b8) at sql_parse.cc:732 #11 0x0825f8bd in handle_one_connection (arg=0xa9b592b8) at sql_connect.cc:1134 #12 0x0057d32f in start_thread () from /lib/libpthread.so.0 #13 0x0049a27e in clone () from /lib/libc.so.6
[6 Jun 2008 19:41]
Vladislav Vaintroub
The description suggest a cycle in the Transaction->waitingFor list
[6 Jun 2008 21:41]
Kevin Lewis
Vlad, I think you may be correct here. This reminds me of the syncUnique you added to Table::insertIndexes(). The check, then add needs to be done serially. So maybe we should add a syncDeadlock before the loop looking for a deadlock and release it after waitingFor = transaction;
[6 Jun 2008 21:52]
Philip Stoev
I think the major issue is not that this deadlock is resolved with a timeout rather than with the deadlock detector. What bothers me is that there are two cores occupied at 100% doing nothing useful for 50 seconds. This a DoS and is bad for the environment.
[7 Jun 2008 9:40]
Philip Stoev
Setting to Verified since it appears that the main cause is understood. I can still package a repeatable test case if needed -- just let me know.
[8 Jun 2008 22:39]
Ann Harrison
I think this is another case, like the unique index insert that is best handled by creating the dependency first, then checking.
[9 Jun 2008 0:24]
Vladislav Vaintroub
Ann, not sure. Certainly that would not be the simplest thing possible. Yes, there are algorithms for determining cycle in a single linked list ( like "Tortoise and hare" is described here http://en.wikipedia.org/wiki/Cycle_detection ), but dont know about the variation of the algorithm that will work with the list growing of shrinking dynamically while we're checking. I'm also afraid that in absense of synchronization, the list traversal might not be OK. I think of the case where current thread sets this->waitingFor to NULL, while concurrent thread got the old value this->waitingFor and goes on with traversal. Are all entries in this list still valid? May there be transactions there that no longer exist, pointers to freed memory and such?
[9 Jun 2008 20:58]
Kevin Lewis
Vlad, I think by moving the search after the link, then one of the threads will find the deadlock. The problem you described so well happens because threads look for a link to themselves before adding themselves to the linkage. The problem only occurs when both threads look then link. Now if they both link then look, one of them will find the deadlock. What do you think of the following change? - for (Transaction *trans = transaction->waitingFor; trans; trans = trans->waitingFor) - if (trans == this) - return true; // OK, add a reference to the transaction to keep the object around, // then wait for it to go away waitingFor = transaction; transaction->addRef(); + // But if waiting would cause a deadlock, don't try it + + for (Transaction *trans = transaction->waitingFor; trans; trans = trans->waitingFor) + if (trans == this) + { + waitingFor = NULL; + transaction->release(); + return true; + }
[9 Jun 2008 21:53]
Vladislav Vaintroub
Kevin , I'm *still* not 100% convinced. Say one transaction has determined a deadlock like described above, sets its waitingFor to NULL. A concurrent thread doing the same loop will not necessarily know about that, because it has read the old value of this->waitingFor some nanoseconds ago and continues with traversal. This way it is possible that concurrent thread will access transaction->waitingFor member , after transaction->release() happens. What I'm most afraid of, concurrent thread might do it even when transaction variable goes competely out of scope and its memory is released. Would this be possible or not (if not why not)?
[10 Jun 2008 4:03]
Kevin Lewis
It would not be possible for one of the transactions to be completely freed because there is a shared lock on transactionManager->activeTransactions.syncObject. It is possible, however, that instead of neither thread noticing the deadlock and waiting on each other for 50 seconds, that both of them will notice, if the timing is just right. But this is a small price to pay for not adding a new sync object that is locked exclusively every time one transaction waits on another.
[10 Jun 2008 12:07]
Philip Stoev
I just had a situation where there is only a single client, and it is blocked (not CPU spinning) in waitForTransaction(). Here is the output from SyncObject::stalled(): Stalled threads Thread 0x77e95e8 (1264421184) sleep=1, grant=0, locks=1, who 0, parent=(nil) Pending Transaction::waitForTransaction state 0 (2) syncObject 0x2aaaaefac488 Stalled synchronization objects: SyncObject 2aaaaefac488: state 0, monitor 0, waiters 0 This situation is of a lesser concern because it will eventually time out and the CPU is not being occupied.
[10 Jun 2008 21:03]
Vladislav Vaintroub
Kevin, >It would not be possible for one of the transactions to be completely freed because there is a shared lock on transactionManager->activeTransactions.syncObject. Agreed about Transaction::waitForTransaction(TransId). Unfortunately, there is no such lock at the moment within Transaction::getRelativeState(), that contains identical deadock detection loop. Previous comment by Philip suggests that we're waiting for a transaction that does no more exist. I asked Philip specifically about the callstack, he says it happens in getRelativeState. We probably also should introduce activeTransaction lock there or is there anything else that guarantees only active transaction are present in the waitFor list?
[10 Jun 2008 21:30]
Vladislav Vaintroub
Adding related Bug#37049. The callstack attached to it contains getRelativeState() and it has, as it seems invalid transaction object in it.
[11 Jun 2008 2:16]
Kevin Lewis
Vlad, very good find! I think that Transaction::getRelativeState() must also be protected aainst disapearing transactions while it is looking though them. Sending you by email a possible change to Transaction::getRelativeState and Transaction::waitForTransaction. Please create a patch with it if you like it.
[11 Jun 2008 13:04]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/47732 2699 Vladislav Vaintroub 2008-06-11 Bug#37251 - Livelock between UPDATE and DELETE threads The problem was a bug in deadlock detection code. 2 threads with mutual dependencies doing deadlock at exactly same time both decide there is no deadlock, add themselves to waitFor list, thus building a cycle in it. Both threads will wait for each other until lock wait timeout happens (a deadlock). Meanwhile, any other thread that traverses waitFor list may enter the cycle described above and will loop until deadlock is resolved. Solution here is to modify deadlock detection to add dependency first, than check, then remove dependency if cycle is found. Care is taken to prevent compiler optimizations and hopefully the combination of volatile specifier and COMPARE_EXHANGE will generate a memory barrier to prevent reading elements while they are modified. Note, that instead of this fancy solution, plain old SyncObject could be used. The loop is short, the lock is cheap, but I can't estimate contention for the moment. If current lock-free implementation turns out to be buggy, this alternative can be considered. No testcase provided, the situation described in bug report can not be reliably reproduced.
[16 Jun 2008 15:39]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/47917 2704 Vladislav Vaintroub 2008-06-16 Bug#37251 - Livelock between UPDATE and DELETE threads The problem was a bug in deadlock detection code. 2 threads with mutual dependencies doing deadlock at exactly same time both decide there is no deadlock, add themselves to waitFor list, thus building a cycle in it. Both threads will wait for each other until lock wait timeout happens (a deadlock). Meanwhile, any other thread that traverses waitFor list may enter the cycle described above and will loop until deadlock is resolved. Solution here is to modify deadlock detection to add dependency first, than check, then remove dependency if cycle is found. Care is taken to prevent compiler optimizations and hopefully the combination of volatile specifier and COMPARE_EXHANGE will generate a memory barrier to prevent reading elements while they are modified. Note, that instead of this fancy solution, plain old SyncObject could be used. The loop is short, the lock is cheap, but I can't estimate contention for the moment. If current lock-free implementation turns out to be buggy, this alternative can be considered. No testcase provided, the situation described in bug report can not be reliably reproduced.
[16 Jun 2008 23:45]
Kevin Lewis
This implementation of the fix looks much better.
[22 Aug 2008 20:12]
Kevin Lewis
This fix is in version 6.0.6
[8 Jan 2009 10:34]
MC Brown
A note has been added to the 6.0.6 changelog: When running a concurrent scenario involving transactions, each executing a small number of DELETE and UPDATE operations on a small number of records on FALCON tables, a deadlock could occur