Bug #34890 | falcon_bug_22150.test fails on Pushbuild | ||
---|---|---|---|
Submitted: | 27 Feb 2008 16:26 | Modified: | 8 Jan 2009 9:45 |
Reporter: | Hakan Küçükyılmaz | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Server: Falcon storage engine | Severity: | S3 (Non-critical) |
Version: | 6.0.4 | OS: | Any |
Assigned to: | Kevin Lewis | CPU Architecture: | Any |
[27 Feb 2008 16:26]
Hakan Küçükyılmaz
[27 Feb 2008 20:47]
Kevin Lewis
Verfied by pushbuild
[3 Mar 2008 5:16]
Hakan Küçükyılmaz
Analyzed falcon_bug_22150.test for timeouts. I rewrote the test to use a seed value for rand() and run the test with a wrapper script to increase the seed for every new run. The timeout is hit with different seed values. This means, that the timeout is not related rand().
[5 Mar 2008 19:45]
Kevin Lewis
Recently added falcon_bug_34351_E which regularly shows a lock wait timeout. While debugging this timout, I can see that it does not wait for 50 seconds. So it is a faux timeout.
[5 Mar 2008 20:57]
Kevin Lewis
Added falcon.falcon_bug_34890.test (instead of falcon_bug_34351_E)
[27 Mar 2008 6:28]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/44497 ChangeSet@1.2610, 2008-03-27 01:27:57-05:00, klewis@klewis-mysql. +6 -0 Bug#34890 - hold the SyncObject::mutex to protect thread->lockGranted and send it into thread->sleep so that it will overlap the PTHREADS mutex. This prevents a race condition where a thread is woken up inadvertently, and while it is looping around to go back to sleep, it somehow misses the event or signal by the thread holding the lock. When this happens, the threads waits on that SyncObject until another thread locks and releases it. This is a thread stall.
[28 Mar 2008 3:49]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/44544 ChangeSet@1.2613, 2008-03-27 22:48:57-05:00, klewis@klewis-mysql. +2 -0 Bug#34890 - Avoid deadlock caused by getting the callers lock while holding the local lock.
[28 Mar 2008 21:23]
Kevin Lewis
Bug#34890 shows various kinds of stalls and seeming deadlocks in threads waiting for syncObjects. After many days of debugging and trying out changes suggested by Kelly Long who is also investigating Falcon thread stalls in high concurrency DBT2 tests, I finally had a breakthrough. I now can get falcon_bug_34890 to complete without stalling. This test specifically focuses on the Transaction::syncObject, where a thread representing one transaction waits for another. Kelly has seen and described thread stalls on other sync objects. I am convinced that a thread that is waiting to be signaled misses the signal from the thread who gives up that syncObject. It does not take itself off the que, so it keeps waiting until the next time that syncObject is released, or the next time that thread is signaled. Kelly sees stalls in various threads that almost always eventually wakeup. But in the case of a Transaction::syncObject, I see threads waiting on older committed transactions that are no longer locked. They wait for 50 seconds and then timeout. There is no deadlock. The solution implemented was first conceived and tried by Kelly Long. It involves holding SyncObject::mutex longer after the lock attempt, while SyncObject::lockGranted is being tested. It is now passed into Thread::sleep and released at the last line before sleeping. Hakan and Kelly have run a few DBT2 tests with higher concurrency against this code and the standard deviation of multiple runs has gone way down. In addition, the performance of high concurrency tests has returned to previous levels or higher. Along with this push is a changeset which eliminates the code that sets precedence in writing pages to disk. This code is not needed in Falcon anymore. It was used in Netfrastructure which did not have a serial log. Kelly found that this code can sometimes traverse a BDB chain in the opposite direction of normal page access by client threads. That is a source of a BDB deadlock that Kelly has seen.
[31 Mar 2008 22:02]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/44711 ChangeSet@1.2629, 2008-03-31 17:01:16-05:00, klewis@klewis-mysql. +5 -0 Bug#34890 - Do not allow recursive locking of mutexes since it is not supported by pthreads.
[10 Apr 2008 19:14]
Bugs System
Pushed into 6.0.5-alpha
[8 Jan 2009 9:45]
MC Brown
Internal change only. No documentation needed.