Bug #34890 falcon_bug_22150.test fails on Pushbuild
Submitted: 27 Feb 2008 16:26 Modified: 8 Jan 2009 9:45
Reporter: Hakan Küçükyılmaz Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Falcon storage engine Severity:S3 (Non-critical)
Version:6.0.4 OS:Any
Assigned to: Kevin Lewis CPU Architecture:Any

[27 Feb 2008 16:26] Hakan Küçükyılmaz
Description:
falcon_bug_22150.test fails on Pushbuild.

How to repeat:
See Pushbuild for mysql-6.0-falcon and mysql-6.0-falcon-team tree.

falcon.falcon_bug_22150        [ fail ]

mysqltest: At line 75: query 'CALL p1()' failed: 1205: Lock wait timeout exceeded; try restarting transaction

The result from queries just before the failure was:
< snip >
SET v2 = rand() * 10000;
UPDATE t1 SET a = v2 WHERE a = v1;
/* SELECT ' update', v1; */
SET v1 = v1 + 1;
IF v1 MOD 50 = 0 THEN
/* SELECT ' delete'; */
DELETE FROM t1;
END IF;
END WHILE;
END//
# Sent call p1() to the server but do not pull the results
CALL p1();
# Switch to connection conn1
CALL p1();
# Switch to connection default
# Pull the results of the preceeding call p1()
# Sent call p1() to the server but do not pull the results
CALL p1();
# Switch to connection conn1
CALL p1();

More results from queries before failure can be found in /dev/shm/var-falcon-102/log/falcon_bug_22150.log
[27 Feb 2008 20:47] Kevin Lewis
Verfied by pushbuild
[3 Mar 2008 5:16] Hakan Küçükyılmaz
Analyzed falcon_bug_22150.test for timeouts. I rewrote the test to use a seed value for rand() and run the test with a wrapper script to increase the seed for every new run. The timeout is hit with different seed values. This means, that the timeout is not related rand().
[5 Mar 2008 19:45] Kevin Lewis
Recently added falcon_bug_34351_E which regularly shows a lock wait timeout.  While debugging this timout, I can see that it does not wait for 50 seconds.  So it is a faux timeout.
[5 Mar 2008 20:57] Kevin Lewis
Added falcon.falcon_bug_34890.test (instead of falcon_bug_34351_E)
[27 Mar 2008 6:28] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/44497

ChangeSet@1.2610, 2008-03-27 01:27:57-05:00, klewis@klewis-mysql. +6 -0
  Bug#34890 - hold the SyncObject::mutex to protect thread->lockGranted and 
  send it into thread->sleep so that it will overlap the PTHREADS mutex.  
  
  This prevents a race condition where a thread is woken up inadvertently,
  and while it is looping around to go back to sleep, it somehow misses
  the event or signal by the thread holding the lock.  When this happens,
  the threads waits on that SyncObject until another thread locks and 
  releases it.  This is a thread stall.
[28 Mar 2008 3:49] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/44544

ChangeSet@1.2613, 2008-03-27 22:48:57-05:00, klewis@klewis-mysql. +2 -0
  Bug#34890 - Avoid deadlock caused by getting the callers lock while 
  holding the local lock.
[28 Mar 2008 21:23] Kevin Lewis
Bug#34890 shows various kinds of stalls and seeming deadlocks in threads waiting for syncObjects. After many days of debugging and trying out changes suggested by Kelly Long who is also investigating Falcon thread stalls in high concurrency DBT2 tests, I finally had a breakthrough. I now can get falcon_bug_34890 to complete without stalling.  

This test specifically focuses on the Transaction::syncObject, where a thread representing one transaction waits for another.  Kelly has seen and described thread stalls on other sync objects.  I am convinced that a thread that is waiting to be signaled misses the signal from the thread who gives up that syncObject.  It does not take itself off the que, so it keeps waiting until the next time that syncObject is released, or the next time that thread is signaled.  

Kelly sees stalls in various threads that almost always eventually wakeup. But in the case of a Transaction::syncObject, I see threads waiting on older committed transactions that are no longer locked.  They wait for 50 seconds and then timeout.  There is no deadlock.  

The solution implemented was first conceived and tried by Kelly Long.  It involves holding SyncObject::mutex longer after the lock attempt, while SyncObject::lockGranted is being tested.  It is now passed into Thread::sleep and released at the last line before sleeping.

Hakan and Kelly have run a few DBT2 tests with higher concurrency against this code and the standard deviation of multiple runs has gone way down.  In addition, the performance of high concurrency tests has returned to previous levels or higher.

Along with this push is a changeset which eliminates the code that sets precedence in writing pages to disk.  This code is not needed in Falcon anymore.  It was used in Netfrastructure which did not have a serial log.  Kelly found that this code can sometimes traverse a BDB chain in the opposite direction of normal page access by client threads.  That is a source of a BDB deadlock that Kelly has seen.
[31 Mar 2008 22:02] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/44711

ChangeSet@1.2629, 2008-03-31 17:01:16-05:00, klewis@klewis-mysql. +5 -0
  Bug#34890 - Do not allow recursive locking of mutexes since it is not supported by pthreads.
[10 Apr 2008 19:14] Bugs System
Pushed into 6.0.5-alpha
[8 Jan 2009 9:45] MC Brown
Internal change only. No documentation needed.