MySQL Bugs: #34890: falcon_bug_22150.test fails on Pushbuild

Bug #34890	falcon_bug_22150.test fails on Pushbuild
Submitted:	27 Feb 2008 16:26	Modified:	8 Jan 2009 9:45
Reporter:	Hakan Küçükyılmaz	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Falcon storage engine	Severity:	S3 (Non-critical)
Version:	6.0.4	OS:	Any
Assigned to:	Kevin Lewis	CPU Architecture:	Any

Description:
falcon_bug_22150.test fails on Pushbuild.

How to repeat:
See Pushbuild for mysql-6.0-falcon and mysql-6.0-falcon-team tree.

falcon.falcon_bug_22150        [ fail ]

mysqltest: At line 75: query 'CALL p1()' failed: 1205: Lock wait timeout exceeded; try restarting transaction

The result from queries just before the failure was:
< snip >
SET v2 = rand() * 10000;
UPDATE t1 SET a = v2 WHERE a = v1;
/* SELECT ' update', v1; */
SET v1 = v1 + 1;
IF v1 MOD 50 = 0 THEN
/* SELECT ' delete'; */
DELETE FROM t1;
END IF;
END WHILE;
END//
# Sent call p1() to the server but do not pull the results
CALL p1();
# Switch to connection conn1
CALL p1();
# Switch to connection default
# Pull the results of the preceeding call p1()
# Sent call p1() to the server but do not pull the results
CALL p1();
# Switch to connection conn1
CALL p1();

More results from queries before failure can be found in /dev/shm/var-falcon-102/log/falcon_bug_22150.log

Verfied by pushbuild

Analyzed falcon_bug_22150.test for timeouts. I rewrote the test to use a seed value for rand() and run the test with a wrapper script to increase the seed for every new run. The timeout is hit with different seed values. This means, that the timeout is not related rand().

Recently added falcon_bug_34351_E which regularly shows a lock wait timeout.  While debugging this timout, I can see that it does not wait for 50 seconds.  So it is a faux timeout.

Added falcon.falcon_bug_34890.test (instead of falcon_bug_34351_E)

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/44497

ChangeSet@1.2610, 2008-03-27 01:27:57-05:00, klewis@klewis-mysql. +6 -0
  Bug#34890 - hold the SyncObject::mutex to protect thread->lockGranted and 
  send it into thread->sleep so that it will overlap the PTHREADS mutex.  
  
  This prevents a race condition where a thread is woken up inadvertently,
  and while it is looping around to go back to sleep, it somehow misses
  the event or signal by the thread holding the lock.  When this happens,
  the threads waits on that SyncObject until another thread locks and 
  releases it.  This is a thread stall.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/44544

ChangeSet@1.2613, 2008-03-27 22:48:57-05:00, klewis@klewis-mysql. +2 -0
  Bug#34890 - Avoid deadlock caused by getting the callers lock while 
  holding the local lock.

Bug#34890 shows various kinds of stalls and seeming deadlocks in threads waiting for syncObjects. After many days of debugging and trying out changes suggested by Kelly Long who is also investigating Falcon thread stalls in high concurrency DBT2 tests, I finally had a breakthrough. I now can get falcon_bug_34890 to complete without stalling.

This test specifically focuses on the Transaction::syncObject, where a thread representing one transaction waits for another. Kelly has seen and described thread stalls on other sync objects. I am convinced that a thread that is waiting to be signaled misses the signal from the thread who gives up that syncObject. It does not take itself off the que, so it keeps waiting until the next time that syncObject is released, or the next time that thread is signaled.

Kelly sees stalls in various threads that almost always eventually wakeup. But in the case of a Transaction::syncObject, I see threads waiting on older committed transactions that are no longer locked. They wait for 50 seconds and then timeout. There is no deadlock.

The solution implemented was first conceived and tried by Kelly Long. It involves holding SyncObject::mutex longer after the lock attempt, while SyncObject::lockGranted is being tested. It is now passed into Thread::sleep and released at the last line before sleeping.

Hakan and Kelly have run a few DBT2 tests with higher concurrency against this code and the standard deviation of multiple runs has gone way down. In addition, the performance of high concurrency tests has returned to previous levels or higher.

Along with this push is a changeset which eliminates the code that sets precedence in writing pages to disk. This code is not needed in Falcon anymore. It was used in Netfrastructure which did not have a serial log. Kelly found that this code can sometimes traverse a BDB chain in the opposite direction of normal page access by client threads. That is a source of a BDB deadlock that Kelly has seen.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/44711

ChangeSet@1.2629, 2008-03-31 17:01:16-05:00, klewis@klewis-mysql. +5 -0
  Bug#34890 - Do not allow recursive locking of mutexes since it is not supported by pthreads.

Pushed into 6.0.5-alpha

Internal change only. No documentation needed.