MySQL Bugs: #37775: main.locktrans_innodb and rpl.rpl_locktrans

Bug #37775	main.locktrans_innodb and rpl.rpl_locktrans_innodb fail randomly
Submitted:	1 Jul 2008 15:46	Modified:	15 Oct 2012 14:50
Reporter:	Alexander Nozdrin	Email Updates:
Status:	Unsupported	Impact on me:	None
Category:	Tests: Server	Severity:	S3 (Non-critical)
Version:	6.0-BK,5.4	OS:	Any
Assigned to:		CPU Architecture:	Any
Tags:	experimental, pb2, pushbuild, sporadic, test failure, tf54, timeout

Description:
Symptom 1:
----------------------------------------------------------
mysqltest: In included file "./include/locktrans.inc": At line 688: query 'reap' failed: 1205: Lock wait timeout exceeded; try restarting transaction

The result from queries just before the failure was:
< snip >
# which waits in background until first read lock is released.
LOCK TABLE t1 LOW_PRIORITY WRITE;
# connection default.
# Wait for the helper thread to sit on its lock.
# connection conn2.
# Take a non-transactional READ lock,
# which waits in background for the LOW_PRIORITY WRITE lock.
LOCK TABLE t1 READ;
# connection default.
# Wait for the helper threads to sit on their locks.
# Unlock this connections non-transactional lock.
UNLOCK TABLES;
# connection conn1.
# Now the LOW_PRIORITY WRITE lock is taken.
# Insert a value.
INSERT INTO t1 VALUES(1111);
# Unlock table.
UNLOCK TABLES;
# connection conn2.
# Now the READ lock is taken.
----------------------------------------------------------

Symptom 2:
----------------------------------------------------------
mysqltest: In included file "./include/locktrans.inc": At line 1026: query 'LOCK TABLE t2 WRITE' failed: 1205: Lock wait timeout exceeded; try restarting transaction

The result from queries just before the failure was:
< snip >
COMMIT;
# Create a trigger on t1 that updates t2.
CREATE TRIGGER t1_ai AFTER INSERT ON t1 FOR EACH ROW
BEGIN
# This will time out. So use a small value.
INSERT INTO t2 SELECT GET_LOCK("mysqltest1", 1);
UPDATE t2 SET c2= c2 + 111;
INSERT INTO t2 SELECT RELEASE_LOCK("mysqltest1");
END//
# Take an SQL lock which blocks the trigger.
SELECT GET_LOCK("mysqltest1", 10);
GET_LOCK("mysqltest1", 10)
1
# connection conn1.
# Insert into t1 to fire trigger. This waits on GET_LOCK.
INSERT INTO t1 VALUES(111222);
# connection default.
# Wait for the helper thread to sit on its lock.
# Take a write lock. This waits until the trigger times out.
LOCK TABLE t2 WRITE;
----------------------------------------------------------

Symptom 3:
----------------------------------------------------------
main.locktrans_innodb          [ fail ]  timeout
----------------------------------------------------------

How to repeat:
https://intranet.mysql.com/secure/pushbuild/xref.pl?testname=main.locktrans_innodb

Symptom 2 and 3 are most likely the same as BUG#38070. The test uses only 1 second for innodb_lock_wait_timeout. That's too small when running on a slow or unpredictable environment.

So it would be great to increase the timeout. However, there is one problem: the test has 16 places where it expects a timeout to happen. So on any successful run, it will wait 16*timeout seconds. Unfortunately, innodb_lock_wait_timeout is not dynamic, so we must use the same value throughout the test.

I can think of two options:

 (1) increase the timeout and mark the test as 'big'

 (2) split the test into two: one that contains all places where it is expected to timeout and one containing the rest

(1) is not nice since it causes the test to not run on pushbuild. Most 'big' tests are never run by anyone and therefore remain unmaintained. I asked Ingo to investigate (2).

It seems that our systems are slow and overloaded. Wouldn't it be great to work in a company that produces high-performance computers?

The test case must provoke several lock wait timeouts. To keep the run time of the test case short, we selected the shortes timeout value possible, 1 second. Now it turns out that a few process changes plus the code execution for a few locking statements takes longer than a second (!). So we experience timeouts for statements that are not expected to time out.

Unfortunately there is no safe solution. Some months ago, we didn't have problems with the timeout value of 1 second. We had less tests, less code, and lees teams requesting tests on the servers. We could increase the timeout value for now. But however we set it, one day the machines we will have more tests, more code, and more teams, we may add more and even slower machines, and finally we will have to increase the value again.

Anyway, increasing the timeout value is the only option. And we have two options to limit the impact.

1. Make the test a --big-test and increase the timeout value.

2. Split the test into a part without expected timeouts and a high timeout value, and a part with expected timeout and a 1 second timeout value. However, this option means that we might fail to test what we want to test. When the machines become so slow that every test times out before the expected timout happens, we don't prove that the statements time out due to lock collisions.

I had another issue with 2. The test case tries to prove certain aspects of transactional locking in a certain order. Moving some tests into a new test case might destroy the context. Careful documentation would be required. But looking at the test case, the expected timeouts concentrate in the last third of the test case, with one exception. So it wouldn't become a problem to split the test case. However, my other concern about missing the test objective holds true.

So I recommend to go with 1. It is quickly and easily to accomplish. If one day we receive support by a company that is capable of providing faster hardware, it can be reverted quickly and easily. Just put a comment into the test case explaining, why it bacame a --big-test, and how to revert it, should the times be better.

This test fails regardless of the lock_wait_timeout value.  I experimented with this on both this Bug and Bug #37635	Some big tests fails in 5.1, for the read_many_rows_innodb test.

No matter what value we set lock_wait_timeout, the test will not proceed as expected.

I suspect this is a larger problem.

Watching this test for one month to see if any further Pushbuild failures occur.

It has been a while since this test has failed in this particular manner and I have been unable to duplicate this issue on my machine with multiple run (20k on Mac OSX 10.5 and 2k on Windows XP 32 bit).

Will close out if no failures have been observed in that time.

It's still failing unfortunately.

The last seen failure:
http://pb2.norway.sun.com/web.py?template=mysql_show_test_failure&test_failure_id=2242613

XRef2:
http://pb2.norway.sun.com/web.py?template=mysql_show_test_failure&test_failure_id=&test_ou...

The test case has been marked experimental.