Bug #37775 | main.locktrans_innodb and rpl.rpl_locktrans_innodb fail randomly | ||
---|---|---|---|
Submitted: | 1 Jul 2008 15:46 | Modified: | 15 Oct 2012 14:50 |
Reporter: | Alexander Nozdrin | Email Updates: | |
Status: | Unsupported | Impact on me: | |
Category: | Tests: Server | Severity: | S3 (Non-critical) |
Version: | 6.0-BK,5.4 | OS: | Any |
Assigned to: | CPU Architecture: | Any | |
Tags: | experimental, pb2, pushbuild, sporadic, test failure, tf54, timeout |
[1 Jul 2008 15:46]
Alexander Nozdrin
[8 Dec 2008 10:42]
Sven Sandberg
Symptom 2 and 3 are most likely the same as BUG#38070. The test uses only 1 second for innodb_lock_wait_timeout. That's too small when running on a slow or unpredictable environment. So it would be great to increase the timeout. However, there is one problem: the test has 16 places where it expects a timeout to happen. So on any successful run, it will wait 16*timeout seconds. Unfortunately, innodb_lock_wait_timeout is not dynamic, so we must use the same value throughout the test. I can think of two options: (1) increase the timeout and mark the test as 'big' (2) split the test into two: one that contains all places where it is expected to timeout and one containing the rest (1) is not nice since it causes the test to not run on pushbuild. Most 'big' tests are never run by anyone and therefore remain unmaintained. I asked Ingo to investigate (2).
[8 Dec 2008 16:24]
Ingo Strüwing
It seems that our systems are slow and overloaded. Wouldn't it be great to work in a company that produces high-performance computers? The test case must provoke several lock wait timeouts. To keep the run time of the test case short, we selected the shortes timeout value possible, 1 second. Now it turns out that a few process changes plus the code execution for a few locking statements takes longer than a second (!). So we experience timeouts for statements that are not expected to time out. Unfortunately there is no safe solution. Some months ago, we didn't have problems with the timeout value of 1 second. We had less tests, less code, and lees teams requesting tests on the servers. We could increase the timeout value for now. But however we set it, one day the machines we will have more tests, more code, and more teams, we may add more and even slower machines, and finally we will have to increase the value again. Anyway, increasing the timeout value is the only option. And we have two options to limit the impact. 1. Make the test a --big-test and increase the timeout value. 2. Split the test into a part without expected timeouts and a high timeout value, and a part with expected timeout and a 1 second timeout value. However, this option means that we might fail to test what we want to test. When the machines become so slow that every test times out before the expected timout happens, we don't prove that the statements time out due to lock collisions. I had another issue with 2. The test case tries to prove certain aspects of transactional locking in a certain order. Moving some tests into a new test case might destroy the context. Careful documentation would be required. But looking at the test case, the expected timeouts concentrate in the last third of the test case, with one exception. So it wouldn't become a problem to split the test case. However, my other concern about missing the test objective holds true. So I recommend to go with 1. It is quickly and easily to accomplish. If one day we receive support by a company that is capable of providing faster hardware, it can be reverted quickly and easily. Just put a comment into the test case explaining, why it bacame a --big-test, and how to revert it, should the times be better.
[5 Jan 2009 23:09]
Patrick Crews
This test fails regardless of the lock_wait_timeout value. I experimented with this on both this Bug and Bug #37635 Some big tests fails in 5.1, for the read_many_rows_innodb test. No matter what value we set lock_wait_timeout, the test will not proceed as expected. I suspect this is a larger problem.
[15 Jun 2009 14:23]
Patrick Crews
Watching this test for one month to see if any further Pushbuild failures occur. It has been a while since this test has failed in this particular manner and I have been unable to duplicate this issue on my machine with multiple run (20k on Mac OSX 10.5 and 2k on Windows XP 32 bit). Will close out if no failures have been observed in that time.
[7 Jul 2009 12:14]
Alexander Nozdrin
It's still failing unfortunately. The last seen failure: http://pb2.norway.sun.com/web.py?template=mysql_show_test_failure&test_failure_id=2242613 XRef2: http://pb2.norway.sun.com/web.py?template=mysql_show_test_failure&test_failure_id=&test_ou...
[29 Sep 2009 7:04]
Alexander Nozdrin
The test case has been marked experimental.