Bug #32992 | falcon_read-committed.test causes a duplicate key error *now* occasionally | ||
---|---|---|---|
Submitted: | 5 Dec 2007 9:36 | Modified: | 29 Sep 2008 19:49 |
Reporter: | Alexander Nozdrin | Email Updates: | |
Status: | Can't repeat | Impact on me: | |
Category: | MySQL Server: Falcon storage engine | Severity: | S1 (Critical) |
Version: | 6.0 | OS: | Any |
Assigned to: | Hakan Küçükyılmaz | CPU Architecture: | Any |
Tags: | pushbuild, test failure |
[5 Dec 2007 9:36]
Alexander Nozdrin
[5 Dec 2007 10:01]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/39292 ChangeSet@1.2740, 2007-12-05 13:05:02+03:00, anozdrin@station. +2 -0 Disable failing tests: - BUG#32992: falcon_read-committed.test causes a server crash occasionally; - BUG#32993: falcon_ts.test fails sporadically; - BUG#32994: subselect_sj fails sporadically (timeout).
[5 Dec 2007 10:34]
Alexander Nozdrin
The test case disabled in 6.0-runtime.
[5 Dec 2007 11:53]
Hakan Küçükyılmaz
Verified as described. Test case fails from time to time.
[6 Dec 2007 10:01]
Bugs System
Pushed into 6.0.5-alpha
[6 Dec 2007 20:51]
Kevin Lewis
This might be associated with the previous test that restarts the engine. Falcon_options2.test. Jim will need to work on this. It might be similar to the 'final bug fix' Jim pushed Monday into 6.0.4 before the clone-off which was related to restarting the engine several times. I don't know how common this crash would be. I think this is very uncommon and is related to quick restarts. But for the fix that Jim pushed Monday, he had this to say... The bug was a failure to rebuild a Section page when required during recovery. The bug has been there indefinitely. Something changed in the timing and sequence of the test suite that brought it out. And there was probably no way we could have caught the bug by systematic testing. I caught the bug by doing a full database validation following recovery that would force a bugcheck on any database corruption. While this validated the diagnosis and fix, it should also give us a greater confidence that the recovery is robust and tested. I will add a configuration parameter post beta to force post-recovery validation. Since recovery is automatic on every server start and validation is potentially quite expensive, I don't want to make it the default. But we should always use it when testing. Please confirm that my fix got into the code base before the clone-off? I consider it essential for database integrity.
[17 Dec 2007 17:01]
Kevin Lewis
Triage values set according to this assessment by Jim; For the last week or two we have been getting sporadic failures on the Falcon pushbuild test suite on a server restart. It is reproducible in the sense that almost every pushbuild shows a single instance of the problem. However, we have not gotten a failure on a hand-run test series. Resultantly, we don't have anything that might point us at the problem -- no left-over database files, no traces, nothing. I believe that there is a timing sensitive recovery problem. It's infrequency, however, suggests that it is an artifact of the test suite. The fact that we have never seen it when manually running the test suite or during regular debugging and testing suggests to me that it highly unlikely to occur in situ. Unless we can find a way to reproduce it in an environment that can capture state that we send Falcon to beta with the problem outstanding. The recent history of recovery bugs has been that a snapshot of the database and log files had lead to diagnosis, fix, and recovery of the database. My estimate is that the probability of hitting the bug in the field is probably about 40%, and if so, a fix leading to a successful recovery is about 80%. On the other hand, without post-failure files or a trace, there is absolutely nothing to go on.
[23 Jan 2008 21:39]
Kevin Lewis
Jim Starkey pushed a change to Falcon that requires it ro wait for a Page Cache Flush at the end of recovery. It used to do this, but when the pool of IO threads were added, the recovery process just called for a flush to happen and then returned. Now the flush waits for all the IO threads to complete. When recovery does not wait for the flush to complete, new blocks can be written to the serial log before all the pages from the previous recovery can be written to disk. These blocks indicate that recovery should start after the previous one. If a shutdown happens before these pages are fully written, The following recovery will not start early enough. This very likely solves the double recovery problems exhibited by Bugs 32992, 33608 & 33665. I am putting these bugs into Patch Pending so that this fix can be verified, or not.
[25 Jan 2008 23:08]
Kevin Lewis
Explanation of the double recovery problem by Ann Harrison; ============================== We have been plagued recently with failures in recovery after a previous recovery had not succeeded. That turns out to have been the result of a change to the page cache checkpoint algorithm. The fix was simple, once the cause was understood. I'm going to try to explain the interaction of recovery and checkpointing. When Falcon starts, it determines whether it was shutdown cleanly. If not, it invokes recovery to apply the remaining active parts of the serial log. Recovery begins by finding the position in the log after which changes may need to be applied. That position is called the recovery start point. Every entry in the serial log includes the recovery start point that was current when the entry was written. If the shutdown was not clean, Falcon begins recovery by identifying the active log, reading the last entry, then backing up to its recovery start point. Recovery consists of applying changes in the serial log to page images in the page cache, until they are all done, then flushing the cache by forcing a checkpoint. The test systems often shut down the server before Falcon has moved all changes from the serial log to the main database. After some errors, the test system restarts the server again, causing Falcon to start recovery, then shutdown again. For the last several weeks, recovery has failed on the second restart. One odd characteristic Jim noticed was that if he restarted from an earlier stable point in the serial log, recovery succeeded. Originally, a cache checkpoint consisted of a single I/O thread writing all the dirty pages from the cache, using buffered I/O without an fsync. That was pretty fast, but not durable after an ill-timed crash. Adding an fsync caused a checkpoint to take longer than the interval between checkpoints, which was a performance problem. A checkpoint now causes dirty pages in the cache to be consolidated in a separate area so contiguous pages can be flushed in a single write, then a bevy of I/O threads push the page clusters to disk using direct I/O or buffered I/O with a sync on each write (O_SYNC) depending on the capabilities of the system. The original page cache checkpoint got all pages into the system file cache before it continued. The new system is asynchronous. The problem was that the recovery system declared the database recovered before the dirty pages in the page cache were actually on disk. Other processing began, including writes to the serial log with a new recovery start point. If the server stopped at that point, the next attempt to recover would fail because the first recovery did not get fully applied to the database. The solution is for recovery to wait until the I/O threads are complete.
[11 Feb 2008 19:51]
Kevin Lewis
Patch is in mysql-6.0-falcon-team and mysql-6.0-release version 6.0.4
[12 Mar 2008 23:01]
Bugs System
Pushed into 6.0.4-alpha
[20 Apr 2008 21:21]
Konstantin Osipov
The test now fails with a different symptom: mysqltest: At line 30: query 'INSERT INTO t1 VALUES (1)' failed: 1022: Can't write; duplicate key in table 't1' The result from queries just before the failure was: SET storage_engine = Falcon; DROP TABLE IF EXISTS t1; CREATE TABLE t1 (a int); SHOW CREATE TABLE t1; Table Create Table t1 CREATE TABLE `t1` ( `a` int(11) DEFAULT NULL ) ENGINE=Falcon DEFAULT CHARSET=latin1 DROP TABLE t1; SET storage_engine = Falcon; SET @@tx_isolation = 'REPEATABLE-READ'; SET @@autocommit=0; SET @@tx_isolation = 'REPEATABLE-READ'; SET @@autocommit=0; BEGIN; CREATE TABLE t1 (a int); COMMIT; BEGIN; INSERT INTO t1 VALUES (1); https://intranet.mysql.com/secure/pushbuild/getlog.pl?dir=mysql-6.0&entry=kosipov@bodhi.no... The same failure occurs on other Windows hosts: https://intranet.mysql.com/secure/pushbuild/getlog.pl?dir=mysql-6.0&entry=kosipov@bodhi.no...
[26 May 2008 14:01]
Ingo Strüwing
I'm going to disable the test for now. Please re-enable after fix.
[26 May 2008 14:03]
Ingo Strüwing
Weird - I found it in mysql-test/suite/falcon/t/disabled.def already, but the test failed anyway...
[10 Jul 2008 17:00]
Kevin Lewis
Hakan, The current issue with this testcase is that it fails intermittently. It has been moved to falcon_team to make pushbuild green. falcon_team has not been running since the move to bazaar. The error is failed: 1022: Can't write; duplicate key in table 't1'. But 't1' doe not have ANY keys. I suspect this is a testcase issue somehow. Please try to tweak the testcase until is occurs more reliably or it does not occur at all. I ran it 50 times in a row and could not reproduce this.