Bug #46746 Innodb transactional consistency is broken with UPDATE + DELETE
Submitted: 16 Aug 2009 15:21 Modified: 31 Aug 2009 13:45
Reporter: Philip Stoev
Status: Not a Bug
Category: MySQL Server: InnoDB storage engine  Severity: S1 (Critical)
Version: 5.0, 5.1, 5.4  OS: Any
Assigned to: Marko Mäkelä  CPU Architecture: Any
Tags: consistency, innodb, ISOLATION, repeatable read, transaction isolation

[16 Aug 2009 15:21] Philip Stoev
Description:
When executing a two-thread transactional workload with UPDATE and DELETE, InnoDB tables become temporarily inconsistent; that is, a SELECT against such a table returns values that are not logically consistent.

The test starts with a table of 100 rows with an index, all rows containing the same value. It then issues UPDATEs that set the *entire table* to some other value, and DELETEs that remove individual rows from the table.

Such queries should always leave the table in a consistent state, i.e. all values in the table should be identical. However, a SELECT statement would see two distinct values in the table, which should never happen.

The test runs in REPEATABLE READ mode and is deadlock- and error-free. All transactions are fully formed (i.e. each starts with START TRANSACTION and ends with COMMIT). Two threads are sufficient for the issue to show up. InnoDB in 5.0 through 5.4 is affected, but Falcon is not.

How to repeat:
A test case will be uploaded shortly.
[16 Aug 2009 15:31] Philip Stoev
YY file:

query:
        transaction;

transaction:
        START TRANSACTION ; query_list ; COMMIT ;

query_list:
        query_item |
        query_item ; query_list ;

query_item:
        select | dml ;

select:
        SELECT DISTINCT /* RESULTSET_ZERO_OR_ONE_ROWS _quid */ `int_key` FROM _table ;

dml:
        UPDATE _table SET `int_key` = _int_unsigned |
        DELETE FROM _table LIMIT 1 ;

ZZ file:

$tables = {
        rows => [100],
        primary_key => [ undef ]
};

$fields = {
        types => [ 'int' ],
        indexes => [ 'key' ]
};

$data = {
        numbers => [ '5' ]
}
[16 Aug 2009 15:34] Philip Stoev
To reproduce, pull the RQG from 

bzr+ssh://bazaar.launchpad.net/~pstoev/randgen/rqg2

and then run

$ perl runall.pl \
  --grammar=conf/bug46746.yy \
  --gendata=conf/bug46746.zz \
  --basedir=/build/bzr/mysql-5.1 \
  --validator=ResultsetProperties \
  --queries=10000 \
  --engine=Innodb \
  --mysqld=--innodb-lock-wait-timeout=1 \
  --mysqld=--transaction-isolation=REPEATABLE-READ \
  --threads=2

The test will abort as soon as a SELECT DISTINCT query returns more than one row, meaning that the table contains two different values, which should never happen given the queries in this particular test.

Note that each SELECT query is tagged so that you can fish for it in the query log. Each UPDATE also updates the entire table to some distinct value, so this can also be grepped in the logs.

If you require any additional information about this bug, please set status to "Need feedback". Thank you.
[18 Aug 2009 7:59] Philip Stoev
I am only observing this issue on indexed columns with LIMIT 1. Increasing the LIMIT makes the issue go away.
[20 Aug 2009 12:49] Heikki Tuuri
What is the table definition?

There are two threads, and they run queries like:

START TRANSACTION;
UPDATE t SET col = 123;
DELETE FROM t LIMIT 1;
SELECT DISTINCT col FROM t;
...
COMMIT;

InnoDB's consistent read has one documented, intentional, inconsistency: a transaction always sees its own updates. Could that explain the strange phenomenon? Probably not.

The bug may be due to InnoDB not checking the visibility of a row in the 'read view' of the SELECT. InnoDB should note that the max trx id stamped to a secondary index page is bigger than the 'read view' of the SELECT, and check the row and its visibility from the clustered index record. I recall Marko fixing a similar, very long-standing bug in InnoDB in June 2009.
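
Heikki's suggested check could be sketched roughly as follows. This is a conceptual model only, not InnoDB source code; the function and parameter names are invented for illustration:

```python
# Conceptual sketch only -- not InnoDB code. A secondary index record
# carries no transaction id of its own, so a consistent read can trust
# it as-is only if every transaction that ever modified the page is
# visible to the read view; otherwise the row's visibility must be
# checked from the clustered index record and its version chain.

def can_trust_sec_index_page(page_max_trx_id, oldest_invisible_trx_id):
    # page_max_trx_id: largest trx id stamped to the secondary index page
    # oldest_invisible_trx_id: smallest trx id the read view cannot see
    # (hypothetical name, for illustration)
    return page_max_trx_id < oldest_invisible_trx_id

# A view that sees every transaction below id 100:
assert can_trust_sec_index_page(90, 100)       # page safe to read as-is
assert not can_trust_sec_index_page(120, 100)  # must consult clustered index
```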
[20 Aug 2009 12:51] Heikki Tuuri
Marko, can you check if your fix is already in the tested versions?
[20 Aug 2009 13:35] Philip Stoev
Table structure:

CREATE TABLE `table100` (
 `int_key` int,
 key (`int_key` )
)  engine=innodb;
[24 Aug 2009 9:12] Marko Mäkelä
The table contains a secondary index. The bug that Heikki is probably referring to is connected to crash recovery, which I fixed in May 2009. InnoDB did not write changes to PAGE_MAX_TRX_ID in secondary index B-tree pages to the redo log. Because of this, after crash recovery, new transactions that are started before the background rollback of incomplete (pre-crash) transactions may have an inconsistent view of the secondary indexes.

As far as I know, that bug was fixed in the InnoDB Plugin 1.0.4 (r5038, 2009-05-19), but not (yet) in the built-in InnoDB in MySQL 5.0 or 5.1.

Philip, can you repeat the failure with InnoDB Plugin 1.0.4? It is possible that the built-in InnoDB does not preserve PAGE_MAX_TRX_ID when merging or splitting B-tree nodes; at least previous versions of the InnoDB Plugin did not.
[24 Aug 2009 9:38] Philip Stoev
The bug is still repeatable against the mysql-trunk tree, which contains the InnoDB Plugin compiled in as the default built-in storage engine.

Also, note that this test has nothing to do with recovery -- the server stays up throughout the test.
[24 Aug 2009 10:14] Marko Mäkelä
Thanks, Philip. I thought that the bug Heikki is likely referring to is connected to recovery and thus has nothing to do with this bug, and your test confirmed that. There was a remote chance that the recovery bug fix also fixed some recovery-unrelated handling of PAGE_MAX_TRX_ID that was broken in the built-in InnoDB. That is why I wanted to know if the test is repeatable in InnoDB Plugin 1.0.4 (or code based on it).
[27 Aug 2009 10:37] Marko Mäkelä
I can reproduce this in the development version of InnoDB Plugin. The bug does not reproduce on every run. Enabling UNIV_DEBUG seems to reduce reproducibility. I did not get anything printed to the error log. No assertion failed, not even after enabling all UNIV_*_DEBUG in univ.i.
[31 Aug 2009 6:52] Marko Mäkelä
I can reproduce this with the insert buffer, purge, and adaptive hash index disabled, in at most 20% of the runs. For the record, here are the commands that I used (excluding the disabling of those subsystems):

mysqld --innodb-lock-wait-timeout=1 --transaction-isolation=REPEATABLE-READ

perl gentest.pl \
--dsn dbi:mysql:host=127.0.0.1:port=3306:user=root:database=test \
--grammar=conf/bug46746.yy --gendata=conf/bug46746.zz \
--validator=ResultsetProperties \
--queries=10000 --engine=Innodb --threads=2
[31 Aug 2009 13:44] Heikki Tuuri
(16:40:30) Heikki: Is this a bug, after all...
(16:40:54) Heikki: Our SELECT may see rows that were deleted later...
(16:41:07) Heikki: It is a consistent read
(16:41:26) Heikki: And it will see rows that our transaction updated
(16:41:31) Heikki: Sorry, no bug
(16:41:46) marko: I let you decide. I attached the page dumps to the bug report, for posterity.
(16:42:11) Heikki: The reason: a consistent read SELECT sees a DIFFERENT set of rows than the locking SELECT in the UPDATE!
(16:42:25) Heikki: I should have realized this on Friday
(16:42:35) marko: right, it is obvious now :-) solution: use a locking SELECT :-)
(16:42:42) Heikki: Yes :)
(16:42:50) Heikki: No bug :)
[31 Aug 2009 13:48] Heikki Tuuri
I posted above: "InnoDB's consistent read has one documented, intentional, inconsistency: a transaction always sees its own updates. Could that explain the strange phenomenon? Probably not."

I was wrong. It CAN explain the strange phenomenon. The consistent-read SELECT of our transaction T can see a row that another transaction T2 later deleted and committed. T will ALSO see the rows that T itself updated. Thus, T may see two different values in the secondary index column!

A workaround: use a locking SELECT ... LOCK IN SHARE MODE. Then the SELECT is consistent, guaranteed.