Bug #27197 Phantom reads occuring under innodb serializable isolation level
Submitted: 16 Mar 2007 3:12 Modified: 25 Oct 2007 1:32
Reporter: Ben Vandiver Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: InnoDB storage engine Severity:S2 (Serious)
Version:5.1.16, 4.1.22-max OS:Linux (Linux)
Assigned to: Vasil Dimov CPU Architecture:Any

[16 Mar 2007 3:12] Ben Vandiver
Description:
I'm seeing phantom reads appear when running in serializable mode.

Details:
  InnoDB tables
  connecting with JDBC (mysql-connector-java-3.1.6-bin.jar)
  reproduced under 4.1.22-max and 5.1.16-beta
  pool of threads, some doing inserts, others doing selects
  happens intermittently

CREATE TABLE order_table_backup (  o_id integer not null,  o_d_id integer not null,  o_w_id integer not null,  o_c_id integer,  o_entry_d timestamp,  o_carrier_id integer,  o_ol_cnt integer,  o_all_local integer,  primary key (o_w_id, o_d_id, o_id)) Engine=InnoDB

Insert:
"insert into order_table (o_id,o_d_id,o_w_id,o_c_id,o_entry_d,o_ol_cnt,o_all_local) values ("+o+","+d+","+w+","+id+",'"+new Timestamp(System.currentTimeMillis())+"',"+0+","+0+")"

Query:
"select o_id,o_c_id from order_table where o_d_id="+d+" and "+
				      "o_w_id="+w+" order by o_id desc"

When you run the query twice in the same transaction (autocommit off), the second query sometimes returns an extra row.

How to repeat:
see attached java file for details

amusingly enough, the phantoms do not show up under REPEATABLE_READ.
[16 Mar 2007 3:13] Ben Vandiver
Java program to exhibit bug

Attachment: BugTest.java (text/java), 4.56 KiB.

[16 Mar 2007 3:51] Ben Vandiver
Also worth noting:

Tests were run on a dual cpu Xeon with Hyperthreading.

When you run the test against SQLServer, it does not report any phantoms under serializable isolation.
[16 Mar 2007 14:54] Heikki Tuuri
Ben,

please use:

http://dev.mysql.com/doc/refman/5.0/en/innodb-transaction-isolation.html

"
You can determine the global and session transaction isolation levels by checking the value of the tx_isolation system variable with these statements: 

SELECT @@global.tx_isolation;
SELECT @@tx_isolation;
"

to determine if your connections really are running on the SERIALIZABLE level.

Please use innodb_lock_monitor to determine that InnoDB really sets S-locks on each record it SELECTs when the connection has the SERIALIZABLE level:

http://dev.mysql.com/doc/refman/5.0/en/innodb-monitor.html

Are you sure that you do not receive a deadlock error during your transactions? A deadlock causes InnoDB to roll back the current transaction and start a new one. That could explain why you apparently see phantom rows in subsequent SELECTs. That could also explain why you do not see phantoms with REPEATABLE READ: since SELECTs do not set locks on that level, deadlocks do not happen as easily.

Regards,

Heikki
[16 Mar 2007 15:43] Ben Vandiver
The java program I included checks the isolation level on start and once again every 100 transactions to ensure that it hasn't changed.  I was using "show variables like 'tx_isolation'", but when I adjust it to use @@tx_isolation and @@global.tx_isolation, it reports the same thing (serializable).

innodb_lock_monitor shows that it is acquiring shared locks.

A deadlock would have resulted in an exception being thrown and the second query not executing at all, thus no miscompares.
[16 Mar 2007 16:04] Heikki Tuuri
Ben,

please check that really AUTOCOMMIT=0 in your connections.

Please post some examples of l1 and l2 differing.

If the bug is real, then InnoDB's next-key locking is leaking in l1, and it is allowing someone to insert another row where o_d_id = ... and o_w_id = ...

Do you see INSERTs sometimes waiting for 10 seconds for the SELECT query to commit?

Regards,

Heikki
[16 Mar 2007 16:40] Ben Vandiver
I added a select @@autocommit to be run periodically, and it starts and remains 0.

Some example mismatches:
[457, 418, 236, 10]
[500, 457, 418, 236, 10]

[893, 420, 156, 106]
[1005, 893, 420, 156, 106]

..they all follow this pattern (one extra, presumably just inserted, tuple).  The mismatch gets more frequent as the database grows in size.

Inserts do sometimes take more than 10sec to execute, as observed from the java side.  This could be inaccurate due to threading issues/etc, and I don't know how to extract this information from mysql directly.
[16 Mar 2007 16:48] Heikki Tuuri
Ben,

hmm... there could be a bug in InnoDB's next key locking in ORDER BY ... DESC!

I am trying to repeat this with a Perl test program.

Do you get EXPLAIN SELECT like below?

mysql> explain select * from order_table_backup where o_w_id = 5 and o_d_id = 237 order by o_id desc;
+----+-------------+--------------------+------+---------------+---------+---------+-------------+------+-------------+
| id | select_type | table              | type | possible_keys | key     | key_len | ref         | rows | Extra       |
+----+-------------+--------------------+------+---------------+---------+---------+-------------+------+-------------+
|  1 | SIMPLE      | order_table_backup | ref  | PRIMARY       | PRIMARY | 8       | const,const |   94 | Using where |
+----+-------------+--------------------+------+---------------+---------+---------+-------------+------+-------------+
1 row in set (0.00 sec)

mysql>

Regards,

Heikki
[16 Mar 2007 16:53] Heikki Tuuri
Ben,

ok, I am able to repeat this.

I will find the bug in 15 minutes...

Thank you,

Heikki
[16 Mar 2007 17:28] Heikki Tuuri
Hi!

The bug indeed is that InnoDB positions the index cursor at the biggest o_id for the corresponding o_w_id and o_d_id pair, and moves that cursor down. InnoDB forgets to set the 'gap' lock on the successor record!

In MySQL-3.23 this bug did not surface, because it never moved a cursor downwards. That is how the bug managed to slip into >= MySQL-4.0.

A possible fix is in row0sel.c (5.1) to add after:

        } else if (dtuple_get_n_fields(search_tuple) > 0) {

                btr_pcur_open_with_no_init(index, search_tuple, mode,
                                           BTR_SEARCH_LEAF,
                                           pcur, 0, &mtr);

                pcur->trx_if_known = trx;

the new code:
                rec = btr_pcur_get_rec(pcur);

                if (!moves_up
                    && !page_rec_is_supremum(rec)
                    && set_also_gap_locks
                    && !(srv_locks_unsafe_for_binlog
                         || trx->isolation_level == TRX_ISO_READ_COMMITTED)
                    && prebuilt->select_lock_type != LOCK_NONE) {

                        /* Try to place a gap lock on the index record
                        to prevent phantoms in ORDER BY ... DESC queries */

                        offsets = rec_get_offsets(rec, index, offsets,
                                                  ULINT_UNDEFINED, &heap);
                        err = sel_set_rec_lock(rec, index, offsets,
                                               prebuilt->select_lock_type,
                                               LOCK_GAP, thr);

                        if (err != DB_SUCCESS) {

                                goto lock_wait_or_error;
                        }
                }

Assigning this to Vasil. The bug should be fixed in 5.0 and 5.1. I think this is not critical enough to be fixed in 4.1.

Thank you,

Heikki
[16 Mar 2007 17:28] Heikki Tuuri
Hi!

The bug indeed is that InnoDB positions the index cursor at the biggest o_id for the corresponding o_w_id and o_d_id pair, and moves that cursor down. InnoDB forgets to set the 'gap' lock on the successor record!

In MySQL-3.23 this bug did not surface, because it never moved a cursor downwards. That is how the bug managed to slip into >= MySQL-4.0.

A possible fix is in row0sel.c (5.1) to add after:

        } else if (dtuple_get_n_fields(search_tuple) > 0) {

                btr_pcur_open_with_no_init(index, search_tuple, mode,
                                           BTR_SEARCH_LEAF,
                                           pcur, 0, &mtr);

                pcur->trx_if_known = trx;

the new code:
                rec = btr_pcur_get_rec(pcur);

                if (!moves_up
                    && !page_rec_is_supremum(rec)
                    && set_also_gap_locks
                    && !(srv_locks_unsafe_for_binlog
                         || trx->isolation_level == TRX_ISO_READ_COMMITTED)
                    && prebuilt->select_lock_type != LOCK_NONE) {

                        /* Try to place a gap lock on the index record
                        to prevent phantoms in ORDER BY ... DESC queries */

                        offsets = rec_get_offsets(rec, index, offsets,
                                                  ULINT_UNDEFINED, &heap);
                        err = sel_set_rec_lock(rec, index, offsets,
                                               prebuilt->select_lock_type,
                                               LOCK_GAP, thr);

                        if (err != DB_SUCCESS) {

                                goto lock_wait_or_error;
                        }
                }

Assigning this to Vasil. The bug should be fixed in 5.0 and 5.1. I think this is not critical enough to be fixed in 4.1.

Thank you,

Heikki
[16 Mar 2007 17:37] Heikki Tuuri
Oops, I forgot to add the page_rec_get_next()!

The new code should be:

                rec = btr_pcur_get_rec(pcur);

                if (!moves_up
                    && !page_rec_is_supremum(rec)
                    && set_also_gap_locks
                    && !(srv_locks_unsafe_for_binlog
                         || trx->isolation_level == TRX_ISO_READ_COMMITTED)
                    && prebuilt->select_lock_type != LOCK_NONE) {

                        /* Try to place a gap lock on the next index record
                        to prevent phantoms in ORDER BY ... DESC queries */

                        offsets = rec_get_offsets(page_rec_get_next(rec),
                                                  index, offsets,
                                                  ULINT_UNDEFINED, &heap);
                        err = sel_set_rec_lock(page_rec_get_next(rec),
                                               index, offsets,
                                               prebuilt->select_lock_type,
                                               LOCK_GAP, thr);

                        if (err != DB_SUCCESS) {

                                goto lock_wait_or_error;
                        }
                }
[17 Apr 2007 13:42] Heikki Tuuri
Vasil,

to repeat the bug, please create a table with a two column primary key (a, b). Populate the table with some test data where you have several rows with a = 10, b = 5, b =6, ... Then do:

SET AUTOCOMMIT=0;
SELECT * FROM t WHERE a = 10 ORDER BY b DESC FOR UPDATE;
...

Now, you are able to insert a row (10, 4, ...) in another connection, though the first connection has locked all rows where a = 10. This is the bug!

Regards,

Heikki
[25 Oct 2007 1:32] Paul DuBois
Noted in 5.0.46, 5.1.21 changelogs.

Phantom reads could occur under InnoDB serializable isolation level.
[22 Sep 2010 11:04] Marko Mäkelä
This patch introduced Bug #56716.