Bug #89205 gap locks on READ COMMITTED cause by page split
Submitted: 12 Jan 2018 10:13 Modified: 3 May 2018 5:21
Reporter: Vlad Lesin Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: InnoDB storage engine Severity:S3 (Non-critical)
Version:5.7.20 OS:Any
Assigned to: CPU Architecture:Any
Tags: gap locks, innodb, page split, read committed

[12 Jan 2018 10:13] Vlad Lesin
Description:
Initially the bug was found on slave during replication. The bug caused "Lock wait timeout", as a result slave thread stopped. But the following test case does not use replication at all because it's not so important if the sequence of transactions and statements is executed by slave thread or master, what is important is the sequence itself and records layout inside of B+-tree. As slave and master can have different data layout inside of B+-tree, there can be a situation when the same sequence of transactions and statements does not cause the lock on master, but causes the lock on slaves, and vice-versa.

The isolation level is READ COMMITTED, what means there should not be gap locks except certain cases.

The sequence is the following:

1) Set S-lock on some record R1 (which can be caused by foreign checking, for example).

2) Cause page split for the page which contains R1. In the current test right-split is caused, what means that the new page is created to the right of the old page. R1 must be located so that during split process it is moved to the new right page at the first position, i.e., just after infinum.
See:
---------------
rec_t*
btr_page_split_and_insert(...) {
...
 } else if (btr_page_get_split_rec_to_right(cursor, &split_rec)) {
  direction = FSP_UP;
  hint_page_no = page_no + 1;

 } else if (btr_page_get_split_rec_to_left(cursor, &split_rec)) {
...
  left_block = block;
  right_block = new_block;

  if (!dict_table_is_locking_disabled(cursor->index->table)) {
   lock_update_split_right(right_block, left_block);
  }
....
}
----------------

3) If the above condition is true, then R1 S-lock will be copied to the supremum of left-half page, which will cause gap lock.
See
---------------
void
lock_update_split_right(...)
{
...
 /* Inherit the locks to the supremum of left page from the successor
 of the infimum on right page */

 lock_rec_inherit_to_gap(left_block, right_block,
    PAGE_HEAP_NO_SUPREMUM, heap_no);

...
}
---------------

There are several questions which have not yet been covered:

1) Is it possible to repeat the case with non-XA transactions? What is about the same but on slave? The XA transactions in the following test case were used to simplify work on the test, because the initial case used XA-transactions too.

2) This test shows only right-split case, the same issue might also be for left-split case, it needs to be checked.

See also https://bugs.launchpad.net/percona-server/+bug/1735555.

How to repeat:
Just copy the following test case in mysql-test/suite/innodb/t and run it with mtr.

=====================================
--source include/have_innodb.inc

--eval SET GLOBAL TRANSACTION ISOLATION LEVEL READ COMMITTED;

CREATE TABLE t1 (t1_pk DECIMAL(20,0) PRIMARY KEY , t1_blob BLOB) ENGINE=InnoDB;

--echo # Initial filling
INSERT INTO t1 VALUES (10000, REPEAT("a", 1082));
INSERT INTO t1 VALUES (20000, REPEAT("a", 1082));
INSERT INTO t1 VALUES (30000, REPEAT("a", 1082));
INSERT INTO t1 VALUES (40000, REPEAT("a", 1082));
INSERT INTO t1 VALUES (50000, REPEAT("a", 1082));
INSERT INTO t1 VALUES (60000, REPEAT("a", 1082));
INSERT INTO t1 VALUES (70000, REPEAT("a", 1082));
INSERT INTO t1 VALUES (80000, REPEAT("a", 1082));
INSERT INTO t1 VALUES (90000, REPEAT("a", 1082));
INSERT INTO t1 VALUES (100000, REPEAT("a", 1082));
INSERT INTO t1 VALUES (110000, REPEAT("a", 1082));
INSERT INTO t1 VALUES (120000, REPEAT("a", 1082));
INSERT INTO t1 VALUES (130000, REPEAT("a", 1082));
INSERT INTO t1 VALUES (140000, REPEAT("a", 1082));
INSERT INTO t1 VALUES (150000, REPEAT("a", 1082));
INSERT INTO t1 VALUES (160000, REPEAT("a", 1082));
INSERT INTO t1 VALUES (170000, REPEAT("a", 1082));
INSERT INTO t1 VALUES (180000, REPEAT("a", 1082));
INSERT INTO t1 VALUES (190000, REPEAT("a", 1082));
INSERT INTO t1 VALUES (200000, REPEAT("a", 1082));
INSERT INTO t1 VALUES (210000, REPEAT("a", 1082));
INSERT INTO t1 VALUES (220000, REPEAT("a", 1082));
INSERT INTO t1 VALUES (230000, REPEAT("a", 1082));
INSERT INTO t1 VALUES (240000, REPEAT("a", 1082));
INSERT INTO t1 VALUES (250000, REPEAT("a", 1082));
INSERT INTO t1 VALUES (260000, REPEAT("a", 1082));
INSERT INTO t1 VALUES (270000, REPEAT("a", 1082));
INSERT INTO t1 VALUES (280000, REPEAT("a", 1082));
INSERT INTO t1 VALUES (290000, REPEAT("a", 1082));
INSERT INTO t1 VALUES (300000, REPEAT("a", 1082));
INSERT INTO t1 VALUES (310000, REPEAT("a", 1082));
INSERT INTO t1 VALUES (320000, REPEAT("a", 1082));
INSERT INTO t1 VALUES (330000, REPEAT("a", 1082));
INSERT INTO t1 VALUES (340000, REPEAT("a", 1082));
INSERT INTO t1 VALUES (350000, REPEAT("a", 1082));
INSERT INTO t1 VALUES (360000, REPEAT("a", 1082));
INSERT INTO t1 VALUES (370000, REPEAT("a", 1082));
INSERT INTO t1 VALUES (380000, REPEAT("a", 1082));
INSERT INTO t1 VALUES (390000, REPEAT("a", 1082));
INSERT INTO t1 VALUES (400000, REPEAT("a", 1082));
INSERT INTO t1 VALUES (410000, REPEAT("a", 1082));
INSERT INTO t1 VALUES (420002, REPEAT("a", 1082));

--echo # Insert records somewhere in the middle to cause page split
INSERT INTO t1 VALUES (18100, REPEAT("a", 1082));
INSERT INTO t1 VALUES (18200, REPEAT("a", 1082));
INSERT INTO t1 VALUES (18300, REPEAT("a", 1082));
INSERT INTO t1 VALUES (18400, REPEAT("a", 1082));
INSERT INTO t1 VALUES (18500, REPEAT("a", 1082));
INSERT INTO t1 VALUES (18600, REPEAT("a", 1082));
INSERT INTO t1 VALUES (18700, REPEAT("a", 1082));
-- echo # The page is full now, the next insert will cause page split

--connect(con2,localhost,root)

CREATE TABLE t2 (t2_pk INT PRIMARY KEY, t1_pk DECIMAL(20,0),
  FOREIGN KEY (t1_pk) REFERENCES t1 (t1_pk)) ENGINE=InnoDB;

XA START '2';
--echo # Create S-lock on t1 record with pk = 30000, this record will be the
--echo # first record on the right-half page during future splitting,
--echo # pay attention, this is not a gap lock,
--echo # so this lock is permitted for READ COMMITTED
INSERT INTO t2 VALUES (1, 30000);
XA END '2';
XA PREPARE '2';

--connect (con1,localhost,root)

XA START '1';
--echo # Page right-split is here, the first record of new right-half page
--echo # is the record with pk = 30000, which was previosly S-locked,
--echo # the S-lock from this page is copied to the supremum of left-half page
--echo # causing gap-lock which is forbidden for READ COMMITED
INSERT INTO t1 VALUES (18800, REPEAT("a", 1082));
XA END '1';
XA PREPARE '1';

--connection default

XA START 'default';
--echo # Trying to insert something into the locked gap,
--echo # getting lock wait timeout here despite READ COMMITTED
INSERT INTO t1 VALUES (29900, REPEAT("a", 1082));
XA END 'default';
XA PREPARE 'default';
XA COMMIT 'default'

--connection con1
XA COMMIT '1';

--connection con2
XA COMMIT '2';

--connection con0
DROP TABLE t2;
DROP TABLE t1;
=============================

Suggested fix:
I'm not sure if my suggestion is correct because I don't fully understand the following logic:

=====================
static
void
lock_rec_inherit_to_gap()
{
...
		if (!lock_rec_get_insert_intention(lock)
		    && !((srv_locks_unsafe_for_binlog
			  || lock->trx->isolation_level
			  <= TRX_ISO_READ_COMMITTED)
			 && lock_get_mode(lock) ==
			 (lock->trx->duplicates ? LOCK_S : LOCK_X))) {
			lock_rec_add_to_queue(
				LOCK_REC | LOCK_GAP | lock_get_mode(lock),
				heir_block, heir_heap_no, lock->index,
				lock->trx, FALSE);
		}
...
}
===================

What was the intention for the following condition:

lock->trx->isolation_level <= TRX_ISO_READ_COMMITTED

Why not:

lock->trx->isolation_level < TRX_ISO_READ_COMMITTED
[12 Jan 2018 10:58] MySQL Verification Team
Hello Vlad,

Thank you for the report and test case.

Thanks,
Umesh
[18 Jan 2018 20:14] Sveta Smirnova
This bug is very serious when repeated in replication setups. If page split does not happen on master, but happens on slave it enters locking state permanently. And since several XA transactions involved recovering from this situation is not that easy as skipping single binary log position or GTID. In our case this leads to full slave rebuild.
[1 May 2018 19:56] Sveta Smirnova
I cannot repeat described behavior with version 5.7.22
[3 May 2018 5:21] MySQL Verification Team
Thank you, Sveta.
Confirmed internally from Deb that this issue has been fixed by post Bug #27189701.

## Per change log:

Fixed as of the upcoming 5.7.22, 8.0.4 release, and here's the changelog
entry:

Concurrent XA transactions that ran successfully to the XA prepare stage
on the master conflicted when replayed on the slave, resulting in a lock
wait timeout in the applier thread. The conflict was due to the GAP lock
range which differed when the transactions were replayed serially on the
slave. To prevent this type of conflict, GAP locks taken by XA
transactions in READ COMMITTED isolation level are now released (and no
longer inherited) when XA transactions reach the prepare stage.

Note: Bug 25866046 is referenced in the changelog entry.