Bug #68021 Unexplainable InnoDB unique index locks on DELETE + INSERT with same values
Submitted: 3 Jan 2013 16:49 Modified: 28 Sep 2014 10:48
Reporter: Ovais Tariq Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: InnoDB storage engine Severity:S2 (Serious)
Version:5.5.29, 5.5.30, 5.6, 5.7.1 OS:Any
Assigned to: CPU Architecture:Any

[3 Jan 2013 16:49] Ovais Tariq
Description:
MySQL server version running is 5.5.29 on CentOS 6.2.

Suppose there is a table with a composite unique key. 

Now when a record is deleted and a subsequent insert is made within the same transaction that contains the same value for the unique key columns that were deleted, then InnoDB places gap lock on the gap after the unique key value and the gap before the unique key value that is inserted by the subsequent insert. This behaviour remains even though I change transaction-isolation=READ-COMMITTED and innodb_locks_unsafe_for_binlog=1. 

However, when a record is deleted and a subsequent insert within the same transaction inserts values for unique key columns that are different from the ones deleted then no excessive locking is performed. This is the behaviour that one expects for the above case as well at least when transaction-isolation=READ-COMMITTED or innodb_locks_unsafe_for_binlog=1.

Relevant my.cnf variables that were used:
log-bin
binlog_format=ROW
transaction-isolation=READ-COMMITTED
innodb_locks_unsafe_for_binlog=1

How to repeat:
-- Prepare test data
CREATE TABLE `ti` (
  `session_ref_id` bigint(16) NOT NULL AUTO_INCREMENT,
  `customer_id` bigint(16) DEFAULT NULL,
  `client_id` int(2) DEFAULT '7',
  `app_id` smallint(2) DEFAULT NULL,
  PRIMARY KEY (`session_ref_id`),
  UNIQUE KEY `uk1` (`customer_id`,`client_id`,`app_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

INSERT INTO ti (session_ref_id, customer_id, client_id, app_id) VALUES (4000, 8000, 10, 5);
INSERT INTO ti (session_ref_id, customer_id, client_id, app_id) VALUES (4090, 9000, 10, 5);
INSERT INTO ti (session_ref_id, customer_id, client_id, app_id) VALUES (6000, 10000, 10, 5);
INSERT INTO ti (session_ref_id, customer_id, client_id, app_id) VALUES (7000, 14000, 10, 5);

---- DELETE a row with value of unique key `uk` (9000, 10, 5) and INSERT the same row in the same transaction with same value for unique key `uk` columns

-- session 1
session1 > start transaction;
Query OK, 0 rows affected (0.00 sec)

session1 > DELETE FROM ti WHERE session_ref_id = 4090;
Query OK, 1 row affected (0.00 sec)

session1 > INSERT INTO ti (session_ref_id, customer_id, client_id, app_id) VALUES (5000, 9000, 10, 5);
Query OK, 1 row affected (0.00 sec)

-- session 2
session2 > set innodb_lock_wait_timeout=1;
Query OK, 0 rows affected (0.00 sec)

session2 > start transaction;
Query OK, 0 rows affected (0.00 sec)

session2 > INSERT INTO ti (session_ref_id, customer_id, client_id, app_id) VALUES (NULL, 8001, 10, 5);
ERROR 1205 (HY000): Lock wait timeout exceeded; try restarting transaction

session2 > INSERT INTO ti (session_ref_id, customer_id, client_id, app_id) VALUES (NULL, 7999, 10, 5);
Query OK, 1 row affected (0.00 sec)

session2 > INSERT INTO ti (session_ref_id, customer_id, client_id, app_id) VALUES (NULL, 9001, 10, 5);
ERROR 1205 (HY000): Lock wait timeout exceeded; try restarting transaction

session2 > INSERT INTO ti (session_ref_id, customer_id, client_id, app_id) VALUES (NULL, 9999, 10, 5);
ERROR 1205 (HY000): Lock wait timeout exceeded; try restarting transaction

session2 > INSERT INTO ti (session_ref_id, customer_id, client_id, app_id) VALUES (NULL, 10001, 10, 5);
Query OK, 1 row affected (0.00 sec)

You can see that the whole range from customer_id=8000 to customer_id=10000 is locked, this range encapsulates the value customer_id=9000 which is inserted above.

---- DELETE a row with value of unique key `uk` (9000, 10, 5) and INSERT a row in the same transaction with different value for unique key `uk` columns

-- session 1
session1 > start transaction;
Query OK, 0 rows affected (0.00 sec)

session1 > DELETE FROM ti WHERE session_ref_id = 4090;
Query OK, 1 row affected (0.00 sec)

session1 > INSERT INTO ti (session_ref_id, customer_id, client_id, app_id) VALUES (5000, 8999, 10, 5);
Query OK, 1 row affected (0.00 sec)

-- session 2
session2 > set innodb_lock_wait_timeout=1;
Query OK, 0 rows affected (0.00 sec)

session2 > start transaction;
Query OK, 0 rows affected (0.00 sec)

session2 > INSERT INTO ti (session_ref_id, customer_id, client_id, app_id) VALUES (NULL, 8001, 10, 5);
Query OK, 1 row affected (0.00 sec)

session2 > INSERT INTO ti (session_ref_id, customer_id, client_id, app_id) VALUES (NULL, 7999, 10, 5);
Query OK, 1 row affected (0.00 sec)

session2 > INSERT INTO ti (session_ref_id, customer_id, client_id, app_id) VALUES (NULL, 9001, 10, 5);
Query OK, 1 row affected (0.00 sec)

session2 > INSERT INTO ti (session_ref_id, customer_id, client_id, app_id) VALUES (NULL, 9999, 10, 5);
Query OK, 1 row affected (0.00 sec)

session2 > INSERT INTO ti (session_ref_id, customer_id, client_id, app_id) VALUES (NULL, 10001, 10, 5);
Query OK, 1 row affected (0.00 sec)

You can see that no gap-locking is performed.
[3 Jan 2013 17:15] MySQL Verification Team
reminded me of http://bugs.mysql.com/bug.php?id=19762
[4 Jan 2013 10:55] Ovais Tariq
If I change the test case a bit, and in session 1 instead of DELTE+INSERT I execute the following:
UPDATE ti SET session_ref_id=5000 WHERE session_ref_id=4090;

Then the gap locking is still repeatable, as in the original test case.
[4 Jan 2013 20:38] Justin Swanhart
In session 1 while session 2 is waiting for the lock during the INSERT:

mysql> select * from information_schema.innodb_locks\G
*************************** 1. row ***************************
    lock_id: 2AF1:7182:4:3
lock_trx_id: 2AF1
  lock_mode: X,GAP
  lock_type: RECORD
 lock_table: `test`.`ti`
 lock_index: `uk1`
 lock_space: 7182
  lock_page: 4
   lock_rec: 3
  lock_data: 9000, 10, 5
*************************** 2. row ***************************
    lock_id: 2AF0:7182:4:3
lock_trx_id: 2AF0
  lock_mode: S
  lock_type: RECORD
 lock_table: `test`.`ti`
 lock_index: `uk1`
 lock_space: 7182
  lock_page: 4
   lock_rec: 3
  lock_data: 9000, 10, 5
2 rows in set (0.00 sec)

mysql> select * from information_schema.innodb_lock_waits\G
*************************** 1. row ***************************
requesting_trx_id: 2AF1
requested_lock_id: 2AF1:7182:4:3
  blocking_trx_id: 2AF0
 blocking_lock_id: 2AF0:7182:4:3
1 row in set (0.00 sec)
[4 Jan 2013 20:41] Justin Swanhart
mysql> select * from information_schema.innodb_trx\G
*************************** 1. row ***************************
                    trx_id: 2AF1
                 trx_state: LOCK WAIT
               trx_started: 2013-01-04 20:37:14
     trx_requested_lock_id: 2AF1:7182:4:3
          trx_wait_started: 2013-01-04 20:37:14
                trx_weight: 3
       trx_mysql_thread_id: 330
                 trx_query: INSERT INTO ti (session_ref_id, customer_id, client_id, app_id) VALUES (NULL, 8001, 10, 5)
       trx_operation_state: inserting
         trx_tables_in_use: 1
         trx_tables_locked: 1
          trx_lock_structs: 2
     trx_lock_memory_bytes: 376
           trx_rows_locked: 1
         trx_rows_modified: 1
   trx_concurrency_tickets: 0
       trx_isolation_level: REPEATABLE READ
         trx_unique_checks: 1
    trx_foreign_key_checks: 1
trx_last_foreign_key_error: NULL
 trx_adaptive_hash_latched: 0
 trx_adaptive_hash_timeout: 10000
*************************** 2. row ***************************
                    trx_id: 2AF0
                 trx_state: RUNNING
               trx_started: 2013-01-04 20:35:46
     trx_requested_lock_id: NULL
          trx_wait_started: NULL
                trx_weight: 7
       trx_mysql_thread_id: 329
                 trx_query: select * from information_schema.innodb_trx
       trx_operation_state: NULL
         trx_tables_in_use: 0
         trx_tables_locked: 0
          trx_lock_structs: 5
     trx_lock_memory_bytes: 1248
           trx_rows_locked: 5
         trx_rows_modified: 2
   trx_concurrency_tickets: 0
       trx_isolation_level: REPEATABLE READ
         trx_unique_checks: 1
    trx_foreign_key_checks: 1
trx_last_foreign_key_error: NULL
 trx_adaptive_hash_latched: 0
 trx_adaptive_hash_timeout: 10000
2 rows in set (0.00 sec)
[8 Jan 2013 9:55] Ovais Tariq
Verbose locking information for the transaction in session 1 holding locks:

---TRANSACTION F23, ACTIVE 25 sec
5 lock struct(s), heap size 1248, 5 row lock(s), undo log entries 2
MySQL thread id 25, OS thread handle 0x7f2c785f9700, query id 230 localhost msandbox
TABLE LOCK table `test`.`ti` trx id F23 lock mode IX
RECORD LOCKS space id 0 page no 338 n bits 72 index `PRIMARY` of table `test`.`ti` trx id F23 lock_mode X locks rec but not gap
Record lock, heap no 3 PHYSICAL RECORD: n_fields 6; compact format; info bits 32
 0: len 8; hex 8000000000000ffa; asc         ;;
 1: len 6; hex 000000000f23; asc      #;;
 2: len 7; hex 190000015a0110; asc     Z  ;;
 3: len 8; hex 8000000000002328; asc       #(;;
 4: len 4; hex 8000000a; asc     ;;
 5: len 2; hex 8005; asc   ;;

locked record from PK (record lock, no gap locking, X lock):
session_ref_id: 		4090
DB_TRX_ID (internal field): 	F23
DB_ROLL_PTR (internal field): 	0x190000015a0110
customer_id:			9000
client_id:			10
app_id:				5

RECORD LOCKS space id 0 page no 339 n bits 72 index `uk1` of table `test`.`ti` trx id F23 lock_mode X locks rec but not gap
Record lock, heap no 3 PHYSICAL RECORD: n_fields 4; compact format; info bits 32
 0: len 8; hex 8000000000002328; asc       #(;;
 1: len 4; hex 8000000a; asc     ;;
 2: len 2; hex 8005; asc   ;;
 3: len 8; hex 8000000000000ffa; asc         ;;

locked record from UK `uk1` (record lock, no gap locking):
customer_id:	9000
client_id:	10
app_id:		5
session_ref_id: 4090

RECORD LOCKS space id 0 page no 339 n bits 72 index `uk1` of table `test`.`ti` trx id F23 lock mode S
Record lock, heap no 3 PHYSICAL RECORD: n_fields 4; compact format; info bits 32
 0: len 8; hex 8000000000002328; asc       #(;;
 1: len 4; hex 8000000a; asc     ;;
 2: len 2; hex 8005; asc   ;;
 3: len 8; hex 8000000000000ffa; asc         ;;

locked record from UK `uk1` (S lock):
customer_id:	9000
client_id:	10
app_id:		5
session_ref_id: 4090

Record lock, heap no 4 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
 0: len 8; hex 8000000000002710; asc       ' ;;
 1: len 4; hex 8000000a; asc     ;;
 2: len 2; hex 8005; asc   ;;
 3: len 8; hex 8000000000001770; asc        p;;

locked record from UK `uk1`: <-- interestingly no information is printed about whether this is S/X lock and whether the gap_lock bit is set, based on my tests this is also of type 'locks gap before record'
customer_id:	10000
client_id:	10
app_id:		5
session_ref_id: 6000

RECORD LOCKS space id 0 page no 339 n bits 72 index `uk1` of table `test`.`ti` trx id F23 lock mode S locks gap before rec
Record lock, heap no 6 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
 0: len 8; hex 8000000000002328; asc       #(;;
 1: len 4; hex 8000000a; asc     ;;
 2: len 2; hex 8005; asc   ;;
 3: len 8; hex 8000000000001388; asc         ;;

locked record from UK `uk1` (locks gap before record, S lock): <-- this one locks the gap between records (8000, 10, 5) and (9000, 10, 5)
customer_id:	9000
client_id:	10
app_id:		5
session_ref_id: 5000
[8 Jan 2013 12:02] Ovais Tariq
If we modify the test case such that we modify the last record in the table (7000, 14000, 10, 5); then no index record that is bigger than (14000, 10, 5) is allowed to be inserted, because the gap between (14000, 10, 5) and SUPREMUM is locked.

-- session 1
session1 > select * from ti;
+----------------+-------------+-----------+--------+
| session_ref_id | customer_id | client_id | app_id |
+----------------+-------------+-----------+--------+
|           4000 |        8000 |        10 |      5 |
|           4090 |        9000 |        10 |      5 |
|           6000 |       10000 |        10 |      5 |
|           7000 |       14000 |        10 |      5 |
+----------------+-------------+-----------+--------+
4 rows in set (0.00 sec)

session1 > start transaction;
Query OK, 0 rows affected (0.00 sec)

session1 > DELETE FROM ti WHERE session_ref_id = 7000;
Query OK, 1 row affected (0.00 sec)

session1 > INSERT INTO ti (session_ref_id, customer_id, client_id, app_id) VALUES (8000, 14000, 10, 5);
Query OK, 1 row affected (0.00 sec)

-- session 2
session2 > start transaction;
Query OK, 0 rows affected (0.00 sec)

session2 > set innodb_lock_wait_timeout=1;
Query OK, 0 rows affected (0.00 sec)

session2 > INSERT INTO ti (session_ref_id, customer_id, client_id, app_id) VALUES (NULL, 13999, 10, 5);
ERROR 1205 (HY000): Lock wait timeout exceeded; try restarting transaction

session2 > INSERT INTO ti (session_ref_id, customer_id, client_id, app_id) VALUES (NULL, 9999, 10, 5);
Query OK, 1 row affected (0.00 sec)

session2 > INSERT INTO ti (session_ref_id, customer_id, client_id, app_id) VALUES (NULL, 14001, 10, 5);
ERROR 1205 (HY000): Lock wait timeout exceeded; try restarting transaction

session2 > INSERT INTO ti (session_ref_id, customer_id, client_id, app_id) VALUES (NULL, 140001, 10, 5);
ERROR 1205 (HY000): Lock wait timeout exceeded; try restarting transaction

session2 > INSERT INTO ti (session_ref_id, customer_id, client_id, app_id) VALUES (NULL, 1400001, 10, 5);
ERROR 1205 (HY000): Lock wait timeout exceeded; try restarting transaction

session2 > INSERT INTO ti (session_ref_id, customer_id, client_id, app_id) VALUES (NULL, 14000001, 10, 5);
ERROR 1205 (HY000): Lock wait timeout exceeded; try restarting transaction

-- Following are the locks held on the unique index `uk1`:
RECORD LOCKS space id 0 page no 339 n bits 72 index `uk1` of table `test`.`ti` trx id F29 lock_mode X locks rec but not gap
Record lock, heap no 5 PHYSICAL RECORD: n_fields 4; compact format; info bits 32
 0: len 8; hex 80000000000036b0; asc       6 ;;
 1: len 4; hex 8000000a; asc     ;;
 2: len 2; hex 8005; asc   ;;
 3: len 8; hex 8000000000001b58; asc        X;;

customer_id:	14000
client_id:	10
app_id:		5
session_ref_id: 7000

RECORD LOCKS space id 0 page no 339 n bits 72 index `uk1` of table `test`.`ti` trx id F29 lock mode S
Record lock, heap no 1 PHYSICAL RECORD: n_fields 1; compact format; info bits 0
 0: len 8; hex 73757072656d756d; asc supremum;;

supremum record of the UK `uk1` is locked

Record lock, heap no 5 PHYSICAL RECORD: n_fields 4; compact format; info bits 32
 0: len 8; hex 80000000000036b0; asc       6 ;;
 1: len 4; hex 8000000a; asc     ;;
 2: len 2; hex 8005; asc   ;;
 3: len 8; hex 8000000000001b58; asc        X;;

customer_id:	14000
client_id:	10
app_id:		5
session_ref_id: 7000

RECORD LOCKS space id 0 page no 339 n bits 72 index `uk1` of table `test`.`ti` trx id F29 lock mode S locks gap before rec
Record lock, heap no 6 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
 0: len 8; hex 80000000000036b0; asc       6 ;;
 1: len 4; hex 8000000a; asc     ;;
 2: len 2; hex 8005; asc   ;;
 3: len 8; hex 8000000000001f40; asc        @;;

customer_id:	14000
client_id:	10
app_id:		5
session_ref_id: 8000
[12 Jan 2013 14:05] Sveta Smirnova
Thank you for the report.

Verified as described. Still can be duplicate of bug #19762 But I think we should fix it finally, because no allowing 2000 IDs like in this case is too much for users to work around.
[12 Jan 2013 14:21] Ovais Tariq
Hi Sveta,

Please take a look at my last comment before yours, it shows that under certain conditions, the gap between the record and the supremum record is locked which means no further inserts to the table in question are possible. This just increases the criticality of the bug :)
[14 Jan 2013 16:12] MySQL Verification Team
Ovais,

Every single insert at the end of some table will produce a new insert intention lock at the very end of the table.

This is expected behavior, which will not be changed with a fix for this bug.
[14 Jan 2013 16:25] MySQL Verification Team
Ovais,

I have grasped full meaning of your bug. It appears only for secondary index when insert intention lock is at the end of the table.

I am not sure how easy is this to fix, but we shall give it a try.
[15 Jan 2013 10:37] Ovais Tariq
Hi Sinisa,

Indeed transactions do acquire IX locks before writing to a table, but two IX locks are supposed to be compatible as is also shown here http://dev.mysql.com/doc/refman/5.5/en/innodb-lock-modes.html. That is, if I insert a row into a table and the transaction does not commit, then that insert would never block another insert into the same table (provided the second insert is inserting a different row). 

So I do not think what I am seeing is the expected behaviour. This can easily be checked by doing the following using the table from this bug report (note that the inserts below are at the end of the table):

session1 > start transaction;
Query OK, 0 rows affected (0.00 sec)

session 1 > insert into ti values(8000, 15000, 10, 5);
Query OK, 1 row affected (0.00 sec)

session2 > start transaction;
Query OK, 0 rows affected (0.00 sec)

session2 > insert into ti values(9000, 16000, 10, 5);
Query OK, 1 row affected (0.00 sec)

The locks held by both the transactions can be seen as follows:
LIST OF TRANSACTIONS FOR EACH SESSION:
---TRANSACTION 1103, ACTIVE 9 sec
1 lock struct(s), heap size 376, 0 row lock(s), undo log entries 1
MySQL thread id 2, OS thread handle 0x7fe0a4949700, query id 14 localhost msandbox
show engine innodb status
TABLE LOCK table `test`.`ti` trx id 1103 lock mode IX
---TRANSACTION 1102, ACTIVE 68 sec
1 lock struct(s), heap size 376, 0 row lock(s), undo log entries 1
MySQL thread id 1, OS thread handle 0x7fe0a498a700, query id 10 localhost msandbox
TABLE LOCK table `test`.`ti` trx id 1102 lock mode IX

As you can see both the transactions are holding an IX lock, but that did not block each other.
[15 Jan 2013 10:41] Ovais Tariq
Hi Sinisa,

I am not entirely sure that this bug is a consequence of insert intention locks. If you go through my bug report and the locks reported by the transactions you will see that InnoDB does not show an insert intention locks being held.
[19 Feb 2013 18:08] MySQL Verification Team
Ovais,

This bug is verified and will be worked upon in not too distant future. Sorry, I can not give you more precise info, as it would violate company's policy. I like that policy, very much, because in a distant past, many promises have been made, which were not kept.
[20 Feb 2013 6:24] Ovais Tariq
Hi Sinisa,

Thanks, that's good enough.
[14 Mar 2013 19:09] Bugs System
Added a changelog entry for 5.5.32, 5.6.12, 5.7.2.

When a transaction is in READ COMMITTED isolation level, gap locks are
still taken in the secondary index when a row is inserted. This occurs
when the secondary index is scanned for duplicates. The function
"row_ins_scan_sec_index_for_duplicate()" always calls the function
"row_ins_set_shared_rec_lock()" with "LOCK_ORDINARY" irrespective of the
transaction isolation level. This fix modifies the
"row_ins_scan_sec_index_for_duplicate()" function to call
"row_ins_set_shared_rec_lock()" with "LOCK_ORDINARY" or
"LOCK_REC_NOT_GAP", based on the transaction isolation level.
[1 Jul 2014 21:05] Inaam Rana
We think this fix needs to be revisited. We have enough information to suspect that this fix has somehow broken the locking code inside InnoDB. It is very hard to come up with a repeatable test case but here is what we have seen in our production systems:
* Have isolation level READ-COMMITTED
* Have a table with unique sec. index
* Have autocommit on
* Delete a row from the table
* In two different threads try to insert a row with same value for unique sec. index column. The insertion should happen very closely followed by the deletion and both insertions should happen concurrently.
* In the cases that we have seen one of the insertion is performed by the deleting thread though I don't think it makes a difference because of autocommit.
* What happens is that both insertions succeed. Thus we get a unique key violation silently. We were able to detect the error when the replicating slave tries to apply both DMLs and spits out the error.

It is some kind of race that happens very rarely but consistently enough to show that there is a bug. We have tried:
* Reverting this patch on some of the clusters
* Moving to REPEATABLE-READ in some other clusters.
We haven't seen this error on any of the cluster with either of the fix mentioned above for over a month.

I have tried to stare at the code but can't find anything obvious. I suggest we reopen this bug and request InnoDB experts to take another look at the patch.
[2 Jul 2014 5:29] Sunny Bains
Opening the bug based on Inaam's request.
[2 Jul 2014 7:27] Annamalai Gurusami
The problem reported by Inaam will be tracked by bug#73170.
[2 Jul 2014 7:28] Annamalai Gurusami
Inaam, Thanks for bug report.
[28 Sep 2014 2:49] zhai weixiang
should this bug be reopened ?
[28 Sep 2014 10:48] Ovais Tariq
Since the fix has been reverted so I think the bug should be reopened
[4 Dec 2014 19:12] Sveta Smirnova
Issue, reported by Inaam handled in bug#73170. No need to reopen this one.
[6 Jul 2015 15:13] Mannoj Kumar
Is this fixed? I'm facing similar issue in 5.6.21 .
[12 May 2020 14:45] Niklas Deutschmann
setup

Attachment: setup.sql (application/octet-stream, text), 140 bytes.

[12 May 2020 14:50] Niklas Deutschmann
I can reproduce the problem even with 5.7.26 and 8.0.19 (transaction_isolation=READ-COMMITTED)

Run setup.sql and tx1.sql in one session, tx2.sql in the other session.

setup.sql
=========
drop table foo; 
create table foo (foo_id int auto_increment primary key, bar_id int);
create unique index bar_id_idx on foo (bar_id);

tx1.sql
=======
set autocommit=0;
start transaction;
insert into foo (bar_id) values (1);
delete from foo where foo_id = 1;
insert into foo (bar_id) values (1);
select sleep(20);
commit;
set autocommit=1;

tx2.sql
=======
set innodb_lock_wait_timeout=10;
set autocommit=0;
start transaction;
insert into foo (bar_id) values (3);
commit;
set autocommit=1;

The problem does not occur when
- the unique index is not created
- the record inserted in the first transaction is different to the record inserted and deleted before
[12 May 2020 14:53] Niklas Deutschmann
Output from "performance_schema" tables (MySQL 8.0.19)

Attachment: performance_schema.log (application/octet-stream, text), 2.29 KiB.

[17 Mar 12:28] chen zongzhi
Hello, guys. I add a simple mtr test to reproduce the secondary duplicate issue when change the lock from next-key record lock to record-not-gap record lock on secondary index duplicate check.

```
--source include/have_innodb.inc
--source include/have_debug.inc
--source include/have_debug_sync.inc

CREATE TABLE `t` (
  `pk` int(11) NOT NULL AUTO_INCREMENT,
  `sk` int(11) DEFAULT NULL,
  PRIMARY KEY (`pk`),
  UNIQUE KEY `sk` (`sk`)
) ENGINE=InnoDB AUTO_INCREMENT=10;

insert into t values(10, 9);
insert into t values(12, 7);

SET @old_innodb_stats_auto_recalc = @@innodb_stats_auto_recalc;
SET GLOBAL innodb_stats_auto_recalc = OFF;

# Block purge
SET GLOBAL innodb_purge_stop_now = ON;

SET GLOBAL transaction_isolation = 'READ-COMMITTED';

--connect(con1,localhost,root,,)
--connect(con2,localhost,root,,)
--connect(con3,localhost,root,,)

--connection con1

# Create and delete-mark an index record

CHECK TABLE t;
begin;
delete from t where sk=7;

--connection con2

--send insert into t select 1,7;

--connection con3
--send insert into t select 2,7;

--connection con1
CHECK TABLE t;
commit;

--connection con2
--reap
commit;

--connection con3
--reap
commit;

--disconnect con1
--disconnect con2
--disconnect con3

--connection default

select * from t;
CHECK TABLE t;

SET GLOBAL innodb_stats_auto_recalc = @old_innodb_stats_auto_recalc;
```

The duplicate reason is that..

before con1 commit; both con2 and con3 wait for the s record_not_gap lock on record 7 which is hold by con1 x record_not_gap lock on record 7 when con1 delete record 7 from t;

after con1 commit, both con2 and con3 pass the row_ins_scan_sec_index_for_duplicate() and going into btr_cur_optimistic_insert().

In btr_cur_optimistic_insert need to acquire lock = LOCK_X | LOCK_GAP | LOCK_INSERT_INTENTION; which is not conflict with s record_not_gap lock.

Then both con2 and con3 insert successfully, and the duplicate error happen..

The way used in upstream is change the lock from record_not_gap to next-key not in duplicate check before con1 commit, then after con1 commit,  con2 and con3 hold the s next-key lock and both acquire for  lock = LOCK_X | LOCK_GAP | LOCK_INSERT_INTENTION; Then deadlock happend and one of con2 or con3 rollback, only one of the two cons will insert successfully..

However, if we still use the next-key lock in upstream, the un-wanted next-key lock still block some un-conflict insert as the issue by Ovais Tariq.

I thought the root reason is that InnoDB use lock in secondary duplicate check, after the check, InnoDB could release the Lock request in duplicate check. However, It's hard to distinguish the lock request by duplicate check and other way. so InnoDB will keep the lock until the transaction commit..

At last, I also fix the duplicate error issue after change the duplicate check from next-key lock to record_not_gap lock.

before con1 commit, con2 and con3 only require s record_not_gap lock

after con1 commit, con2 and con3 acquire s record_not_gap lock and pass the duplicate check and going the btr_cur_optimistic_insert().

In btr_cur_optimistic_insert() I change the lock from LOCK_X | LOCK_GAP | LOCK_INSERT_INTENTION to LOCK_X | LOCK_ORDINARY | LOCK_INSERT_INTENTION if  rec and rec_next may have same unique key compare to the insert-tuple.

this is the draft code which can pass the mtr I provided at the begining and won't block other insert as provide by **Ovais Tariq**

```c++
+  ulint type_mode = LOCK_X | LOCK_GAP | LOCK_INSERT_INTENTION;
+
+  mem_heap_t *heap = NULL;
+  ulint offsets_[REC_OFFS_NORMAL_SIZE];
+  const ulint *offsets;
+  rec_offs_init(offsets_);
+  offsets =
+    rec_get_offsets(next_rec, index, offsets_, ULINT_UNDEFINED, &heap);
+
+  mem_heap_t *heap1 = NULL;
+  ulint offsets1_[REC_OFFS_NORMAL_SIZE];
+  const ulint *offsets1;
+  rec_offs_init(offsets1_);
+  offsets1 =
+    rec_get_offsets(rec, index, offsets1_, ULINT_UNDEFINED, &heap1);
+
+  /* Store old value on n_fields_cmp */
+
+  ulint n_unique;
+  ulint n_fields_cmp;
+  n_unique = dict_index_get_n_unique(index);
+  n_fields_cmp = dtuple_get_n_fields_cmp(entry);
+
+  dtuple_set_n_fields_cmp(entry, n_unique);
+
+  if (UNIV_SQL_NULL != dfield_get_len(dtuple_get_nth_field(entry, 0)) && page_rec_is_infimum(next_rec) == 0 && index->is_clustered() == 0 && page_rec_is_infim
um(rec) == 0 && page_rec_is_supremum(next_rec) == 0 && page_rec_is_supremum(rec) == 0 && ((cmp_dtuple_rec(entry, rec, index, offsets1) == 0) || cmp_dtuple_rec(
entry, next_rec, index, offsets) == 0)) {
+
+    ib::info() << "baotiao type_mode LOCK_ORDINARY";
+    type_mode = LOCK_X | LOCK_ORDINARY | LOCK_INSERT_INTENTION;
+  }

--- a/storage/innobase/row/row0ins.cc
+++ b/storage/innobase/row/row0ins.cc
@@ -1933,9 +1933,10 @@ static MY_ATTRIBUTE((warn_unused_result)) dberr_t
     found. This means it is possible for another transaction to
     insert a duplicate key value but MDL protection on DD tables
     will prevent insertion of duplicates into unique secondary indexes*/
-    const ulint lock_type =
-        index->table->skip_gap_locks() ? LOCK_REC_NOT_GAP : LOCK_ORDINARY;
+    // const ulint lock_type =
+    //     index->table->skip_gap_locks() ? LOCK_REC_NOT_GAP : LOCK_ORDINARY;

+    const ulint lock_type = LOCK_REC_NOT_GAP;
     if (page_rec_is_infimum(rec)) {
       continue;
     }
  
```
[18 Mar 9:55] Jakub Lopuszanski
Hi all,
I admire the amount of work you all put in investigation, and mostly agree with description of what is happening. The place I differ the most is the classification of the issue as a "bug" - yes, the locks seem inconvenient in some scenarios, but there are there for correctness (again, nice to see you've already found a counterexample), and while one could envision and implement other locking strategies, that would IMHO be a "feature request".
The way I see it, the main complain is that InnoDB takes gap locks if there is at least one (perhaps delete-marked) record in the unique index range, but doesn't take any lock if the range is empty, and somehow this later behaviour set your expectations for how it should work and then the former behaviour is seen as suboptimal and worth reporting a bug. But my perspective is the opposite: in general, InnoDB *needs* to lock the range of unique index which might contain conflicting values, but in the special case the range is empty, it can use a "fast path" in which it relies on short-lived latch held on the B-tree page to ensure there's no conflict while it's physically adding the record and then, after the B-tree page's latch is released it can rely on the implicit lock on the inserted record.
In pseudocode it's something like:

    find the B-tree page in the secondary index you want to insert the value to
    assert the B-tree page is latched
    equal-range = the range of records in the secondary index which conflict with your value 
    if(equal-range is not empty){
      release the latches on the B-tree and start a new mini-transaction
      for each record in equal-range
        lock gap before it, and the record itself (this is what LOCK_S does)
      also lock the gap after the last(equal-range)
      also (before Bug #32617942 was fixed) lock the record after last(equal-range)
      once you are done with all of the above, find the B-tree page again and latch it again
    }
    insert the record into the page and release the latch on the B-tree page.

Yes, there are some things that we could better.

For one thing, Bug #32617942 ROW_INS_SCAN_SEC_INDEX_FOR_DUPLICATE SHOULD NOT LOCK THE FIRST UNEQUAL RECORD addresses the issue of the first element after the range being unnecessarily locked, while all we needed for correctness is to lock the gap before it.

Another, as already pointed by you, is that in READ-COMMITTED the locks needed for correctness could have a query-duration as opposed to transaction-duration. And as you've pointed out, this is not trivial to implement, because InnoDB's Lock System reuses/merges locks placed at various moments (including: by different queries) and for various reasons, which makes it difficult to tell if the lock can be released or is still needed for some other reason. Possible way to overcome this, is to introduce a new lock mode, so that it doesn't get merged/reused with LOCK_GAP, and either introduce reference-counting, or even better, a field which remembers which query created it, so that it can be ignored once the query has finished. Then one has to somehow find the locks created during the query which are outdated and release them (which might involve waking up waiters - which means some code needs to be executed actively, as opposed to passively "ignoring" the lock). IMHO this would be more of a "feature request" not a "bug-fix". It would require extensive performance tests to see if the cost of extra field and step after each query is worth the gain.
Some preliminary work in this direction was made in Bug #29004362 LOCK->TRX->DUPLICATES CAN NOT BE USED TO DETERMINE IF LOCK IS CREATED BY IODKU so that at least the row_ins_set_rec_lock() informs the Lock System that the intended duration of the lock is lock_duation_t::AT_LEAST_STATEMENT. Currently this hint is only used to know if locks have to be inherited to a gap during purge in READ-COMMITTED and below. But I hope this gives a nice foundation for pursuing this direction in future. I must warn that this isn't very easy road: one has to somehow 'inherit' the information when locks are moved around due to purge, page-splits, merges, and care must be taken to avoid race conditions between transactions, so that every waiter is woken up exactly once (not twice, or never) etc.
 
Another direction would be to try to extend the duration for which the B-tree latches are held, so that even the case when the equal-range is non-empty is handled by relying on latches instead of locks. 
This is problematic for several reasons, which I'll only list, but not attempt to address, as I haven't thought much about solving them yet:
1. a range may span more than one B-tree page. Actually, there's no limit on how many delete-marked records with conflicting values can co-exist at the same time if purge is lagging, which means we might be forced to perform arbitrarily large mini-transaction. If there are two threads which try to perform the insert in parallel into two separate B-tree pages inside the same 'equal-range' you need to somehow coordinate between them, which naively could mean holding a lot of latches for a lot of time which is against InnoDB design (where latch=short-lived, lock=long-lived)
2. one needs to check for implicit locks on secondary index records from the 'equal range'. Checking if a single secondary index record is implicitly locked is in itself a very complicated operation, as it involves analysing the undo log chain of previous versions of the row in the clustered index, which again can take arbitrarily long and involve I/O etc. Having to do that for many rows in the equal-range could be even more problematic. BTW. I'm not sure there even exists a way to do all that in a single critical section without deadlock (due to latch-ordering violation, as you have to switch between clustered index, secondary index, undo logs, b-tree pages etc.) 
3. in case a conflicting implicit lock is found we need to wait, which means we need to not only create an explicit lock, but more importantly release the latches as we can't wait for a lock while hogging a latch. This means we need to have a way to handle retries which is provably correct

I'm not saying all of this is a show-stopper or that we aren't thinking about how to solve it.
I just don't like framing this all as a "bug". Maybe I'm the "see the glass as half-full" type of guy, but to me it looks like InnoDB has a nice fast path for the easy case, which is a nice feature, and does the correct (albeit slow) thing in the other cases, which is also nice (and is quite an achievement given the history of correctness bugs in this area).
I understand you want it to be faster. We too. Stay tuned :)
[19 Mar 19:13] chen zongzhi
Hello jakub,

Thank you for long and patience reply. 

At first, I agree that this issue should be performance issue, not a bug.

I agree with you that InnoDB should have query-duration trx lock, and I thought it's hard to implement, You need also consider the two-phase lock limitation. Release the trx lock before trx commit may cause more deadlock.

And back to the solution, I thought change the unique check lock to record_not_gap lock and change the gap|insert_intention lock to next-key|insert_intention  could be a fast fixup to this issue.
Could you find some corner case that may cause duplicate key violation or some other performance issue?
[23 Mar 3:24] Fungo Wang
Hi all,

This is an interesting discussion, I'd also like to share some finding and thinking.

At first, I guess the complexity to implement unique constraint on SK comes from multi version (MVCC), the same SK key could match many physical records (most delete-marked records, at most one alive, this is how multi version works on SK), and these physical records could span multiple pages. On the other hand, the unique constraint for primary key (clustered) is much simple, there could be only one physical record for a PK key.

How to achieve unique constraint is implementation choice, but I guess basicly it can be parted into 2 steps:
1. check all the existed physical records to see whether there is valid conflict
2. if no conflict, insert the record
step 1 and step 2 together must be protected against concurrency.

Currently, InnoDB UK depends on trx lock (gap lock especially), which is a trx thing and must obey the 2PL rule. But actually the locks used for UK constaint does not have live the same life as locks for trx, they are only need for UK constraint, not for isolation purpose, **so they can be safely released as soon as the UK operation is finished**. Statement duration is still too long, for exmaple for multi values insert :) But current lock implenmentation is too complex, it's not easy to isolate the UK lock requirement, as Jakub pointed out. 

Trx lock is one way to implement UK, but it's not the only way. Jakub also pointed out the latch way, actually this is how PostgreSQL implements UK, there is no gap-lock thing in PG. 

This is how UK works in PG by example (IIUC):

Leaf     9  10(D)    10(D) 10(D)    10(D)        10(D)  11
Pages  -----------   -----------   -----------   -----------
         page_n       page_n+1       page_n+2     page_n+3

These are part of leaf pages for a UK, all the existed physical records for key = 10 are delete-marked, and now we want to insert a new key = 10 records.

1. top-down search to page_n and latch page_n with X latch
2. check page_n and find we need to move to page_n + 1, **keep the X latch on page_n**, and get S latch on page_n+1
3. check page_n+1 and find we need to move to page_n+2, release S latch on page_n+1 and get S latch on page_n+2
4. check page_n+2 and find we need to move to page_n+3, release S latch on page_n+2 and get S latch on page_n+3
5. check page_n+3 and find there does not exist conflict, we can insert, (**the X latch on page_n is still hold**) 
6. begin to insert our key = 10, and need to find the insert localtion, page_n does not have enough space, need to move page_n+1, **get X latch on page_n+1, release X latch on page_n**
7. check page_n+1 does not have enough space, need to move page_n+2, get X latch on page_n+2, release X latch on page_n+1
8. page_n+2 is good to insert, insert to page_n+2 and release X latch on page_n+2

I just focus on how the latch is used, the detailed procedure is much more complex and can be found here https://github.com/postgres/postgres/blob/master/src/backend/access/nbtree/nbtinsert.c#L18....

The key idea is the X latch on page_n, through this single page latch, the concurrent UK insert is serialized, and UK constraint is maintained. 

For InnoDB, I guess SX latch on page_n is enough, and this could be an solution to the 1st concern:
> 1. a range may span more than one B-tree page. Actually, there's no limit on how many delete-marked records with conflicting values can co-exist at the same time if purge is lagging, which means we might be forced to perform arbitrarily large mini-transaction. If there are two threads which try to perform the insert in parallel into two separate B-tree pages inside the same 'equal-range' you need to somehow coordinate between them, which naively could mean holding a lot of latches for a lot of time which is against InnoDB design (where latch=short-lived, lock=long-lived)

The 2n and 3rd concern are real issues for InnoDB, how the history data is organized in PG is different from InnoDB, the long undo chain travel may make the SX/X latch be hold for a very long time. Maybe we can put latest trx_id info in SK record?

Another concern may be spliting (SMO), when the final insert need split page_n+2. PG using B-link tree, this is not an issue for it, but InnoDB is B+tree, the split needs upper level page latch.
[7 May 4:51] chen zongzhi
Hi all,
I modify sysbench code to test oltp_read_write case to get the performance data.
The modifications of sysbench are as follows:
diff --git a/src/lua/oltp_common.lua b/src/lua/oltp_common.lua
index 90f93e2..a4dd10c 100644
--- a/src/lua/oltp_common.lua
+++ b/src/lua/oltp_common.lua
@@ -219,11 +219,11 @@ CREATE TABLE sbtest%d(

      if (sysbench.opt.auto_inc) then
        query = string.format("(%d, '%s', '%s')",
-                               sb_rand(1, sysbench.opt.table_size), c_val,
+                               i, c_val,
                              pad_val)
      else
        query = string.format("(%d, %d, '%s', '%s')",
-                               i, sb_rand(1, sysbench.opt.table_size), c_val,
+                               i, i, c_val,
                              pad_val)
      end

@@ -235,7 +235,7 @@ CREATE TABLE sbtest%d(
  if sysbench.opt.create_secondary then
      print(string.format("Creating a secondary index on 'sbtest%d'...",
                          table_num))
-     con:query(string.format("CREATE INDEX k_%d ON sbtest%d(k)",
+     con:query(string.format("CREATE unique INDEX k_%d ON sbtest%d(k)",
                              table_num, table_num))
  end
end
@@ -480,7 +480,7 @@ function execute_delete_inserts()
      param[tnum].deletes[1]:set(id)

      param[tnum].inserts[1]:set(id)
-     param[tnum].inserts[2]:set(k)
+     param[tnum].inserts[2]:set(id)
      param[tnum].inserts[3]:set_rand_str(c_value_template)
      param[tnum].inserts[4]:set_rand_str(pad_value_template)

after the modification, the table definition is to be:
mysql> show create table sbtest.sbtest1\G
*************************** 1. row ***************************
      Table: sbtest1
Create Table: CREATE TABLE `sbtest1` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`k` int(11) NOT NULL DEFAULT '0',
`c` char(120) NOT NULL DEFAULT '',
`pad` char(60) NOT NULL DEFAULT '',
PRIMARY KEY (`id`),
UNIQUE KEY `k_1` (`k`)
) ENGINE=InnoDB AUTO_INCREMENT=11 DEFAULT CHARSET=latin1

oltp_read_write contain delete and insert, then after delete operation, insert new key need add a next-key lock as I explained before.

I run with this sysbench command:
sysbench oltp_read_write --mysql-host=xxxxxx --mysql-port=xxx --mysql-user=mysql --mysql-password=123456 --tables=1 --table_size=1000 --db-driver=mysql --threads=1 --report-interval=1 --time=12000 --rand-type=uniform --mysql-db=sbtest --point_selects=0 --range_selects=0 --index_updates=0 --non-index_updates=0 --threads=128 --time=60 prepare/run/cleanup.

after this patch, oltp_read_write get about 5% performance improvement. the more conflict is, the higher performance we will get...