Bug #98324 Deadlocks more frequent since version 5.7.26
Submitted: 22 Jan 9:57 Modified: 1 Apr 14:14
Reporter: Przemyslaw Malkowski Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: InnoDB storage engine Severity:S3 (Non-critical)
Version:5.7.26, 5.7.29 OS:Any
Assigned to: CPU Architecture:Any
Tags: deadlock, innodb

[22 Jan 9:57] Przemyslaw Malkowski
Description:
More deadlocks are observed after minor version upgrade from 5.7.25 to 5.7.26 or higher.
For a simple test case with REPLACE statements, I am getting around 20x more deadlock messages after upgrade.

I wonder if that change in 5.7.26 could be possibly related:
"InnoDB: Two sessions concurrently executing an INSERT ... ON DUPLICATE KEY UPDATE operation generated a deadlock. During partial rollback of a tuple, another session could update it. The fix for this bug reverts fixes for Bug #11758237, Bug #17604730, and Bug #20040791. (Bug #25966845)"

How to repeat:
Set up simple sandbox instances with all defaults (I used dbdeployer), using MySQL Community 5.7.25 and .26. 
Create a simple table like:

CREATE TABLE test (id int unsigned NOT NULL AUTO_INCREMENT, a varchar(20), b int, c timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP, d varchar(10), PRIMARY KEY (id), UNIQUE KEY uk1 (a,b,d)) ENGINE=InnoDB ;

Then run concurrent REPLACE loops against both. My results are as follows.

For 5.7.25:

$ for i in {1..2000}; do msb_5_7_25/use test -e "REPLACE INTO test SET id=NULL, a='foobar', b = 2, d = 'boobar' "; done & for i in {1..1000}; do msb_5_7_25/use test -e "REPLACE INTO test SET id=NULL, a='foobar', b = 1, d = 'boobar' "; done &
[1] 23674
[2] 23675
ERROR 1213 (40001) at line 1: Deadlock found when trying to get lock; try restarting transaction

[1]-  Done                    for i in {1..2000};
do
    msb_5_7_25/use test -e "REPLACE INTO test SET id=NULL, a='foobar', b = 2, d = 'boobar' ";
done
[2]+  Done                    for i in {1..1000};
do
    msb_5_7_25/use test -e "REPLACE INTO test SET id=NULL, a='foobar', b = 1, d = 'boobar' ";
done

For 5.7.26:

$ for i in {1..2000}; do msb_5_7_26/use test -e "REPLACE INTO test SET id=NULL, a='foobar', b = 2, d = 'boobar' "; done & for i in {1..1000}; do msb_5_7_26/use test -e "REPLACE INTO test SET id=NULL, a='foobar', b = 1, d = 'boobar' "; done &
[1] 9343
[2] 9344
ERROR 1213 (40001) at line 1: Deadlock found when trying to get lock; try restarting transaction
ERROR 1213 (40001) at line 1: Deadlock found when trying to get lock; try restarting transaction
ERROR 1213 (40001) at line 1: Deadlock found when trying to get lock; try restarting transaction
ERROR 1213 (40001) at line 1: Deadlock found when trying to get lock; try restarting transaction
ERROR 1213 (40001) at line 1: Deadlock found when trying to get lock; try restarting transaction
ERROR 1213 (40001) at line 1: Deadlock found when trying to get lock; try restarting transaction
ERROR 1213 (40001) at line 1: Deadlock found when trying to get lock; try restarting transaction
ERROR 1213 (40001) at line 1: Deadlock found when trying to get lock; try restarting transaction
ERROR 1213 (40001) at line 1: Deadlock found when trying to get lock; try restarting transaction
ERROR 1213 (40001) at line 1: Deadlock found when trying to get lock; try restarting transaction
ERROR 1213 (40001) at line 1: Deadlock found when trying to get lock; try restarting transaction
ERROR 1213 (40001) at line 1: Deadlock found when trying to get lock; try restarting transaction
ERROR 1213 (40001) at line 1: Deadlock found when trying to get lock; try restarting transaction
ERROR 1213 (40001) at line 1: Deadlock found when trying to get lock; try restarting transaction
ERROR 1213 (40001) at line 1: Deadlock found when trying to get lock; try restarting transaction
ERROR 1213 (40001) at line 1: Deadlock found when trying to get lock; try restarting transaction
ERROR 1213 (40001) at line 1: Deadlock found when trying to get lock; try restarting transaction
ERROR 1213 (40001) at line 1: Deadlock found when trying to get lock; try restarting transaction
ERROR 1213 (40001) at line 1: Deadlock found when trying to get lock; try restarting transaction
ERROR 1213 (40001) at line 1: Deadlock found when trying to get lock; try restarting transaction
ERROR 1213 (40001) at line 1: Deadlock found when trying to get lock; try restarting transaction
ERROR 1213 (40001) at line 1: Deadlock found when trying to get lock; try restarting transaction
ERROR 1213 (40001) at line 1: Deadlock found when trying to get lock; try restarting transaction
ERROR 1213 (40001) at line 1: Deadlock found when trying to get lock; try restarting transaction

[1]-  Done                    for i in {1..2000};
do
    msb_5_7_26/use test -e "REPLACE INTO test SET id=NULL, a='foobar', b = 2, d = 'boobar' ";
done
[2]+  Done                    for i in {1..1000};
do
    msb_5_7_26/use test -e "REPLACE INTO test SET id=NULL, a='foobar', b = 1, d = 'boobar' ";
done

Same issue with 5.7.27.

Suggested fix:
Explain why 5.7.26+ detects more far more deadlocks and if that's a regression bug, fix it.
[23 Jan 7:53] MySQL Verification Team
Hello Przemyslaw,

Thank you for the report and test case.
Other than 5.7.26 change log(as mentioned already in report i.e Two sessions concurrently executing an INSERT ... ON DUPLICATE KEY UPDATE operation generated a deadlock. During partial rollback of a tuple, another session could update it. The fix for this bug reverts fixes for Bug #11758237, Bug #17604730, and Bug #20040791. (Bug #25966845)) - https://dev.mysql.com/doc/relnotes/mysql/5.7/en/news-5-7-26.html I couldn't locate any other info to correlate this behavior since 5.7.26+

regards,
Umesh
[23 Jan 7:54] MySQL Verification Team
Test results - 5.7.25/26/29 and 8.0.19

Attachment: 98324.results (application/octet-stream, text), 13.77 KiB.

[6 Feb 3:10] Stephen Wei
Hello,

Here is another test case showing different behaviors since 5.7.26.
The test case works well on 5.7.25 but get a deadlock error on 5.7.26 and later.

How to reproduce:
1. Create a docker network for the test.

docker network create net-mysql

2. Create containers for test versions.

for MINOR in 25 26 29; do docker run --network net-mysql --name mysql57-$MINOR -e MYSQL_ROOT_PASSWORD=r00t -d mysql:5.7.$MINOR; done

3. Run tests

docker run --rm -it --network net-mysql python:3.7 bash

# in container
pip install mysql-connector-python

# COPY deadlock-test.py
python3 deadlock-test.py 25 # no exception, return code 0
python3 deadlock-test.py 26 # Deadlock found when trying to get lock; try restarting transaction
python3 deadlock-test.py 29 # Deadlock found when trying to get lock; try restarting transaction

(deadlock-test.py will be attached later)
[6 Feb 3:13] Stephen Wei
The test script for daedlock.

Attachment: deadlock-test.py (text/x-python-script), 2.45 KiB.

[2 Mar 19:15] Thanh Nguyen
I just faced the same issue. These two batch "Insert on update query" transactions that works in 5.7.16 gets deadlock when we update the database to 5.7.26. 

'INSERT INTO ttable (`foo`) VALUES (699422), (699421) ON DUPLICATE KEY UPDATE foo = VALUES (foo)'
'INSERT INTO ttable (`foo`) VALUES (699439), (699439) ON DUPLICATE KEY UPDATE foo = VALUES (foo)'
id is primary key, foo is unique key

I tried all MySQL version from 5.716 -> 5.7.25 and they are all fine. So I also believe this change is the reason of the deadlock:
"InnoDB: Two sessions concurrently executing an INSERT ... ON DUPLICATE KEY UPDATE operation generated a deadlock. During partial rollback of a tuple, another session could update it. The fix for this bug reverts fixes for Bug #11758237, Bug #17604730, and Bug #20040791. (Bug #25966845)"
[2 Mar 19:15] Thanh Nguyen
My Test script . run by ./deadlock.sh

Attachment: deadlock.sh (text/x-sh), 634 bytes.

[2 Mar 19:23] Thanh Nguyen
My analysis:

This change is the commit introduced in 5.7.26 https://github.com/mysql/mysql-server/commit/066b6fdd433aa6673622341f1a2f0a3a20018043
"This bug fix was just extending the gap locks for duplicates as a solution to this problem, which was not a foolproof solution. Therfore we reverted this fix"

The reverted commit is https://github.com/mysql/mysql-server/commit/c93b0d9a972cb6f98fd445f2b69d924350f9128a:
"When INSERT ... ON DUPLICATE UPDATE statement is executed, first, the record would be inserted into the clustered index followed by the secondary indexes. If a duplicate error is encountered, we do not stop processing.  We continue to process all the unique secondary indexes and place the necessary gap locks.
Once a duplicate error has been encountered, the records will not be actually inserted but only gap locks will be placed."

My short hypothesis:
5.7.4 (2014) applied a longer lock to help resolve the replication issue.
5.7.26 (2019) found a better approach for the replication, so the extra locks are removed.

Why 5.7.4 implementation has more aggressive locking but the deadlock happens more frequently when the change get reverted?
Before 5.7.4, the lock is extended for a longer period => the two batch transactions are fully isolated. In 5.7.26, the lock periods get shorter => 2 batch transactions are intervened and cause deadlock

My concern is will the doc still up-to-date with this change in 5.7.26 https://dev.mysql.com/doc/refman/5.7/en/innodb-locks-set.html . 

"INSERT ... ON DUPLICATE KEY UPDATE differs from a simple INSERT in that an exclusive lock rather than a shared lock is placed on the row to be updated when a duplicate-key error occurs. An exclusive index-record lock is taken for a duplicate primary key value. An exclusive next-key lock is taken for a duplicate unique key value."

Seem like the "exclusive next-key lock" is the extra locking that gets removed after the change in 5.7.26?
[5 Mar 10:25] Jakub Lopuszanski
Impressive investigation - thanks! :)

First of all, when talking about these particular bugs, we have to completely ignore semantics of their titles.
The titles were set by initial reporters who noticed some particular behaviour which they considered buggy. 
Many people view deadlocks as something very bad so keep reporting them. 
I want to stress that deadlock is not a bug per se.
But new strange behaviour might be indicative of some bug underneath, so we often investigate.
This is how it turned out that deadlocks were just (perhaps anoying) symptoms of much bigger problem.
In particular the Bug #25966845 INSERT ON DUPLICATE KEY GENERATE A DEADLOCK has a title complaining about a deadlock, but deadlocks are a fact of life and not a bug per se.
However, when investigating what exactly went wrong, we've discovered that implementation does not provide enough locks which was much more serious problem! 
It could happen that knowing what queries transactions performed and in which order they committed was no longer enough to determine the final state of the DB,
which means among other things that replication could fail to produce exact copy of DB. 
Therefore the fix Bug #25966845 is actually about bringing back serializability guarantees by adding more locks. 
And more locks can (as expected) cause more deadlocks in some scenarios.

(Following example comes from Thirunarayanan Balathandayuth, thanks!)
Imagine a situation with a table:
```
create table t1(
  f1 int auto_increment primary key,
  f2 int unique key,
  f3 int
);
```
and a single row in it:
(1, 10, 100)

And that there are two parallel connections:

con1:
    begin;
    insert into t1 values(2, 10, 200) on duplicate key update f3 = 120;
    commit;
    
con2:
    begin;
    insert into t1 values(2, 20, 300) on duplicate key update f3 = 500;
    commit;

And lets say we know for sure that con2 committed before con1 committed, but otherwise are not sure about interleaving of individual queries (and their internals).
What is the end state of the DB?

Turns out, that before the bugfix it was non-deterministic!
If con1 executed the whole insert before con2 started, then con1 would see a conflict on f2=10, and thus update the only row to (1, 10, 120).
Then con2 would see no conflict, and inserted (2,20,300), so the end state would be:
(1,10,120)
(2,20,300)
But, later during replication, knowing that con2 committed before con1, we would execute them in opposite order: 
First con2's insert would see no conflict, and insert (2,20,300).
Then con1 would see a conflict, but this time on f1=2, (as opposed to f2=10, which it saw on the leader) 
and thus it would decide that the conflicting row is the (2,20,300), 
and update it to (2,20,120), so that the end state on replica would be:
(1,10,100)
(2,20,120).
Not cool.
The difficult thing we are dealing here with is that if there are multiple constraints then it is very important which particular constraint was violated (f1 or f2?),
because this affects which row will be updated by the ON DUPLICATE KEY UPDATE clause (the one with conflict on f1 or the one with conflict on f2?).

It was identified, that in order to provide serializability we need to "lock everything we saw to make decision".
Observe that on the leader the con1 made a decision that the conflict is on f2 (as opposed to f1) by first observing that there is no row with f1=2.
How did it observe it? By temporarily creating a record in primary index with f1=2 and seeing that it succeeded.
Only later, when it found a conflict on f2 and decided to remove the record.
By removing the record, con1 removed a part of "proof" that it saw no conflict on f1.
This is why con2 was able to insert a row with f1=2, which later lead to replication issues.
The correct way to "preserve the evidence" would be to make sure that the gap in which f1=2 was remains locked until con1 commits.
This can be accomplished by creating explicit lock on the temporary row, and then let it be inherited on removal.
Such explicit lock can be created by so called implicit-to-explicit conversion.
The con1 already had an implicit lock on the record, because its TRX_ID was in the row's header as the id of the trx which written it.
But this implicit lock will vanish as soon as we remove the record physically, so it needs to be saved as an explicit lock in memory.

I hope this gives a glimpse of what we are dealing with here, and what we try to accomplish :)
[5 Mar 10:26] Jakub Lopuszanski
I'd like to provide some more details from my analysis of deadlock.sh run.
I'll be using 8.0.21-tr, because I understand 8.0.x source better than 5.7.x, and it has nice performance_schema.data_locks.
I believe that we still (at least in 8.0.21) take "an exclusive next-key lock for a duplicate unique key value" so docs (at least for 8.0) are correct.
I think it should be quite similar for 5.7.x (as both were fixed in parallel), and I have it on my TODO list to verify it.

During deadlock.sh run the situation on 8.0.21-tr could be like this:
```
mysql> select * from dbtest.ttable;
+----+--------+
| id | foo    |
+----+--------+
|  4 | 699421 |
|  3 | 699422 |
|  1 | 699439 |
+----+--------+
3 rows in set (0.00 sec)
```

The row with id=2 is missing because transaction2 inserts foo:699439 twice, which increments the autoinc twice, but one of the rows is not used.
One can reproduce the deadlock even if transaction2 inserts two different rows foo:699439, foo:699438, so this is not the most important thing here.

Also, depending on timing of the very first two queries in the first iteration of the loop in deadlock.sh you might alternatively see quite different values, for example:
```
mysql> select * from dbtest.ttable;
+----+--------+
| id | foo    |
+----+--------+
|  3 | 699421 |
|  2 | 699422 |
|  1 | 699439 |
+----+--------+
3 rows in set (0.00 sec)
```
as 8.0 uses innodb_autoinc_lock_mode = 2 (“interleaved” lock mode) as the default way of handling auto increment.
Let's assume the table state is as in the latter one and see how a deadlock looks like.

If I place a breakpoint in `lock_wait_check_candidate_cycle()` (when a probable deadlock cycle is found, but just before it acquires latches preventing any activity in lock sys),
freeze this thread, unpause execution for all other threads and from a separate client check the `performance_schema.data_locks` I can see:

```
mysql> SELECT ENGINE_TRANSACTION_ID,OBJECT_NAME,INDEX_NAME,LOCK_TYPE,LOCK_MODE,LOCK_STATUS,LOCK_DATA FROM performance_schema.data_locks;
+-----------------------+-------------+--------------+-----------+--------------------+-------------+------------------------+
| ENGINE_TRANSACTION_ID | OBJECT_NAME | INDEX_NAME   | LOCK_TYPE | LOCK_MODE          | LOCK_STATUS | LOCK_DATA              |
+-----------------------+-------------+--------------+-----------+--------------------+-------------+------------------------+
|                 12298 | ttable      | NULL         | TABLE     | IX                 | GRANTED     | NULL                   |
|                 12298 | ttable      | index_unique | RECORD    | X                  | GRANTED     | 699439, 1              |
|                 12298 | ttable      | PRIMARY      | RECORD    | X,REC_NOT_GAP      | GRANTED     | 1                      |
|                 12298 | ttable      | PRIMARY      | RECORD    | X                  | GRANTED     | supremum pseudo-record |
|                 12298 | ttable      | PRIMARY      | RECORD    | X,INSERT_INTENTION | WAITING     | supremum pseudo-record |
|                 12297 | ttable      | NULL         | TABLE     | IX                 | GRANTED     | NULL                   |
|                 12297 | ttable      | index_unique | RECORD    | X                  | GRANTED     | 699422, 2              |
|                 12297 | ttable      | PRIMARY      | RECORD    | X,REC_NOT_GAP      | GRANTED     | 2                      |
|                 12297 | ttable      | PRIMARY      | RECORD    | X                  | GRANTED     | supremum pseudo-record |
|                 12297 | ttable      | PRIMARY      | RECORD    | X,INSERT_INTENTION | WAITING     | supremum pseudo-record |
+-----------------------+-------------+--------------+-----------+--------------------+-------------+------------------------+
10 rows in set (0.00 sec)
```
Couple of notes on how to interpret this:
ENGINE_TRANSACTION_ID:
    for transactions which are known to be not-read-only this is the id of the trx which requested the lock
    (for transactions which are still possibly read-only you'll see some internally assigned temporary value)
INDEX_NAME:
    NULL - table locks are for whole table
    index_unique - the secondary index name
    PRIMARY - the clustered index    
LOCK_MODE:
    X,REC_NOT_GAP = exclusive lock on the record itself, not on the gap (a.k.a. Record Lock)
    X = exclusive lock on the record and the gap before it (a.k.a. Next Key Lock)
    X,INSERT_INTENTION = a lock enqueued by transaction which waits until a conflicting lock on the gap it wants to insert to is released
LOCK_STATUS:
    GRANTED - the transaction owns the lock
    WAITING - the transaction is waiting until conflicting lock is released
LOCK_DATA:
    for secondary index - values of columns explicitly mentioned in the index definition followed by any remaining primary key columns
    for primary index - all the columns of the primary key
    NULL - for table locks (obviously), 
               and whenever it was impossible to read column values due to the buffer pool page being in use by another thread at the time of quering performance_schema

More than one transaction can have an X lock on supremum pseudo-record GRANTED, because:
(1) Locks on supremum pseudo-record are always GAP-only locks. 
    They protect the gap after the last phisical record on page (before the imaginary "supremum" pseudo-record), and
(2) More than one transaction can have a GRANTED lock on any gap, as their purpose is just to prevent INSERTs into the gap,
    so if several transactions want to prevent the same thing from happening then there is no conflict.

Both transactions perform INSERTs of more than one row.
When attempting to INSERT the second row, they try to do so into the gap at the end of the page which is protected by the aforementioned X locks on (a gap before) supremum pseudo-record.
Thus they have to start WAITING.
This leads to a deadlock, as both transactions need X,INSERT_INTENTION to the gap that they both guard.

So, one can say that the problem (if any) is not in the handling of the first inserted row, but rather in that the second one is prevented from insert.
Indeed if you modify deadlock.sh to insert just one row from each connection there is no deadlock observed.
[5 Mar 10:28] Jakub Lopuszanski
The X lock on supremum pseudo-record in PRIMARY index appears in a quite complicated way.
To explain it step by step, let's assume that the state of the table is as in the deadlock.sh situation:
```
mysql> select * from dbtest.ttable;
+----+--------+
| id | foo    |
+----+--------+
|  3 | 699421 |
|  2 | 699422 |
|  1 | 699439 |
+----+--------+
3 rows in set (0.00 sec)
```
and that I am performing a simple:
```
INSERT INTO ttable (`foo`) VALUES (699422) ON DUPLICATE KEY UPDATE foo = VALUES (foo)
```
Here's the timeline of events related to locking:
1. `lock_table_create(...,LOCK_IX,...)` is called to create Intention lock on the table, 
    which is just to let other threads interested in locking whole table that there is someone modifying its rows down here 
    (actually, this is a bit unnecessary, given that MySQL layer above InnoDB guards against this too)
2. `lock_rec_insert_check_and_lock(...,rec,..,{name:"PRIMARY"})` is called where `rec` is just before `PAGE_HEAP_NO_SUPREMUM`.
    As there are no other transactions at this point holding any locks, this request to INSERT into the gap before supremum pseudo-record is granted in a fast-path 
    which does not involve creating an explicit X,INSERT_INTENTION lock - the thread simply moves on to insert the row, and we rely on buffer pool page's latch during
    this operation to prevent interference from others, and afterwards, we rely on TRX_ID stored inside the row,which acts as implicit lock.
3. `lock_sec_rec_read_check_and_lock(lock_duration_t::AT_LEAST_STATEMENT,...,rec,{name:"index_unique"},...,LOCK_X,LOCK_ORDINARY)` gets called during `row_ins_scan_sec_index_for_duplicate()`, 
    a function which scans secondary index for duplicates.
    `LOCK_ORDINARY` means we want to lock "rec+gap", a.k.a. Next-key Lock.
    The `rec` has heap_no==3 and `(rec[1]*256+rec[2])*256+rec[3]` is 699422 and `rec[11]` is 2.
    Note how this matches the <foo:699422,id:2> entry in the secondary index.
    3.1. It calls `lock_protect_locks_till_statement_end()` to make sure that even in the low isolation levels (READ COMMITTED and READ UNCOMMITTED) the locks created during this statement
         will be properly inherited during record phisical removal as gap locks on the gap before next record, even though these isolation levels usually don't use GAP locks at all.
    3.2. It calls `lock_rec_lock(..,SELECT_ORDINARY,LOCK_X,..,heap_no=3,{name:"index_unique"},..)` which suceeds and creates a Next-key lock which corresponds to:
```
|                 12297 | ttable      | index_unique | RECORD    | X                  | GRANTED     | 699422, 2              |
```
         we saw earlier.
4. `lock_rec_convert_active_impl_to_expl(...,rec,..,{name:"PRIMARY"},heap_no=5)` gets called during `trx_rollback_to_savepoint()` when `row_mysql_handle_errors()` handles `DB_DUPLICATE_KEY` error.
    The `rec[6]*256+rec[7]` is 590 which is the `id` assigned by autoincrement to this temporary row (remember I've run the deadlock.sh before which pumped up the autoinc).
    The heap_no:5 in general should not be interpreted without knowing the history of heap allocations and dealocations as they might get reused in complicated way, 
    but in a "simple" history, heap_no:0 is infimum, heap_no:1 is supremum, heap_no:2 is smalles record (in our case:{id:1}), heap_no:3 would be {id:2}, heap_no:4 would be {id:3},
    and thus heap_no:5 matches our expectation of dealing with the largers phisical record on the page.
    4.1 It calls `lock_rec_add_to_queue(LOCK_REC | LOCK_X | LOCK_REC_NOT_GAP,..,heap_no=5,{name:"PRIMARY"}) which creates a Record Lock, 
        one that we can not see in the performance_schema report above, unfortunately, as that query was performed at a later moment.
    However, we can freeze this thread at this point, unpause other threads, and use separate connection to perform similar query, to see:
```
mysql> SELECT ENGINE_TRANSACTION_ID,OBJECT_NAME,INDEX_NAME,LOCK_TYPE,LOCK_MODE,LOCK_STATUS,LOCK_DATA FROM performance_schema.data_locks;
+-----------------------+-------------+--------------+-----------+---------------+-------------+-----------+
| ENGINE_TRANSACTION_ID | OBJECT_NAME | INDEX_NAME   | LOCK_TYPE | LOCK_MODE     | LOCK_STATUS | LOCK_DATA |
+-----------------------+-------------+--------------+-----------+---------------+-------------+-----------+
|                 12490 | ttable      | NULL         | TABLE     | IX            | GRANTED     | NULL      |
|                 12490 | ttable      | index_unique | RECORD    | X             | GRANTED     | 699422, 2 |
|                 12490 | ttable      | PRIMARY      | RECORD    | X,REC_NOT_GAP | GRANTED     | NULL      |
+-----------------------+-------------+--------------+-----------+---------------+-------------+-----------+
3 rows in set (0.00 sec)
``` 
       the NULL in the LOCK_DATA means that performance_schema was unable to obtain the record data (which is understandable because our thread has a latch on the buffer pool page with the record)
       and thus can not report the exact value of the columns. By cheating a little:
```
mysql> SELECT ENGINE_LOCK_ID,INDEX_NAME,LOCK_MODE FROM performance_schema.data_locks;
+------------------------------------+--------------+---------------+
| ENGINE_LOCK_ID                     | INDEX_NAME   | LOCK_MODE     |
+------------------------------------+--------------+---------------+
| 2772975965448:1077:2772943755656   | NULL         | IX            |
| 2772975965448:20:5:3:2772943752744 | index_unique | X             |
| 2772975965448:20:4:5:2772943753096 | PRIMARY      | X,REC_NOT_GAP |
+------------------------------------+--------------+---------------+
3 rows in set (0.00 sec)
```
       we can learn that this is actually the lock on heap_no 5, as we can see that  ENGINE_LOCK_ID==2772975965448:20:4:5:2772943753096 and this :5: part
       together with knowlege of current internal implementation(..:space_id:page_id:heap_no:..) proves that.
[5 Mar 10:29] Jakub Lopuszanski
5. `lock_rec_inherit_to_gap(..,..,heir_heap_no=1,heap_no=5)` gets called as part of `lock_update_delete()` which updates locks when a row gets phisically deleted (as opposed to simply delete-marked).
       The reason we can phisically remove the row is because we are in the middle of `row_undo_ins()` which as the documentation says:
```
/** Undoes a fresh insert of a row to a table. A fresh insert means that
 the same clustered index unique key did not have any record, even delete
 marked, at the time of the insert.  InnoDB is eager in a rollback:
 if it figures out that an index record will be removed in the purge
 anyway, it will remove it in the rollback.
```
        so we know that no active read view needs this row, as it hasn't existed before we started.
        As said before heap_no 1 always means page supremum pseudo-record, thus the function was asked to inherit locks from the row being removed (heap_no=5) to supremum pseudo-record (heir_heap_no=1).
        Which makes perfect sense, given that it was the largest phisicall row, so after its removal its old spot will belong to the gap before supremum.
        Given that `lock_protect_locks_till_statement_end()` was called earlier, even in low isolation levels, this operation will lead to calling
        5.1 `lock_rec_add_to_queue(LOCK_REC | LOCK_GAP | LOCK_X,...,heap_no:1,{name:"PRIMARY"},...)`.
             Which by convention used in InnoDB will ignore LOCK_GAP flag for heap_no:1, as its assumed that locks on supremum are always gap locks.
             At this point our locks look like this:
```
mysql> SELECT ENGINE_TRANSACTION_ID,OBJECT_NAME,INDEX_NAME,LOCK_TYPE,LOCK_MODE,LOCK_STATUS,LOCK_DATA FROM performance_schema.data_locks;
+-----------------------+-------------+--------------+-----------+-----------+-------------+------------------------+
| ENGINE_TRANSACTION_ID | OBJECT_NAME | INDEX_NAME   | LOCK_TYPE | LOCK_MODE | LOCK_STATUS | LOCK_DATA              |
+-----------------------+-------------+--------------+-----------+-----------+-------------+------------------------+
|                 12490 | ttable      | NULL         | TABLE     | IX        | GRANTED     | NULL                   |
|                 12490 | ttable      | index_unique | RECORD    | X         | GRANTED     | 699422, 2              |
|                 12490 | ttable      | PRIMARY      | RECORD    | X         | GRANTED     | supremum pseudo-record |
+-----------------------+-------------+--------------+-----------+-----------+-------------+------------------------+
3 rows in set (0.00 sec)
```
             the 2772975965448:20:4:5:2772943753096 lock is gone, and now there is 2772975965448:20:4:1:2772943753448
6. `lock_sec_rec_read_check_and_lock(lock_duration_t::REGULAR,...,rec,{name:"index_unique"},...,LOCK_X,LOCK_REC_NOT_GAP)` gets called
        The `rec` has heap_no==3 and `(rec[1]*256+rec[2])*256+rec[3]` is 699422 and `rec[11]` is 2. Note how this matches the <foo:699422,id:2> entry in the secondary index.
        You may wonder why are we here again, haven't we already locked this step 3.?
        Well, then it was a different lock mode and reason - 3 was during uniqueness check, and now we are reading secondary index value, before we can modify it and write it back.
        That is we are handling the "UPDATE" part now, and we want to update the secondary index which in InnoDB requires to first fetch the secondary index entry.
        Yes, this all seems a bit wasteful, given that the row wasn't actually modified at all (new and old `foo` are the same),
        but this code doesn't seem to be very optimized for the "foo=VALUES(foo) on conflict on foo" case, which I can forgive, given how strange it looks to me, too.
        Also AFAIU at this point we don't really know the old conflicting rows' values, and this is precisely how learn them: to fetch them we (might) need to look into PRIMARY index,
        but to know the id to search for we have to first consult secondary index.
        As expected the function concludes we already have an even stronger lock, and does nothing.
7. `lock_clust_rec_read_check_and_lock(lock_duration_t::REGULAR,..rec,{name:"PRIMARY"},..,LOCK_X,LOCK_REC_NOT_GAP..)` gets called as part of the "we need to read the row before write"
        Again `rec[6]*256+rec[7]` is 2.
        7.1. It has to perform `lock_rec_convert_impl_to_expl()` check
             If you wonder how come we haven't checked for implicit locks during previous steps, here is my explanation:
               1. `lock_table_create` table locks are a completely separate things from record locks,
               2. `lock_rec_insert_check_and_lock` doesn't care about locks on records themselves, just gaps
               3. `lock_sec_rec_read_check_and_lock` is smart enough to figure out by reading the max trx id on the page that nobody was touching it
               4. `lock_rec_convert_active_impl_to_expl` *assumes* that it is the active trx holding implicit lock (hence the name!)
               5. `lock_rec_inherit_to_gap` this one is interesting one..IIRC there was some discussion if perhaps it should convert implicit-to-explicit before inheriting them, TODO: I need to investigate this
               6. `lock_sec_rec_read_check_and_lock` same as 4.
             The check concludes the row was not modified by active transaction, so no conversion is needed.
        7.2. It calls `lock_rec_lock(..,SELECT_ORDINARY,..,LOCK_REC_NOT_GAP|LOCK_X,heap_no=3,{name:"PRIMARY"}) and successfully creates another lock, so now we have:
```
mysql> SELECT ENGINE_TRANSACTION_ID,OBJECT_NAME,INDEX_NAME,LOCK_TYPE,LOCK_MODE,LOCK_STATUS,LOCK_DATA FROM performance_schema.data_locks;
+-----------------------+-------------+--------------+-----------+---------------+-------------+------------------------+
| ENGINE_TRANSACTION_ID | OBJECT_NAME | INDEX_NAME   | LOCK_TYPE | LOCK_MODE     | LOCK_STATUS | LOCK_DATA              |
+-----------------------+-------------+--------------+-----------+---------------+-------------+------------------------+
|                 12490 | ttable      | NULL         | TABLE     | IX            | GRANTED     | NULL                   |
|                 12490 | ttable      | index_unique | RECORD    | X             | GRANTED     | 699422, 2              |
|                 12490 | ttable      | PRIMARY      | RECORD    | X,REC_NOT_GAP | GRANTED     | 2                      |
|                 12490 | ttable      | PRIMARY      | RECORD    | X             | GRANTED     | supremum pseudo-record |
+-----------------------+-------------+--------------+-----------+---------------+-------------+------------------------+
4 rows in set (0.00 sec)
```

That's all steps, related to locking, I hope! :)
[5 Mar 10:29] Jakub Lopuszanski
OK, so how did it look before the Bug #25966845 fix?
Let me check on a6c1dd2c99f4a commit just before this fix, which was in the 8.0.15-tr era.

(I'm on VS 2019, so compiling this old code requires backporting include/boost_1_68_0/patches/boost/proto/generate.hpp and many changest to router/{src,test} but this does not impact the experiment in any way)

First of all deadlock.sh does not report any deadlocks - which agrees with what we'd expect.
So, how the steps involved in the
```
INSERT INTO ttable (`foo`) VALUES (699422) ON DUPLICATE KEY UPDATE foo = VALUES (foo)
```
differ?

In step 3. `lock_sec_rec_read_check_and_lock` does not have lock_duration_t support yet, so it will not call `lock_protect_locks_till_statement_end`. 
However there were other (more buggy) ways of making sure that locks created for constraint checks will be inherited anyway, so don't panic, just yet.

There is no step 4. as `lock_rec_convert_active_impl_to_expl` does not yet been implemented, so we don't convert implicit lock to explicit lock for the record we are about to phisically remove (ups!).

So, we go to step 5. `lock_rec_inherit_to_gap(..,..,heir_heap_no=1,heap_no=5)` which unfortunately sees no explicit locks on the `heap_no==5` 
(because we have not called `lock_rec_convert_active_impl_to_expl`), so decides there is nothing to inherit (ups!).

At this point we only have:
```
+-----------------------+-------------+--------------+-----------+-----------+-------------+-----------+
| ENGINE_TRANSACTION_ID | OBJECT_NAME | INDEX_NAME   | LOCK_TYPE | LOCK_MODE | LOCK_STATUS | LOCK_DATA |
+-----------------------+-------------+--------------+-----------+-----------+-------------+-----------+
|                  2346 | ttable      | NULL         | TABLE     | IX        | GRANTED     | NULL      |
|                  2346 | ttable      | index_unique | RECORD    | X         | GRANTED     | 699422    |
+-----------------------+-------------+--------------+-----------+-----------+-------------+-----------+
2 rows in set (7.46 sec)
```

Step 6. looks similar (although there is no `duration` argument).
Step 7. is exactly the same

So, as you can see, the difference in behaviour is precisely where we should expect it:
the old code forgot to convert implicit lock on the temporarly inserted row to explicit lock, 
and thus it was not inherited and thus there is no gap protecting the proof that there was no conflict on primary key.
(Note, that in case of autoincrement column, one could perhaps argue, that we don't need such proofs and gap locks,
but in such a complicated system as InnoDB we prefer to do things in generic way whenever possible, 
so we simply apply the same rule to clustered index records regardless if their ID was given to us by user (as in the f1=2 case!) or by autoincrement sequence generator.)
[5 Mar 10:29] Jakub Lopuszanski
As further proof that both versions of code create Next Key Locks on secondary index I've inspected the code at rev a6c1dd2c99f4a.
The decision that the lock on secondary index in step 3. covers also the gap (not just record) follows from:
```
    const ulint lock_type =
        index->table->skip_gap_locks() ? LOCK_REC_NOT_GAP : LOCK_ORDINARY;
```
and that it is X (not S) follows from:
```
    } else if (allow_duplicates) {
```
At most recent mysql-trunk this logic is the same.

So, the bug fix did not change the kind of lock taken for secondary index in any way known to me.

What it DID change however is the kind of lock used for row_ins_duplicate_error_in_clust() which checks for conflicts in PRIMARY (a.k.a. clustered) index.
Before the bug fix we used a lock covering both record and gap (Next Key Lock), and after the fix: just the record.
This change follows from realisation, that in case we have to remove the row we will properly inherit the lock as a gap lock, and if we don't remove it, then there is no need to protect the gap.
So, this should increase parallelism. 
This change was made possible by properly making sure that locks are converted from implicit to explict before removal of temporarily inserted row. 

OK, but if the pre-bugfix version of row_ins_duplicate_error_in_clust() function created X locks covering both row and gap, then how come there was any problem with the "f1=2 gap locking" in the first place?
Well, this function is not called at all in case the cursor found no conflicting row.
That is, this function only locks rows which already existed before insert which are proof of conflict, but it does not lock the gap which is a proof of no conflict - it's not its responsibility.

Hope it helps.
[5 Mar 10:30] Jakub Lopuszanski
Now, for the original example from Przemyslaw Malkowski's bug report.
It seems to be very different from deadlock.sh, in particular each transaction seems to be inserting just one row,
and we saw that the source of deadlock in deadlock.sh was that each transaction tried to insert two.
That is the problem was when they wanted to insert the second row into the gap which was protected as evidence required for decisions taken during insert of the first one.
If you think about what steps are involved in 
```
REPLACE INTO test SET id=NULL, a='foobar', b = 2, d = 'boobar'
```` 
when (a,b,d) conflicts with already existing row, you will realize that the situation is actually quite similar!
Why?
Because:
1. we need to preserve evidence that the conflict WAS NOT on id column, so again we create a gap lock before supremum pseudo-record
2. we need to "update" the old row with (a='foobar',b=2,d='boobar') with the new id, BUT...
3. to update PRIMARY KEY of a row you actually need to remove old one and insert a new one, as there is no simple way to modify a B-tree leaf in place
So, there you have it: we need to perform an insert into the gap that is protected not only by us, but possibly also by a concurrent query.

Arguably, in both situations (original report and deadlock.sh) one might try to figure out a new, improved design, in which we try to handle a situation where autoincrement is involved in some other way,
to exploit the intuitive notion that autoincrement values can not cause duplication and thus we don't have to preserve evidence of the lack of conflict for them.
I estimate that it would be highly non-trivial to implement correctly, and IMHO classifies as a feature request (as opposed to bug).

Best regards,
Kuba
[5 Mar 12:10] Jakub Lopuszanski
May I also propose a workaround for Przemyslaw Malkowski (not sure if practical), instead of:
```
REPLACE INTO test SET id=NULL, a='foobar', b = 2, d = 'boobar'
```
you could perhaps use this:
```
BEGIN;
DELETE FROM test WHERE a='foobar' AND b = 2 AND d = 'boobar'; 
INSERT INTO test SET id=NULL, a='foobar', b = 2, d = 'boobar';
COMMIT;
```
I hope this gives the same result? (Which AFAIU is some form of "move to front"  each time we touch a,b,d?).
Such transactions may also deadlock from time to time (in particular if full scan is used, I guess), but in my testing it seems to be much less frequent than with the original REPLACE.
Intuitively, the reason why this should create less deadlocks is that now the locks are taken in following order:
1. lock the secondary index record (without any gaps)
2. lock the primary index record (without any gaps)
3. remove both records
4. create a new record at the end of the primary index (with implicit lock)
5. create a new record in secondary index (in the process locking gaps around the removed and added records)
Even if another transaction does same steps in parallel (but for different secondary index values) it will lock a different gap in step 1, a different record in step 2, and in step 4 both will succeed as the gap before supremum is not locked, and only in step 5. one of them might be forced to wait because latching the region in secondary index corresponding to given secondary index values is done in a way which latches one record too much (see Bug #98639 "Redundant row-level locking for non-unique index" and https://forums.mysql.com/read.php?22,684356,684482#msg-684482).
But this will not cause a deadlock - just a delay.
So, the main difference between this approach and the one with REPLACE, is that we do not create the subsequence of:
- create row at the end of table
- delete row at the end of table
- create row at the end of table
which leads to deadlocks when done in parallel, because the delete creates a gap lock which has shared lock semantics (even if "officially" X).
Usually, if you go back-and-forth in the shared-lock(A)->something->exclusive-lock(A) fashion you risk a deadlock during the second lock acquisition on A, as there might already be someone who shared-locked it.
So, you have to either avoid going back to A, or take exclusive-lock on A to begin with, but that is impossible in current implementation and interpretation of gap locks.
Again, changing X GAP locks to be really "exclusive" (as in: prohibiting others from acquiring X GAP lock) would be a feature-request.
And so would be avoiding the spurious "off-by-one" lock on record in the secondary index beyond the protected range.
[6 Mar 12:18] Thanh Nguyen
Thank you Jakub Lopuszanski very much. So clear and detailed explanation.
[1 Apr 14:14] Przemyslaw Malkowski
Jakub, thank you very much for detailed explanation and for the suggestions. However the test case was more to prove that in general deadlocks are more often in a simple way, not a real production case.
[4 Sep 0:56] Yoshimi Takahashi
Hello,

I wait for fix of this problem.
Is it possible to fix ?
[4 Sep 2:04] Donghoon Lee
We're having the same problem. Please give me feedback.
[4 Sep 10:11] Jakub Lopuszanski
Hello, as I wrote above:

> one might try to figure out a new, improved design, in which we try to handle a situation where autoincrement is involved in some other way,
to exploit the intuitive notion that autoincrement values can not cause duplication and thus we don't have to preserve evidence of the lack of conflict for them.
> I estimate that it would be highly non-trivial to implement correctly, and IMHO classifies as a feature request (as opposed to bug).

"Deadlock happens" is not a bug - deadlocks should be handled by a properly written application, as they are one of possible outcomes of each transaction.

"Deadlocks happen more often than before" is also not a bug (Yes, it might be a symptom of a bug if the change was a surprising result of some unrelated change we haven't thought through. But in our case, the deadlocks happen more often because we lock more often, and this was an intended change)

"I'd like deadlocks to happen less often" is a feature request in my opinion, unless one can point out where we make a mistake by taking a lock.

It would really help if you provide real-world queries and what deadlock cycles you observe. As you can see above, I was quite serious in devoting time to analysing what happens exactly in each situation reported by users. (Also, I was a bit disappointed, that in the end the response was: "yeah, but this wasn't really a representative query". Then, please provide representative queries, so I can help.)

I suspect from the examples I've seen so far, that what people are trying to do here is inserting many rows in single query (or in several queries but inside a single transaction) to a table with secondary unique indexes, autoincrement, and either "ON DUPLICATE ..." clause or "REPLACE INTO.." or "INSERT IGNORE .." from multiple threads.
These situations share some characteristics:
- they try to insert rows into same range of primary key (because AUTOINCREMENT attempts to place rows at the end)
- InnoDB needs to "remember" which column caused a conflict (as there is more than one unique column) and thus it needs to lock a sufficient fragment of the table to protect "the proof" that the conflict didn't happen on some columns but did happen on some other column (arguably INSERT IGNORE doesn't really need to know which column caused a conflict, it would suffice to know that ANY conflict happened, but it's unclear to me how to achieve it using locking primitives alone)

This leads to problems because AUTOINCREMENT column rarely causes a conflict, thus PRIMARY key needs to be locked to prove there was no conflict, and unfortunately the locked part is exactly the same as the inserts are going to (because of AUTOINCREMENT).

Possible workarounds:
A) Try inserting one row at a time. 
This is perhaps too slow. Also it doesn't help if the PRIMARY KEY needs updating as seen in Przemysław's case of REPLACE INTO
B) Try replacing the "REPLACE INTO ..." with a two step transaction BEGIN; DELETE rows which conflict; INSERT new rows; COMMIT. 
This solves the problem with A).
Note: this helps because we are implicitly conveying our background knowledge that the conflict can not be on the PRIMARY KEY column, thus we avoid locks on it.
Note: this doesn't check for conflicts within/among the new rows themeselves - you either have to check for that yourself in application, or use a more complicated query.
C) Try replacing the "INSERT...ON DUPLICATE KEY UPDATE .." with a transaction
BEGIN; SELECT rows conflicting on secondary indexes; UPDATE found conflicting rows; INSERT new rows; COMMIT; 
Note: this has similar benefits and problems as above.
D) Run server with `--innodb-autoinc-lock-mode=0`. 
This solves the problem by introducing a different problem: the transactions which try to insert at the end of the table will now compete for an explicit AUTOINC lock on the table, and thus they will be allowed to operate one at a time, which eliminates chance for the deadlock.
However, this is clearly suboptimal for "regular" operation of the database. Might be worth a shot during data import scenarios, though.
(BTW. I believe this is what has happened to Liz from HoneycombIO https://twitter.com/lizthegrey/status/1291071492894203904 although she didn't seem to mention having unique secondary index, nor inserting more than one row per transaction. I base my belief on analysing her deadlock log and the code)
E) Try prepending BEGIN; SELECT MAX(id) FROM .. FOR UPDATE; to the begging of your query/transaction. 
This will make sure that transactions will perform their actions in the order in which they were granted the lock on row with maximal id.
Actually you can use any other row, table, or even GET_LOCK or even a latch inside your app for this purpose.
F) Run the inserting transactions with READ COMMITTED isolation level 
This will simply turn of the mechanism of protecting the gap after the record which caused a conflict, which I believe to be unsafe for replication (at least, for statement-based replication - I am not a replication expert though).
G) Try to use INSERT... ON DUPLICATE KEY UPDATE... instead of REPLACE INTO .. and then apply A)
The difference is that instead of creating a new row with new id and deleting the old one with old id, you will modify the old record in place (which should be faster, and less deadlock-prone).
This assumes some changes in application to handle the situation that id might be the old one (and not the new one generated by autoincrement).
H) Consider using the UNIQUE column as the id
This might be a big change to the application, but if you care for a column to be unique (you have UNIQUE constraint on it), but don't care much about id values (you use AUTOINC) then perhaps a reasonable change would be to use the unique column to identify rows?
I) Try to insert from one thread only
This might be too slow for some reason. 
(But it looks at least plausible to me, that you really need to serialize these inserts anyway if you are serious about maintaining uniqueness checks, at least in some cases, so maybe the change from multithreaded to singlethreaded import is not so big?)
J) Try to pre-allocate disjoint sets of PRIMARY KEY to the inserting threads
One could use some scheme, or external service to obtain id values guaranteed to be disjoint (MySQL/Redis/Memcached whatever) and then proceed with inserting into disjoint fragments of PRIMARY KEY from each inserter.
This could help avoid deadlocks, because GAP locks taken on PRIMARY KEY (if any) would protect disjoint gaps.
K) Shard the table and let each inserter work on specific shard
This actually sounds like a good idea to me anyway. You can either shard using modulo N, and --auto-increment-increment=N, or devote disjoint ranges of ids to each shard (say in a consistent hashing-like scheme).

I'm hope it doesn't sound like I believe that any of the above is trivial.
I just think each of above would be easier than changing the server code in this area.

If you want an improvement on the server side, then motivating, real-world examples of queries are welcomed, as this could be used as input for a design of a new feature.
Also, if you believe there is some bug in the locks we take, please provide example of queries involved and performance_schema.data_locks content (in case of 8.x) or at least full output of `show engine innodb status`.
[25 Oct 15:14] huahrw wer
Thank you Jakub Lopuszanski. Learned a lot from this.
However, as you said above  "Well, this function is not called at all in case the cursor found no conflicting row.
That is, this function only locks rows which already existed before insert which are proof of conflict, but it does not lock the gap which is a proof of no conflict - it's not its responsibility."
I tried the example in 8.0.12 MySQL Community Server and found in  pre-bugfix version ,if there is a conflicting row, and there is no gap locking. Instead, if there no conflicting at all, then there is gap locking.
Hence, "insert into t1 values(2, 10, 200) on duplicate key update f3 = 120;" found   conflict so there is no gap locking, and another transaction "insert into t1 values(2, 20, 300) on duplicate key update f3 = 500;" can execute.
I wonder if I understand it wrong .
[25 Oct 15:52] huahrw wer
Last Comment was Wrong, update:
I tried the example in 8.0.12 MySQL Community Server and found in  pre-bugfix version ,if there is a conflicting row, and there is a next-key lock(which just on the index record and a gap lock on the gap before the index record). Instead, if there no conflicting at all, then there is gap locks before the index record and after the index lock. 
so the Bug#25966845 also remove the locks when there is no conflicting at all?