MySQL Bugs: #116815: INSERT locks primary key supremum after secondary index duplicate key collision

Bug #116815	INSERT locks primary key supremum after secondary index duplicate key collision
Submitted:	28 Nov 2024 3:51	Modified:	4 Dec 2024 17:38
Reporter:	Daniel Nichter (OCA)	Email Updates:
Status:	Analyzing	Impact on me:	None
Category:	MySQL Server: InnoDB storage engine	Severity:	S2 (Serious)
Version:	8.0.30	OS:	Any
Assigned to:	MySQL Verification Team	CPU Architecture:	Any

Description:
INSERT locks the primary key (PK) supremum after handling a secondary index duplicate key collision.
("After handling" == IGNORE ? commit or lock wait timeout : dupe key error)

A PK supremum lock blocks all new INSERT/rows, which is why severity S2.
This specific issue (PK supremum lock) was mentioned in bug 68021, but it seems to have gotten lost in the details.

All locks are released on commit, but apps don't always commit immediately. Under load (or triggered by a slow/stalled trx), this can cause a downward spiral: inserts pile up,; that makes them slower; that makes the app "push" harder; that increases the pileup and slowness; repeat until things grind to a halt.

This is _not_ an ON DUPLICATE KEY UPDATE (ODKU) issue.

This only occurs when the key collision is in a unique secondary index. (A PK collision creates an S,REC_NOT_GAP lock, nothing more.)

I'd like to think I know a fair bit about InnoDB locking, but after extensive testing, researching, and thinking about why a PK supremum lock might be needed, I cannot find or imagine a reason. If there is one, it'll be fascinating to learn. Else, if it's a bug, then it's probably gone undetected because it requires a relatively poor workload: excessive dupe key collisions.

How to repeat:
~~~~~
SETUP
~~~~~

CREATE TABLE `t` (
  `id` int NOT NULL AUTO_INCREMENT,
  `a` char(1) NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `UNIQ_a` (`a`)
) ENGINE=InnoDB;

insert into t values (null,'a'), (null,'z');

~~~~~~~
EXECUTE
~~~~~~~

-- Trx 1
begin; insert into t (a) values ('m');

-- Trx 2
begin; insert into t (a) values ('m');
/* blocked... */

/* Data locks */
SELECT engine_transaction_id, index_name, lock_type, lock_mode, lock_status, lock_data FROM performance_schema.data_locks WHERE  object_name = 't' AND lock_type='RECORD' order by ENGINE_TRANSACTION_ID;
+-----------------------+------------+-----------+---------------+-------------+-----------+
| engine_transaction_id | index_name | lock_type | lock_mode     | lock_status | lock_data |
+-----------------------+------------+-----------+---------------+-------------+-----------+
|               1773773 | UNIQ_a     | RECORD    | X,REC_NOT_GAP | GRANTED     | 'm', 3    |
|               1773774 | UNIQ_a     | RECORD    | S             | WAITING     | 'm', 3    |
+-----------------------+------------+-----------+---------------+-------------+-----------+
/*
  NOTES: trx 1 (1773773) implicit-to-explicit conversion as expected; trx 2 (...74) waiting because trx 1 hasn't committed
*/

-- Trx 1
commit;

-- Trx 2
ERROR 1062 (23000): Duplicate entry 'm' for key 't.UNIQ_a'

/* Data locks */
SELECT engine_transaction_id, index_name, lock_type, lock_mode, lock_status, lock_data FROM performance_schema.data_locks WHERE  object_name = 't' AND lock_type='RECORD' order by ENGINE_TRANSACTION_ID;
+-----------------------+------------+-----------+-----------+-------------+------------------------+
| engine_transaction_id | index_name | lock_type | lock_mode | lock_status | lock_data              |
+-----------------------+------------+-----------+-----------+-------------+------------------------+
|               1773774 | UNIQ_a     | RECORD    | S         | GRANTED     | 'm', 3                 |
|               1773774 | PRIMARY    | RECORD    | X         | GRANTED     | supremum pseudo-record |
+-----------------------+------------+-----------+-----------+-------------+------------------------+
/*
  Last row shows the problem. But to complete the example, insert a row _after_ the gap (m, 3):
*/

-- Trx 3
begin; insert into t (a) values ('x');
/* blocked... */

/* Data locks */
+-----------------------+------------+-----------+--------------------+-------------+------------------------+
| engine_transaction_id | index_name | lock_type | lock_mode          | lock_status | lock_data              |
+-----------------------+------------+-----------+--------------------+-------------+------------------------+
|               1773774 | UNIQ_a     | RECORD    | S                  | GRANTED     | 'm', 3                 |
|               1773774 | PRIMARY    | RECORD    | X                  | GRANTED     | supremum pseudo-record |
|               1773777 | PRIMARY    | RECORD    | X,INSERT_INTENTION | WAITING     | supremum pseudo-record |
+-----------------------+------------+-----------+--------------------+-------------+------------------------+
/*
  Value 'm' should be out of the gap (after 'm'), but it still blocks on insert.
*/

Suggested fix:
Fix: don't lock PK supremum.

Correction to last line:
- "Value 'm' should be out of the gap (after 'm'), but it still blocks on insert."
+ "Value 'x' should be out of the gap (after 'm'), but it still blocks on insert."
----------^

AFAICT, the way the
```
|               1773774 | PRIMARY    | RECORD    | X         | GRANTED     | supremum pseudo-record |
```
is brought into existence is roughly this:
...Actually...Before I explain this, there is one important detail: the way AUTO_INCREMENT is handled is that the value of id=2 is assigned pretty early in the game, so when reading the following, imagine that trx#2 is doing INSERT INTO t VALUES(2,'m')...You may question if this is a bug in itself, but that would be a separate discussion, while here, I'd like to explain why you see what you see w.r.t. locking, under assumption that id=2...

1. Trx#2 first inserts a record {id:2,val:'m'} to the clustered index
2. Then it tries to perform duplicate check in secondary index, where, as you know, it waits for Trx#1 to make up its mind. Once Trx#1 commits, it is now clear that {a:'m',id:2} is a duplicate conflict on UNIQ_a index on `a` column with the {val:'m',id:1} already inserted by Trx#1, so Trx#2 have to trx_rollback_to_savepoint(Trx#2), i.e. undo whatever it did so far in the current INSERT statement
3. This involves removing the record with `id=2` from the clustered index
4. Usually, this would be done by merely delete-marking it, so it is later physically removed. But, here, this is a newly inserted row, and we are about to pop the undo log record about it from the undo log stack (we are rolling back to savepoint) so instead of relying on background undo log purge, we have to physically remove it eagerly from the Trx#2 thread
5. But, if we remove the record with `id=2` physically, then another transaction, say Trx#4, could do `INSERT INTO t (id, val) VALUES (2,"x")` and commit before Trx#2. You might think this is fine, but then the observable serialized commit history would be:
 "Trx#4 committed 'before' Trx#2, so it inserted a row with id=2 'before' Trx#2".
You might think it is fine, but actually that changes the verdict w.r.t. which column has caused a constraint violation: if Trx#4 really happened before Trx#2 then the INSERT INTO t VALUES (2,'m') should fail on testing if id=2 is unique, which it wouldn't. You might think this doesn't matter, but it could matter. It might matter to a Replica. It might matter if there are triggers for specific columns. It is important which of the columns was the problem and we want to ensure that the locks taken by Trx#2 are such, that the only possible history would end up with same verdict. Thus we have to guard the hole created by record {id:2,val:'m'} in the clustered index, so a history in which someone else inserts id:2 before we commit is impossible. This is done by calling:
```
void row_convert_impl_to_expl_if_needed(btr_cur_t *cursor, undo_node_t *node) {
  /* In case of partial rollback implicit lock on the
  record is released in the middle of transaction, which
  can break the serializability of IODKU and REPLACE
  statements. Normal rollback is not affected by this
  because we release the locks after the rollback. So
  to prevent any other transaction modifying the record
  in between the partial rollback we convert the implicit
  lock on the record to explicit. When the record is actually
  deleted this lock will be inherited by the next record.  */

  if (!node->partial || node->trx.isolation_level < trx_t::REPEATABLE_READ) {
    return;
  }

  ut_ad(node->trx.in_rollback);
  auto index = cursor->index;
  auto rec = btr_cur_get_rec(cursor);
  auto block = btr_cur_get_block(cursor);
  auto heap_no = page_rec_get_heap_no(rec);

  if (heap_no != PAGE_HEAP_NO_SUPREMUM && !dict_index_is_spatial(index) &&
      !index->table->is_temporary() && !index->table->is_intrinsic()) {
    lock_rec_convert_impl_to_expl(block, rec, index,
                                  Rec_offsets().compute(rec, index));
  }
}
```
which as the code comment explains leads to two sub-steps:
6. first the implicit lock held on clustered index record (implied by DB_TRX_ID field of the record being equal to active trx#2's id) is converted to an explicit lock (struct lock_t in memory with LOCK_REC|LOCK_X|LOCK_REC_NOT_GAP) so that it is not lost
7. then we remove the record physically, which in turn causes inheritance of LOCK_REC_NOT_GAP onto the resulting gap by calling `lock_rec_inherit_to_gap(..)` from `lock_update_delete(..)`, which has this helpful code comment:
```
  /* If session is using READ COMMITTED or READ UNCOMMITTED isolation
  level, we do not want locks set by an UPDATE or a DELETE to be
  inherited as gap type locks.  But we DO want S-locks/X-locks(taken for
  replace) set by a consistency constraint to be inherited also then. */

  /* We also dont inherit these locks as gap type locks for DD tables
  because the serialization is guaranteed by MDL on DD tables. */

  /* Constraint checks place LOCK_S or (in case of INSERT ... ON DUPLICATE
  UPDATE... or REPLACE INTO..) LOCK_X on records.
  If such a record is delete-marked, it may then become purged, and
  lock_rec_inheirt_to_gap will be called to decide the fate of each lock on it:
  either it will be inherited as gap lock, or discarded.
  In READ COMMITTED and less restricitve isolation levels we generally avoid gap
  locks, but we make an exception for precisely this situation: we want to
  inherit locks created for constraint checks.
  More precisely we need to keep inheriting them only for the duration of the
  query which has requested them, as such inserts have two phases : first they
  check for constraints, then they do actual row insert, and they trust that
  the locks set in the first phase will survive till the second phase.
  It is not easy to tell if a particular lock was created for constraint check
  or not, because we do not store this bit of information on it.
  What we do, is we use a heuristic: whenever a trx requests a lock with
  lock_duration_t::AT_LEAST_STATEMENT we set trx->lock.inherit_all, meaning that
  locks of this trx need to be inherited.
  And we clear trx->lock.inherit_all on statement end. */
```
As the "resulting gap" is the one at the end of the page, we create a lock on supremum.
Note: supremum is not a real "point" in keyspace, every lock on "it" is always interpreted as a lock only on the gap "before it" (i.e. after the last user record on the page), so "X" which you see on it, is not a "gap+record" ("next-key") lock, but rather exactly what you would expect from converting LOCK_REC_NOT_GAP|LOCK_REC|LOCK_X on {id:2} record into a LOCK_GAP|LOCK_REC|LOCK_X on a gap - a gap lock.

Could it be done better?
I don't know!
There are so many moving parts, that we try to err on the safe side.
Could we detect that binlog is off, or replication is off, or that row based replication doesn't care about side effects from triggers and ON DUPLICATE KEY UPDATE? Would it be sufficient?
Could we detect that the column in question is the primary key with AUTO_INCREMENT, *and* that the statement did not specify its value explicitly, so "of course" there could be no conflict on this particular column, so "of course" we don't have to protect the gap?
And can one have columns GENERATED from PK, and constraints on them, and violations involving them?

I am not saying it can not be done. It's just: I don't see an easy and correct way.
For more context, there are also discussions in:
[Bug#MySQL Bugs: #98324: Deadlocks more frequent since version 5.7.26](https://bugs.mysql.com/bug.php?id=98324)
[Bug#MySQL Bugs: #110266: The secondary node of the mgr cluster has a deadlock](https://bugs.mysql.com/bug.php?id=110266)

Thanks for the quick and detailed explanation! I think I understand. I'll write up some other short examples to verify my understanding.

Quick fix/workaround for future reference: in READ COMMITTED, the supremum is not locked.

The example above presumed the default, REPEATABLE READ.

Hi again. Following are 3 examples to see if I understand correctly what's happening and why. For these examples, it's the same table as above but without auto-inc and with a few more rows:

 [1, a]
 [2, b]
 {    } "gap 3"
 {    } "gap 4"
 [5, e]
 [6, f]
 (supremum)

So: INSERT INTO t VALUES (1,'a'), (2,'b'), /* gaps 3 and 4 */ (5,'e'), (6, 'f');

The key space has 4 records with a 2-record gap in the middle: id=3 ("gap 3") and id=4 ("gap 4").

For each example, order of events is the same:

 1. Trx 1 begins with INSERT INTO t (3, 'x')
 2. Trx 2 begins with a example-specific INSERT (denoted by "->")
 3. Trx 1 commits
 4. Trx 2 returns dupe key error (trx still active)

~~~~~~~~~
Example 1
~~~~~~~~~

-> INSERT INTO t (3, 'x') /* trx 2 */

Trx 2 is trying to insert the exact same row as trx 1...

Before trx 1 commit:

 [1, a]
 [2, b]
 <3, x> -- trx#1 granted PRIMARY | X,REC_NOT_GAP | 3
 <3, x> -- trx#2 waiting PRIMARY | S,REC_NOT_GAP | 3
 {    }
 [5, e]
 [6, f]
 (supremum)

After:

 [1, a]
 [2, b]
<[3, x]> -- trx#2 granted PRIMARY | S,REC_NOT_GAP | 3
 {    }
 [5, e]
 [6, f]
 (supremum)

Trx 2 gets a dupe key error on PK id=3 because the clustered key (PK) record is inserted first.
Trx 2 holds a shared record lock only on PK id=3 because that's all it saw (it didn't see gap 4 or secondary index).

~~~~~~~~~
Example 2
~~~~~~~~~

-> INSERT INTO t (4, 'x') /* trx 2 */

Trx 2 is inserting a different PK value (4) but same unique secondary index value (x)...

Before trx 1 commit:

 [1, a]
 [2, b]
 <3, x> -- trx#1 granted UNIQ_a | X,REC_NOT_GAP | 'x'
 <4, x> -- trx#2 waiting UNIQ_a | S             | 'x'
 [5, e]
 [6, f]
 (supremum)

After:

 [1, a]
 [2, b]
 [3, x]
 <    > -- trx#2 granted UNIQ_a  | S     | 'x'
 [5, e]          granted PRIMARY | X,GAP | 5
 [6, f]
 (supremum)

Trx 2 inserts PK id=4 but gets a dupe key error on secondary index a='x'; it holds a shared next-key lock on that secondary index record.
Since PK id=4 record is removed, it leaves a gap (effectively recreating gap 4) that trx 2 locks.

NOTE 1: The shared next-key lock on secondary index value 'x' prevents inserting new rows where a < 'x' (since a next-key lock is the record and any gap before it).

NOTE 2: In read committed, trx 2 does not lock the PK gap but it still acquires a next-key lock on the secondary index.

~~~~~~~~~
Example 3
~~~~~~~~~

-> INSERT INTO t (7, 'x') /* trx 2 */

Like example 2 but inserting at end of table, similar to id column being auto-inc...

Before trx 1 commit:

 [1, a]
 [2, b]
 <3, x> -- trx#1 granted UNIQ_a | X,REC_NOT_GAP | 'x'
 {    }
 [5, e]
 [6, f]
 <7, x> -- trx#2 waiting UNIQ_a | S | 'x'
 (supremum)

After:

 [1, a]
 [2, b]
 [3, x]
 {    }
 [5, e]
 [6, f]
 <    > -- trx#2 granted UNIQ_a  | S | 'x'
                 granted PRIMARY | X | supremum
 (supremum)

Same as example 2, but wrt the PK: since PK id=7 is removed _and_ it was last/great value, the resulting gap is the supremum that covers all values greater than the last record.

NOTE 3: In read committed, the supremum is not locked.

---

If my understanding is correct, then the original issue is exacerbated by the auto-inc PK because 1) the cluster key (PK) record is created first and so 2) every insert affects the end of the table (inserting into the gap covered by the supremum) before MySQL checks the unique secondary index.

> Quick fix/workaround for future reference: in READ COMMITTED, the supremum is not locked.

Huh, this puzzled me, as I was rather expecting this behaviour irrespective of isolation level. But, indeed, even in the code which I quoted above we clearly see:
```
  if (!node->partial || node->trx.isolation_level < trx_t::REPEATABLE_READ) {
    return;
  }
```
and indeed I could replicate your claim.
But, why do we have this special case in code?
I did some archaeology: it was added by Aditya in 6th revision of attempt at fixing Bug#25966845: INSERT ON DUPLICATE KEY GENERATE A DEADLOCK:
https://rb.mysql.oraclecorp.com/rb/r/20418/diff/5-6/
in repsonse to Debarun's comment:
https://rb.mysql.oraclecorp.com/rb/r/20418/#comment223769
> Make the fix specific to isolation RR and above.

No further rationale was provided.
So, I've looked into bug tracker for
Bug#25966845: INSERT ON DUPLICATE KEY GENERATE A DEADLOCK
https://mybug.mysql.oraclecorp.com/orabugs/site/bug.php?id=25966845
and found that the context for this bug was that (as the title suggests) we were trying to find a better way to solve an older bug, as the solution we had originally for it, was causing deadlocks. And that older bug was about serialization issues:
Bug#11758237 INSERT ON DUPLICATE KEY UPDATE SOMETIMES WRITES BINLOG POSITION INCORRECT 
https://mybug.mysql.oraclecorp.com/orabugs/site/bug.php?id=11758237
This is quite a long history, and I have not re-lived it all now, but what I gather from skimming is that the problems were related to statement based replication, where it was crucial that conflicts on unique columns are determined and resolved same way on the Leader as on Replica.
According to https://dev.mysql.com/doc/refman/8.4/en/binary-log-setting.html there was never a guarantee for statement-based replication to work fine unless you use REPEATABLE READ or SERIALIZABLE:
> If you are using InnoDB tables and the transaction isolation level is READ COMMITTED or READ UNCOMMITTED, only row-based logging can be used. It is possible to change the logging format to STATEMENT, but doing so at runtime leads very rapidly to errors because InnoDB can no longer perform inserts. 
This I think explains, or at least provides the context, for why the scope of that fix was narrowed down only to REPEATABLE READ or SERIALIZABLE.

As for your explanation, I think it is mostly right, although some of your ascii arts suggest a wrong mental picture, in which "gaps" are somehow tied to non-exiting keys, almost 1:1 (which could be an 'infinite' number of gaps in case of strings) so instead of:
```
 [1, a]
 [2, b]
 {    } "gap 3"
 {    } "gap 4"
 [5, e]
 [6, f]
 (supremum)
```
a better mental model is that of points and ranges between them:
```
 (page's infimum)
 {    } "gap before 1"
 [1, a]
 {    } "gap before 2"
 [2, b]
 {    } "gap before 5"
 [5, e]
 {    } "gap before 6"
 [6, f]
 {    } "gap before supremum of the page = after 6"
 (page's supremum)
```
in particular there is just one gap between 2 and 5.
A situation might be later complicated by presence of delete-marked records which also subdivide the axis if they physically exist.

Thank you, Jakub. I think we can conclude that this isn't a bug; rather, it's a grey zone wrt locking and a bad access pattern: app causing frequent unique secondary index key collisions.

Since it's not technically wrong, I don't expect it to be changed any time soon since changing things in this area is difficult and risky. But at least we now understand what's going on and why.