Bug #116815 | INSERT locks primary key supremum after secondary index duplicate key collision | ||
---|---|---|---|
Submitted: | 28 Nov 2024 3:51 | Modified: | 4 Dec 2024 17:38 |
Reporter: | Daniel Nichter (OCA) | Email Updates: | |
Status: | Analyzing | Impact on me: | |
Category: | MySQL Server: InnoDB storage engine | Severity: | S2 (Serious) |
Version: | 8.0.30 | OS: | Any |
Assigned to: | MySQL Verification Team | CPU Architecture: | Any |
[28 Nov 2024 3:51]
Daniel Nichter
[28 Nov 2024 14:07]
Daniel Nichter
Correction to last line: - "Value 'm' should be out of the gap (after 'm'), but it still blocks on insert." + "Value 'x' should be out of the gap (after 'm'), but it still blocks on insert." ----------^
[29 Nov 2024 11:13]
Jakub Lopuszanski
AFAICT, the way the ``` | 1773774 | PRIMARY | RECORD | X | GRANTED | supremum pseudo-record | ``` is brought into existence is roughly this: ...Actually...Before I explain this, there is one important detail: the way AUTO_INCREMENT is handled is that the value of id=2 is assigned pretty early in the game, so when reading the following, imagine that trx#2 is doing INSERT INTO t VALUES(2,'m')...You may question if this is a bug in itself, but that would be a separate discussion, while here, I'd like to explain why you see what you see w.r.t. locking, under assumption that id=2... 1. Trx#2 first inserts a record {id:2,val:'m'} to the clustered index 2. Then it tries to perform duplicate check in secondary index, where, as you know, it waits for Trx#1 to make up its mind. Once Trx#1 commits, it is now clear that {a:'m',id:2} is a duplicate conflict on UNIQ_a index on `a` column with the {val:'m',id:1} already inserted by Trx#1, so Trx#2 have to trx_rollback_to_savepoint(Trx#2), i.e. undo whatever it did so far in the current INSERT statement 3. This involves removing the record with `id=2` from the clustered index 4. Usually, this would be done by merely delete-marking it, so it is later physically removed. But, here, this is a newly inserted row, and we are about to pop the undo log record about it from the undo log stack (we are rolling back to savepoint) so instead of relying on background undo log purge, we have to physically remove it eagerly from the Trx#2 thread 5. But, if we remove the record with `id=2` physically, then another transaction, say Trx#4, could do `INSERT INTO t (id, val) VALUES (2,"x")` and commit before Trx#2. You might think this is fine, but then the observable serialized commit history would be: "Trx#4 committed 'before' Trx#2, so it inserted a row with id=2 'before' Trx#2". You might think it is fine, but actually that changes the verdict w.r.t. which column has caused a constraint violation: if Trx#4 really happened before Trx#2 then the INSERT INTO t VALUES (2,'m') should fail on testing if id=2 is unique, which it wouldn't. You might think this doesn't matter, but it could matter. It might matter to a Replica. It might matter if there are triggers for specific columns. It is important which of the columns was the problem and we want to ensure that the locks taken by Trx#2 are such, that the only possible history would end up with same verdict. Thus we have to guard the hole created by record {id:2,val:'m'} in the clustered index, so a history in which someone else inserts id:2 before we commit is impossible. This is done by calling: ``` void row_convert_impl_to_expl_if_needed(btr_cur_t *cursor, undo_node_t *node) { /* In case of partial rollback implicit lock on the record is released in the middle of transaction, which can break the serializability of IODKU and REPLACE statements. Normal rollback is not affected by this because we release the locks after the rollback. So to prevent any other transaction modifying the record in between the partial rollback we convert the implicit lock on the record to explicit. When the record is actually deleted this lock will be inherited by the next record. */ if (!node->partial || node->trx.isolation_level < trx_t::REPEATABLE_READ) { return; } ut_ad(node->trx.in_rollback); auto index = cursor->index; auto rec = btr_cur_get_rec(cursor); auto block = btr_cur_get_block(cursor); auto heap_no = page_rec_get_heap_no(rec); if (heap_no != PAGE_HEAP_NO_SUPREMUM && !dict_index_is_spatial(index) && !index->table->is_temporary() && !index->table->is_intrinsic()) { lock_rec_convert_impl_to_expl(block, rec, index, Rec_offsets().compute(rec, index)); } } ``` which as the code comment explains leads to two sub-steps: 6. first the implicit lock held on clustered index record (implied by DB_TRX_ID field of the record being equal to active trx#2's id) is converted to an explicit lock (struct lock_t in memory with LOCK_REC|LOCK_X|LOCK_REC_NOT_GAP) so that it is not lost 7. then we remove the record physically, which in turn causes inheritance of LOCK_REC_NOT_GAP onto the resulting gap by calling `lock_rec_inherit_to_gap(..)` from `lock_update_delete(..)`, which has this helpful code comment: ``` /* If session is using READ COMMITTED or READ UNCOMMITTED isolation level, we do not want locks set by an UPDATE or a DELETE to be inherited as gap type locks. But we DO want S-locks/X-locks(taken for replace) set by a consistency constraint to be inherited also then. */ /* We also dont inherit these locks as gap type locks for DD tables because the serialization is guaranteed by MDL on DD tables. */ /* Constraint checks place LOCK_S or (in case of INSERT ... ON DUPLICATE UPDATE... or REPLACE INTO..) LOCK_X on records. If such a record is delete-marked, it may then become purged, and lock_rec_inheirt_to_gap will be called to decide the fate of each lock on it: either it will be inherited as gap lock, or discarded. In READ COMMITTED and less restricitve isolation levels we generally avoid gap locks, but we make an exception for precisely this situation: we want to inherit locks created for constraint checks. More precisely we need to keep inheriting them only for the duration of the query which has requested them, as such inserts have two phases : first they check for constraints, then they do actual row insert, and they trust that the locks set in the first phase will survive till the second phase. It is not easy to tell if a particular lock was created for constraint check or not, because we do not store this bit of information on it. What we do, is we use a heuristic: whenever a trx requests a lock with lock_duration_t::AT_LEAST_STATEMENT we set trx->lock.inherit_all, meaning that locks of this trx need to be inherited. And we clear trx->lock.inherit_all on statement end. */ ``` As the "resulting gap" is the one at the end of the page, we create a lock on supremum. Note: supremum is not a real "point" in keyspace, every lock on "it" is always interpreted as a lock only on the gap "before it" (i.e. after the last user record on the page), so "X" which you see on it, is not a "gap+record" ("next-key") lock, but rather exactly what you would expect from converting LOCK_REC_NOT_GAP|LOCK_REC|LOCK_X on {id:2} record into a LOCK_GAP|LOCK_REC|LOCK_X on a gap - a gap lock. Could it be done better? I don't know! There are so many moving parts, that we try to err on the safe side. Could we detect that binlog is off, or replication is off, or that row based replication doesn't care about side effects from triggers and ON DUPLICATE KEY UPDATE? Would it be sufficient? Could we detect that the column in question is the primary key with AUTO_INCREMENT, *and* that the statement did not specify its value explicitly, so "of course" there could be no conflict on this particular column, so "of course" we don't have to protect the gap? And can one have columns GENERATED from PK, and constraints on them, and violations involving them? I am not saying it can not be done. It's just: I don't see an easy and correct way. For more context, there are also discussions in: [Bug#MySQL Bugs: #98324: Deadlocks more frequent since version 5.7.26](https://bugs.mysql.com/bug.php?id=98324) [Bug#MySQL Bugs: #110266: The secondary node of the mgr cluster has a deadlock](https://bugs.mysql.com/bug.php?id=110266)
[29 Nov 2024 22:28]
Daniel Nichter
Thanks for the quick and detailed explanation! I think I understand. I'll write up some other short examples to verify my understanding.
[29 Nov 2024 22:34]
Daniel Nichter
Quick fix/workaround for future reference: in READ COMMITTED, the supremum is not locked. The example above presumed the default, REPEATABLE READ.
[2 Dec 2024 5:46]
Daniel Nichter
Hi again. Following are 3 examples to see if I understand correctly what's happening and why. For these examples, it's the same table as above but without auto-inc and with a few more rows: [1, a] [2, b] { } "gap 3" { } "gap 4" [5, e] [6, f] (supremum) So: INSERT INTO t VALUES (1,'a'), (2,'b'), /* gaps 3 and 4 */ (5,'e'), (6, 'f'); The key space has 4 records with a 2-record gap in the middle: id=3 ("gap 3") and id=4 ("gap 4"). For each example, order of events is the same: 1. Trx 1 begins with INSERT INTO t (3, 'x') 2. Trx 2 begins with a example-specific INSERT (denoted by "->") 3. Trx 1 commits 4. Trx 2 returns dupe key error (trx still active) ~~~~~~~~~ Example 1 ~~~~~~~~~ -> INSERT INTO t (3, 'x') /* trx 2 */ Trx 2 is trying to insert the exact same row as trx 1... Before trx 1 commit: [1, a] [2, b] <3, x> -- trx#1 granted PRIMARY | X,REC_NOT_GAP | 3 <3, x> -- trx#2 waiting PRIMARY | S,REC_NOT_GAP | 3 { } [5, e] [6, f] (supremum) After: [1, a] [2, b] <[3, x]> -- trx#2 granted PRIMARY | S,REC_NOT_GAP | 3 { } [5, e] [6, f] (supremum) Trx 2 gets a dupe key error on PK id=3 because the clustered key (PK) record is inserted first. Trx 2 holds a shared record lock only on PK id=3 because that's all it saw (it didn't see gap 4 or secondary index). ~~~~~~~~~ Example 2 ~~~~~~~~~ -> INSERT INTO t (4, 'x') /* trx 2 */ Trx 2 is inserting a different PK value (4) but same unique secondary index value (x)... Before trx 1 commit: [1, a] [2, b] <3, x> -- trx#1 granted UNIQ_a | X,REC_NOT_GAP | 'x' <4, x> -- trx#2 waiting UNIQ_a | S | 'x' [5, e] [6, f] (supremum) After: [1, a] [2, b] [3, x] < > -- trx#2 granted UNIQ_a | S | 'x' [5, e] granted PRIMARY | X,GAP | 5 [6, f] (supremum) Trx 2 inserts PK id=4 but gets a dupe key error on secondary index a='x'; it holds a shared next-key lock on that secondary index record. Since PK id=4 record is removed, it leaves a gap (effectively recreating gap 4) that trx 2 locks. NOTE 1: The shared next-key lock on secondary index value 'x' prevents inserting new rows where a < 'x' (since a next-key lock is the record and any gap before it). NOTE 2: In read committed, trx 2 does not lock the PK gap but it still acquires a next-key lock on the secondary index. ~~~~~~~~~ Example 3 ~~~~~~~~~ -> INSERT INTO t (7, 'x') /* trx 2 */ Like example 2 but inserting at end of table, similar to id column being auto-inc... Before trx 1 commit: [1, a] [2, b] <3, x> -- trx#1 granted UNIQ_a | X,REC_NOT_GAP | 'x' { } [5, e] [6, f] <7, x> -- trx#2 waiting UNIQ_a | S | 'x' (supremum) After: [1, a] [2, b] [3, x] { } [5, e] [6, f] < > -- trx#2 granted UNIQ_a | S | 'x' granted PRIMARY | X | supremum (supremum) Same as example 2, but wrt the PK: since PK id=7 is removed _and_ it was last/great value, the resulting gap is the supremum that covers all values greater than the last record. NOTE 3: In read committed, the supremum is not locked. --- If my understanding is correct, then the original issue is exacerbated by the auto-inc PK because 1) the cluster key (PK) record is created first and so 2) every insert affects the end of the table (inserting into the gap covered by the supremum) before MySQL checks the unique secondary index.
[2 Dec 2024 10:29]
Jakub Lopuszanski
> Quick fix/workaround for future reference: in READ COMMITTED, the supremum is not locked. Huh, this puzzled me, as I was rather expecting this behaviour irrespective of isolation level. But, indeed, even in the code which I quoted above we clearly see: ``` if (!node->partial || node->trx.isolation_level < trx_t::REPEATABLE_READ) { return; } ``` and indeed I could replicate your claim. But, why do we have this special case in code? I did some archaeology: it was added by Aditya in 6th revision of attempt at fixing Bug#25966845: INSERT ON DUPLICATE KEY GENERATE A DEADLOCK: https://rb.mysql.oraclecorp.com/rb/r/20418/diff/5-6/ in repsonse to Debarun's comment: https://rb.mysql.oraclecorp.com/rb/r/20418/#comment223769 > Make the fix specific to isolation RR and above. No further rationale was provided. So, I've looked into bug tracker for Bug#25966845: INSERT ON DUPLICATE KEY GENERATE A DEADLOCK https://mybug.mysql.oraclecorp.com/orabugs/site/bug.php?id=25966845 and found that the context for this bug was that (as the title suggests) we were trying to find a better way to solve an older bug, as the solution we had originally for it, was causing deadlocks. And that older bug was about serialization issues: Bug#11758237 INSERT ON DUPLICATE KEY UPDATE SOMETIMES WRITES BINLOG POSITION INCORRECT https://mybug.mysql.oraclecorp.com/orabugs/site/bug.php?id=11758237 This is quite a long history, and I have not re-lived it all now, but what I gather from skimming is that the problems were related to statement based replication, where it was crucial that conflicts on unique columns are determined and resolved same way on the Leader as on Replica. According to https://dev.mysql.com/doc/refman/8.4/en/binary-log-setting.html there was never a guarantee for statement-based replication to work fine unless you use REPEATABLE READ or SERIALIZABLE: > If you are using InnoDB tables and the transaction isolation level is READ COMMITTED or READ UNCOMMITTED, only row-based logging can be used. It is possible to change the logging format to STATEMENT, but doing so at runtime leads very rapidly to errors because InnoDB can no longer perform inserts. This I think explains, or at least provides the context, for why the scope of that fix was narrowed down only to REPEATABLE READ or SERIALIZABLE. As for your explanation, I think it is mostly right, although some of your ascii arts suggest a wrong mental picture, in which "gaps" are somehow tied to non-exiting keys, almost 1:1 (which could be an 'infinite' number of gaps in case of strings) so instead of: ``` [1, a] [2, b] { } "gap 3" { } "gap 4" [5, e] [6, f] (supremum) ``` a better mental model is that of points and ranges between them: ``` (page's infimum) { } "gap before 1" [1, a] { } "gap before 2" [2, b] { } "gap before 5" [5, e] { } "gap before 6" [6, f] { } "gap before supremum of the page = after 6" (page's supremum) ``` in particular there is just one gap between 2 and 5. A situation might be later complicated by presence of delete-marked records which also subdivide the axis if they physically exist.
[4 Dec 2024 17:38]
Daniel Nichter
Thank you, Jakub. I think we can conclude that this isn't a bug; rather, it's a grey zone wrt locking and a bad access pattern: app causing frequent unique secondary index key collisions. Since it's not technically wrong, I don't expect it to be changed any time soon since changing things in this area is difficult and risky. But at least we now understand what's going on and why.