Bug #115511 Inplace ALTER TABLE might fail with duplicate key error if concurrent insertions
Submitted: 4 Jul 11:46 Modified: 1 Aug 15:10
Reporter: Dmitry Lenev (OCA)
Status: Verified
Category: MySQL Server: InnoDB storage engine    Severity: S3 (Non-critical)
Version: 8.0    OS: Any
Assigned to:    CPU Architecture: Any

[4 Jul 11:46] Dmitry Lenev
Description:
An ALTER TABLE that rebuilds an InnoDB table using the INPLACE algorithm might occasionally fail with an unwarranted duplicate primary key error if there are concurrent insertions into the table, even though these insertions do not cause any PK conflict.

This issue is different from the one described in bug #98600 "Optimize table fails with duplicate entry on UNIQUE KEY": unlike in the #98600 case, the concurrent INSERTs causing the problem use non-conflicting PK values and thus do not cause PK violations even temporarily!

Users have observed this problem in the wild, and I can repeat it non-deterministically by running a concurrent workload involving INSERTs and ALTER TABLEs.

However, it is much easier to reproduce and analyze using a deterministic test case based on the DEBUG and DEBUG_SYNC facilities, so I will provide this test case instead of the non-deterministic / highly concurrent one.

How to repeat:
Introduce a new debug sync point by applying the following patch (based on MySQL 8.0.38) and building a debug version of the server:

diff --git a/storage/innobase/ddl/ddl0par-scan.cc b/storage/innobase/ddl/ddl0par-scan.cc
index f7f948a0f74..f334a7d2f5c 100644
--- a/storage/innobase/ddl/ddl0par-scan.cc
+++ b/storage/innobase/ddl/ddl0par-scan.cc
@@ -237,6 +237,7 @@ dberr_t Parallel_cursor::scan(Builders &builders) noexcept {
             thread_ctx->get_state() != Parallel_reader::State::THREAD) {
           thread_ctx->savepoint();
           latches_released = true;
+          DEBUG_SYNC_C("ddl_bulk_inserter_latches_released");
         }
         return DB_SUCCESS;
       });

Run the following test under the MTR framework against the newly built debug server:

CREATE TABLE t1 (pk CHAR(5) PRIMARY KEY);
INSERT INTO t1 VALUES ('aaaaa'), ('bbbbb'), ('ccccc'), ('ddddd'), ('eeeee');

--connect (con1, localhost, root,,)
# We use the following debug flag to simulate overflow of the key buffer in
# the InnoDB online ALTER TABLE implementation.
SET DEBUG='+d,ddl_buf_add_two';
SET DEBUG_SYNC='ddl_bulk_inserter_latches_released SIGNAL latches_released WAIT_FOR go';
--send ALTER TABLE t1 ENGINE=InnoDB, ALGORITHM=INPLACE

--connection default
SET DEBUG_SYNC='now WAIT_FOR latches_released';
INSERT INTO t1 VALUES ('ccaaa');
SET DEBUG_SYNC='now SIGNAL go';

--connection con1
# Instead of succeeding, ALTER TABLE fails with an
# "ERROR 1062 (23000): Duplicate entry 'ccccc' for key 't1.PRIMARY'" error.
--reap

SET DEBUG='-d,ddl_buf_add_two';

--connection default
DROP TABLE t1;

Suggested fix:
It seems that the problem was introduced by WL#14283 "Parallel create index implementation" in 8.0.

AFAICS it is related to the way the parallel read code used by online ALTER TABLE saves/restores the cursor position while scanning the old version of the table when it needs to flush changes to the new version:

void PCursor::savepoint() noexcept {
  /* Store the cursor position on the previous user record on the page. */
  m_pcur->move_to_prev_on_page();

  m_pcur->store_position(m_mtr);

  m_mtr->commit();
}

void PCursor::resume() noexcept {
  m_mtr->start();

  m_mtr->set_log_mode(MTR_LOG_NO_REDO);

  /* Restore position on the record, or its predecessor if the record
  was purged meanwhile. */

  restore_position();

  if (!m_pcur->is_after_last_on_page()) {
    /* Move to the successor of the saved record. */
    m_pcur->move_to_next_on_page();
  }
}

For example, what happens in the above test case is:
1) ALTER TABLE tries to save the cursor position while processing the 'ccccc' record, and saves the position of the previous record on the page (i.e. 'bbbbb'),
2) then, after flushing the buffered inserts into the new table version, it restores the cursor to point to 'bbbbb',
3) and then moves it to the next record, intending to make it point to 'ccccc'; however, thanks to the concurrent insert it ends up pointing to the 'ccaaa' record,
4) which means that the code will process 'ccccc' once again, causing the duplicate key error.
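
To make the off-by-one concrete, here is a minimal standalone model of the save-prev/restore/move-next sequence. This is plain C++ over a sorted set standing in for the index page, not InnoDB code; the "scan continues from the successor of the restored position" convention is an assumption matching the description above:

#include <iostream>
#include <iterator>
#include <set>
#include <string>

int main() {
  // The index page before the ALTER TABLE releases its latches.
  std::set<std::string> page{"aaaaa", "bbbbb", "ccccc", "ddddd", "eeeee"};

  // The cursor sits on 'ccccc', which has just been added to the sort
  // buffer, when the buffer overflows and latches must be released.
  auto cur = page.find("ccccc");

  // savepoint(): step back and remember the *previous* record ('bbbbb').
  std::string saved = *std::prev(cur);

  // Concurrent INSERT while the latches are released: 'ccaaa' lands
  // between the saved record 'bbbbb' and the current record 'ccccc'.
  page.insert("ccaaa");

  // resume(): restore to the saved record, then move one step forward,
  // intending to land back on the already-processed 'ccccc'...
  cur = std::next(page.find(saved));
  std::cout << "cursor after resume: " << *cur << "\n";  // 'ccaaa'

  // ...but the cursor points to 'ccaaa'. The scan treats it as the last
  // processed record and continues from its successor, so 'ccccc' is fed
  // to the index builder a second time (the spurious duplicate-key
  // error), while the newly inserted 'ccaaa' is skipped entirely.
  std::cout << "next record processed: " << *std::next(cur) << "\n";  // 'ccccc'
  return 0;
}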
[4 Jul 13:12] MySQL Verification Team
Hi Mr. Lenev,

Thank you very much for your bug report.

We had already built a debug version of mysql-8.0.38 (with memory checking included). Hence, we just introduced the said DEBUG_SYNC_C point and ran your test case, which indeed produced:

ERROR 1062 (23000): Duplicate entry 'ccccc' for key 't1.PRIMARY'

This is now a verified bug for 8.0 and higher supported versions.

We thank you very much for your patch contribution.

Verified.
[4 Jul 14:16] MySQL Verification Team
Hi Dmitry,

This is the full output from the test run:

------------------------------------------------------------------

==============================================================================
                  TEST NAME                       RESULT  TIME (ms) COMMENT
------------------------------------------------------------------------------
[ 50%] main.dmitry_115511                        [ fail ]
        Test ended at 2024-07-04 17:14:14

CURRENT_TEST: main.dmitry_115511
mysqltest: At line 20: Query 'reap' failed.
ERROR 1062 (23000): Duplicate entry 'ccccc' for key 't1.PRIMARY'
safe_process[33185]: Child process: 33186, exit: 1

Mysqltest client output from logfile
----------- MYSQLTEST OUTPUT START -----------

CREATE TABLE t1 (pk CHAR(5) PRIMARY KEY);
INSERT INTO t1 VALUES ('aaaaa'), ('bbbbb'), ('ccccc'), ('ddddd'), ('eeeee');
SET DEBUG='+d,ddl_buf_add_two';
SET DEBUG_SYNC='ddl_bulk_inserter_latches_released SIGNAL latches_released WAIT_FOR go';
ALTER TABLE t1 ENGINE=InnoDB, ALGORITHM=INPLACE;
SET DEBUG_SYNC='now WAIT_FOR latches_released';
INSERT INTO t1 VALUES ('ccaaa');
SET DEBUG_SYNC='now SIGNAL go';

------------ MYSQLTEST OUTPUT END -----------

-----------------------------------------------------------------
[16 Jul 17:43] Jakub Lopuszanski
This is something I was also wondering about in another context:
https://gerrit.mysql.oraclecorp.com/c/MySQL/+/16339
In May 2024 I found that `dberr_t PCursor::move_to_next_block(dict_index_t *index)` sometimes calls `savepoint()` while the current position of the cursor is somewhere in the middle of a page.
I wrote:
"""
I think this PCursor has a bug in that it doesn't properly handle the case where new records were appended to the current page while the cursor was being saved & restored.

A few lines above we call savepoint() after asserting that ut_ad(m_pcur->is_after_last_on_page());.
Here we don't have such an assert.
Probably because it isn't certain that it would hold - it probably holds most of the time, but what happens if some other thread appends new records to the current page?
If we are seeing err == DB_LOCK_NOWAIT, then it means that we have returned from PCursor::move_to_user_rec() from PCursor::restore_from_savepoint() after it has called PCursor::resume(). It's a bit difficult to tell for certain whether the cursor is at a user record or after the last record, because both PCursor::resume() and PCursor::restore_position(), which is called by it, try to advance the cursor forward - so perhaps it does point "after last record", or maybe not; it depends on how many new records were inserted while we weren't holding a latch, I think.
If it is pointing to a record (as opposed to: after the last record) then there's a small problem here: savepoint() will back off by one record as it calls m_pcur->move_to_prev_on_page().
Also, PCursor::move_to_user_rec() is written as if it assumed that the cursor must point after the last record, because it immediately jumps to the next page, without checking at all whether the current page has any more records.
Therefore I get the impression that this save & restore procedure can (depending on what other threads are doing):
a) miss some records which were appended in between?
b) perhaps process the same record twice in case of retry?
"""
Basically, I wonder why PCursor::savepoint() and PCursor::resume() aren't doing the obvious thing (storing the current position, restoring it), but instead seem to store prev(current) and then restore next(saved). This isn't equivalent if the table is being modified...
Perhaps the assumption was that current==Supremum record, and thus it can't be "saved" because it has no reasonable "value".
But:
1. As we can see the cursor sometimes points to a regular record. Looks like you've found more instances of this.
2. The design would be cleaner if it were the caller's responsibility to adjust the cursor to point to a real record.

My mental model of what it would mean to "pause a scan" and then "resume it" is that one has to think in terms of a "key-space" axis (as opposed to actual records in the actual index): there is some range of the key-space which was already seen/covered, and some range "to be seen", so we need to store the point in the key-space from which we should resume.
Moreover, we have to be precise about the ranges being open/closed, and somehow take into account the fact that we can't manufacture arbitrary points in key-space - we can only copy an actual value we have seen in the database and store that.
Therefore, I think the only reasonable approach is to store the value of the most recently processed user record, and resume from the next record after it.
By "after it", I mean that if it still exists, then we take the next record, and if it doesn't exist, then we take the first record after the "hole".
Basically, we search for a record strictly greater than the saved value; a sketch of this follows below.
This way the semantics of the stored value would be that "the range that was seen/processed is everything up to and including the stored point".
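
A minimal sketch of this "resume strictly after the saved value" semantics, with a std::set standing in for the index (a hypothetical illustration, not the actual PCursor API):

#include <iostream>
#include <set>
#include <string>

int main() {
  std::set<std::string> index{"aaaaa", "bbbbb", "ccccc", "ddddd"};

  // Save the last *processed* record itself, not its neighbour.
  std::string saved = "bbbbb";

  // Concurrent activity while the latches are released: the saved
  // record is purged and a new record appears right after the "hole".
  index.erase("bbbbb");
  index.insert("bbbzz");

  // Resume at the first record strictly greater than the saved value:
  // nothing already processed is repeated, nothing new is skipped.
  auto it = index.upper_bound(saved);
  std::cout << "resume at: " << *it << "\n";  // 'bbbzz'
  return 0;
}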
Perhaps the idea was that the cursor is meant to point to a row which wasn't processed yet when we call savepoint()?
Then, indeed, we should back off one position when storing, and advance forward one position when restoring.
But your analysis seems to show that at least in this place in the code we are calling it when the cursor points to an already processed row.
[16 Jul 20:19] Dmitry Lenev
Hello Jakub!

Thank you for sharing your thoughts!!!

Actually, it seems that there is one more nasty case which can happen when PCursor::savepoint()/resume() are called:

c) It can miss a record which was on the page even before the savepoint was taken.

This can happen when the savepoint is taken while the cursor is pointing at a user record (which happens when the parallel scan code calls the bulk_inserter() closure from ddl0par-scan.cc), and the preceding record, whose position is what the savepoint() call really stores, is purged. In this case PCursor::resume() will do 2 steps forward instead of 1 and thus will advance the cursor one record too far.

I was able to reproduce this issue and reported it as bug #115608.
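
The same kind of standalone model illustrates the skip, assuming (per the observation above that both resume() and restore_position() try to advance the cursor forward) that the restore ends up past the purged position before resume() advances again:

#include <iostream>
#include <iterator>
#include <set>
#include <string>

int main() {
  std::set<std::string> page{"aaaaa", "bbbbb", "ccccc", "ddddd", "eeeee"};

  // savepoint() with the cursor on user record 'ccccc' stores the
  // position of the preceding record, 'bbbbb'.
  auto cur = page.find("ccccc");
  std::string saved = *std::prev(cur);

  // The stored record is purged while the latches are released.
  page.erase(saved);

  // Restore: the stored record is gone, so the cursor ends up advanced
  // to the next existing record ('ccccc')...
  cur = page.upper_bound(saved);

  // ...and resume() advances once more, to 'ddddd': two steps forward.
  cur = std::next(cur);
  std::cout << "cursor after resume: " << *cur << "\n";  // 'ddddd'

  // The scan continues from the successor of the cursor, so 'ddddd'
  // is never copied to the new table.
  std::cout << "next record processed: " << *std::next(cur) << "\n";  // 'eeeee'
  return 0;
}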

Regarding why this move_to_prev_on_page()/move_to_next_on_page() pair was added in the first place, I have two ideas:

1) First, if you take a look at the row_merge_read_clustered_index() code in 5.7 (which more or less corresponds to the code we are discussing in 8.0), you will see that there is a case where we call btr_pcur_restore_position() after calling row_merge_insert_index_tuples(). In this case we actually use the fact that the position of the previous record was stored in order to redo processing of the last row, since it is thrown away when we discover that the in-memory buffer is too full.
We don't move forward on restore in this case! So the fact that we step back is actually used there. This logic is different in 8.0, where the record during whose processing we discover that the in-memory buffer is full is not discarded, but added to the buffer after its flush.

2) I think, indeed, there is a problem with saving a cursor pointing to the supremum and infimum records. AFAICS, if a cursor pointing to the supremum is saved and then restored using the optimistic branch of restore, we can encounter a situation where some records from the next page are moved over the cursor during a page merge and thus missed by the scan. We can observe a similar situation with a savepoint on the infimum record in bug #114248 and bug #115518.
[16 Jul 20:36] Dmitry Lenev
Hello Jakub!

Regarding the mental model, I mostly agree with you (there are some details which need to be figured out carefully though - e.g. btr_pcur_t::restore_position() restores the position to the saved record, or to the previous one if the saved record was purged, and page boundaries/supremum probably need special handling).
[1 Aug 15:07] Dmitry Lenev
Tentative fix for the duplicate rows problem, on top of the fix for bug #115608, against MySQL 8.0.39 sources.

(*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: ps-9214-bug115511-contrib.patch (text/x-patch), 14.32 KiB.

[1 Aug 15:08] Dmitry Lenev
Hello!

I am contributing my tentative fix for this bug.

The patch is on top of my earlier contribution (the fix for bug #115608) and is against MySQL Server 8.0.39 sources.
[1 Aug 15:09] Dmitry Lenev
Cumulative fix for bug #115608 and bug #115511 against MySQL 8.0.39.

(*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: ps9144-ps9214-bug115608-bug115511-contrib.patch (text/x-patch), 17.35 KiB.

[1 Aug 15:10] Dmitry Lenev
I have also added a cumulative patch containing the fixes for both bug #115511 and bug #115608 against MySQL 8.0.39 sources.

Best regards,
Dmitry
[1 Aug 15:27] MySQL Verification Team
Thank you, Mr. Lenev, for your contribution!