commit c32fa9cff52dbc5a958d7ad5844f9ba38b5b5b4b
Author: Dmitry Lenev
Date:   Tue Apr 29 11:36:38 2025 +0200

    Bug#117436 "PCursor::move_to_next_block may skip records incorrectly".

    Problem:
    --------
    ALTER TABLE which rebuilds an InnoDB table using the INPLACE algorithm
    might sometimes lead to row loss if a concurrent purge happens on the
    table being ALTERed.

    Analysis:
    ---------
    This issue was introduced in version 8.0.41 as an unwanted side-effect
    of the fixes for bug#115608 (in which a similar problem was observed,
    but in a different scenario) and bug#115511.

    A new implementation of parallel ALTER TABLE INPLACE in InnoDB was
    introduced in MySQL 8.0.27. Its code is used for online table rebuild
    even in the single-thread case. This implementation iterates over all
    the rows in the table, in the general case handling different subtrees
    of the B-tree in different threads.

    This iteration over table rows needs to be paused from time to time to
    commit the InnoDB mini-transaction/release the page latches it holds.
    This is necessary to give way to concurrent actions on the scanned
    B-tree, and is done when switching to the next page or before flushing
    rows of the new version of the table from the in-memory buffer to the
    B-tree. To resume the iteration after such a pause, the persistent
    cursor position saved before the pause is restored.

    The problem described above occurs when we try to save and then
    restore the position of a cursor pointing to the page supremum before
    switching to the next page. In post-8.0.41 code this is done by simply
    calling the btr_pcur_t::store_position()/restore_position() methods
    for the cursor pointing to the supremum. In 8.0.42-based code this is
    done by the PCursor::save_previous_user_record_as_last_processed() and
    PCursor::restore_to_first_unprocessed() pair of methods.

    However, this does not work correctly in the scenario where, after we
    have saved the cursor position and committed the mini-transaction/
    released the latches on the current page, the next page is merged into
    the current one (after purge removes some records from it). In this
    case the cursor position is still restored as pointing to the page
    supremum, and thus rows which were moved over the supremum by the
    merge are erroneously skipped.

    ***

    Let us take a look at an example. Assume that we have two pages,
    p1 : [inf, a, b, c, sup] and the next one, p2 : [inf, d, e, f, sup].

    Our thread, which is one of the parallel ALTER TABLE worker threads,
    has completed the scan of p1, so its cursor is positioned on the
    p1 : 'sup' record. Now it needs to switch to page p2, but also give
    way to threads concurrently updating the table. So it needs to make a
    cursor savepoint, commit the mini-transaction, and release the
    latches.

    Since in post-8.0.41 code we simply do btr_pcur_t::store_position()/
    restore_position() with the cursor positioned on the p1 : 'sup'
    record, the following might happen: a concurrent purge on page p2
    might delete some record from it (e.g. 'f') and decide to merge this
    page into page p1. If this happens while the latches are released, the
    merge goes through, resulting in page p1 with the following contents:
    p1 : [inf, a, b, c, d, e, sup]. The savepoint for p1 : 'sup' won't be
    invalidated (one can say that savepoints for 'sup' and 'inf' are not
    safe against concurrent merges in this respect), and after the cursor
    is restored the iteration will continue on the next page, skipping
    records 'd' and 'e'.

    ***

    Fix:
    ----
    This patch solves the problem by working around the issue with saving/
    restoring a cursor that points to the supremum.

    Instead of storing the position of the supremum record,
    PCursor::save_previous_user_record_as_last_processed() now stores the
    position of the record that precedes it.
    PCursor::restore_to_first_unprocessed() then restores in two steps:
    1) it restores the position of this preceding record (or of its
    closest predecessor if it was purged meanwhile), and then 2) it moves
    one step forward, assuming that this gets it to the supremum record at
    which the cursor pointed originally. If this is not the case, i.e. a
    user record was added to the page by a merge (or by a simple
    concurrent insert), we assume that this record and the following ones
    are unprocessed. The caller of PCursor::restore_to_first_unprocessed()
    detects this situation by checking whether the cursor is positioned on
    the supremum, and if not, handles it by resuming processing from the
    record under the cursor.

    ***

    Let us return to the above example to explain how the fix works.

    PCursor::save_previous_user_record_as_last_processed() does a step
    back before calling btr_pcur_t::store_position(), so for the cursor
    positioned on p1 : 'sup' it is actually the position corresponding to
    p1 : 'c' that is saved. If the merge happens while the latches are
    released, we still get p1 : [inf, a, b, c, d, e, sup] and the
    savepoint is not invalidated. PCursor::restore_to_first_unprocessed()
    calls btr_pcur_t::restore_position(), which as a result gets a cursor
    pointing to p1 : 'c', and then compensates for the step back by moving
    the cursor one step forward, making it point to p1 : 'd'. The scanning
    code detects that saving/restoring resulted in a jump from the
    supremum record to a user record and resumes the iteration from
    p1 : 'd' without skipping any records.

    ***

    Thanks to the Bytedance team for providing the test case on which the
    one used in this commit is based!
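To make the difference between the old and the new save/restore strategy concrete outside of InnoDB, here is a minimal standalone C++ sketch (an editor's illustration, not InnoDB code: Page, restore_after_last() and restore_to_first_unprocessed() are hypothetical stand-ins that model a leaf page as a sorted vector of keys, with the supremum as the one-past-the-end position):

// Standalone illustration of the bug and the fix (editor's sketch, NOT
// InnoDB code): a leaf page is modeled as a sorted vector of int keys,
// and the supremum is the "one past the last user record" position.
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <vector>

using Page = std::vector<int>;

// Pre-fix idea: remember "after the last user record" (BTR_PCUR_AFTER).
// After a merge moves records onto this page, this still restores to the
// supremum, so the moved records are never visited.
std::size_t restore_after_last(const Page &page) { return page.size(); }

// Post-fix idea: remember the last processed user record itself; on
// restore, find it (or its closest predecessor if it was purged) and step
// forward to the first record the scan has not processed yet.
std::size_t restore_to_first_unprocessed(const Page &page,
                                         int last_processed) {
  auto it = std::upper_bound(page.begin(), page.end(), last_processed);
  return static_cast<std::size_t>(it - page.begin());
}

int main() {
  Page p1 = {1, 2, 3};  // fully processed; cursor sits on the supremum
  int last_processed = p1.back();

  // Latches are released; purge merges the next page [4, 5] into p1.
  Page merged = {1, 2, 3, 4, 5};

  std::size_t buggy = restore_after_last(merged);
  std::size_t fixed = restore_to_first_unprocessed(merged, last_processed);

  // buggy == merged.size(): cursor is on the supremum again, so the scan
  // switches to the next page and keys 4 and 5 are silently skipped.
  std::cout << "pre-fix:  restored to index " << buggy << " (supremum)\n";
  // fixed == 3: cursor is on key 4, the first unprocessed record.
  std::cout << "post-fix: restored to index " << fixed << " (key "
            << merged[fixed] << ")\n";
  return 0;
}

Running this prints that the pre-fix strategy restores to index 5 (the supremum of the merged page), while the post-fix strategy restores to index 3, i.e. key 4, the first record the scan has not yet processed.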
diff --git a/mysql-test/suite/innodb/r/alter_table_rebuild_missing_record.result b/mysql-test/suite/innodb/r/alter_table_rebuild_missing_record.result
index f1065af246f..b6cb99dddce 100644
--- a/mysql-test/suite/innodb/r/alter_table_rebuild_missing_record.result
+++ b/mysql-test/suite/innodb/r/alter_table_rebuild_missing_record.result
@@ -172,3 +172,90 @@ Warnings:
 Note 1051 Unknown table 'test.t1'
 DROP PROCEDURE test.insert_records;
 disconnect con1;
+#
+# Bug#117436 "PCursor::move_to_next_block may skip records incorrectly".
+#
+# Execution of ALTER TABLE INPLACE might sometimes result in row
+# loss when merge of pages in table's B-tree is happening concurrently.
+#
+# Rows of the table should be big enough to easily cause page split,
+# OTOH they should be small enough to be easily merged back as well.
+# In our case:
+# 5 * 3300 > available space for rows in 16k page,
+# 2 * 3300 < 8k which is merge threshold for 16k page.
+CREATE TABLE t1(id INT PRIMARY KEY, val VARCHAR(3300));
+INSERT INTO t1 VALUES (1, REPEAT('1', 3300));
+INSERT INTO t1 VALUES (2, REPEAT('2', 3300));
+INSERT INTO t1 VALUES (3, REPEAT('3', 3300));
+INSERT INTO t1 VALUES (4, REPEAT('4', 3300));
+INSERT INTO t1 VALUES (5, REPEAT('5', 3300));
+INSERT INTO t1 VALUES (7, REPEAT('7', 3300));
+#
+# At this point leaf pages of B-tree for our table look like:
+#
+# [1,2] - [3,4,5,7]
+#
+# The below insert will cause split of the second leaf page, so
+# we end up with the following picture of leaf pages after it:
+#
+# [1,2] - [3,4] - [5,6,7]
+INSERT INTO t1 VALUES (6, REPEAT('6', 3300));
+#
+# Delete marks middle row in the third leaf page as deleted.
+# Once purge is enabled, the row will be removed from the page
+# for real, causing the contents of the third leaf page to
+# be merged into the second:
+#
+# [1,2] - [3,4] - [5,*6,7] => [1,2] - [3,4,5,7]
+SET GLOBAL innodb_purge_stop_now = ON;
+DELETE FROM t1 WHERE id=6;
+connect con1, localhost, root,,;
+# Make ALTER TABLE process all leaf pages as a single scan.
+SET GLOBAL DEBUG="+d,parallel_reader_force_single_range";
+# Ensure that we always release latches when moving from
+# page to page.
+SET GLOBAL DEBUG="+d,pcursor_move_to_next_block_release_latches";
+SET DEBUG_SYNC="pcursor_move_to_next_block_latches_released SIGNAL latches_released WAIT_FOR continue EXECUTE 2";
+# Send ALTER TABLE t1 ENGINE=InnoDB, ALGORITHM=INPLACE
+ALTER TABLE t1 ENGINE=InnoDB, ALGORITHM=INPLACE;
+connection default;
+# Wait until we reach end of the first leaf page:
+# [1,2 ] - [3, 4] - ...
+# And signal ALTER TABLE to proceed to the next leaf page.
+SET DEBUG_SYNC="now WAIT_FOR latches_released";
+SET DEBUG_SYNC="now SIGNAL continue";
+# Wait until we reach end of the second leaf page:
+# [1,2] - [3,4 ] - [5,*6,7]
+SET DEBUG_SYNC="now WAIT_FOR latches_released";
+# Unleash the purge and wait till it completes, so row 6 is
+# removed for real and the third leaf page is merged into the
+# second one.
+SET GLOBAL innodb_purge_run_now = ON;
+#
+# We should end up with:
+# [1,2] - [3,4, 5,7]
+#
+# While before the fix we got:
+# [1,2] - [3,4,5,7 ]
+#
+# Resume ALTER TABLE.
+SET DEBUG_SYNC="now SIGNAL continue";
+connection con1;
+# Reap ALTER TABLE.
+# Check that all rows are present in table (except deleted row 6).
+# Before the fix rows 5 and 7 were missing as well.
+SELECT id FROM t1;
+id
+1
+2
+3
+4
+5
+7
+# Clean up.
+SET GLOBAL DEBUG="-d,pcursor_move_to_next_block_release_latches";
+SET GLOBAL DEBUG="-d,parallel_reader_force_single_range";
+SET DEBUG_SYNC='RESET';
+disconnect con1;
+connection default;
+DROP TABLE t1;
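As a side note, the row-sizing argument in the comments above can be checked with a quick computation (an editor's sketch; the 50% merge threshold and the neglected per-record overhead are approximations of InnoDB's defaults for 16k pages, not exact figures):

// Quick check of the sizing argument used by the test (editor's sketch;
// "available space" and the 50% merge threshold approximate InnoDB
// defaults for 16k pages, ignoring record and page header overhead).
#include <iostream>

int main() {
  const int page = 16 * 1024;            // innodb_page_size = 16k
  const int merge_threshold = page / 2;  // default MERGE_THRESHOLD is 50%
  const int row = 3300;                  // VARCHAR payload per row

  std::cout << std::boolalpha;
  // Five rows cannot fit on one page, so inserts keep splitting pages:
  std::cout << (5 * row > page) << '\n';             // true: 16500 > 16384
  // Two rows fall below half a page, so such a page can be merged:
  std::cout << (2 * row < merge_threshold) << '\n';  // true: 6600 < 8192
  return 0;
}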
diff --git a/mysql-test/suite/innodb/t/alter_table_rebuild_missing_record.test b/mysql-test/suite/innodb/t/alter_table_rebuild_missing_record.test
index ed1fc55902b..dc4ddd23f0b 100644
--- a/mysql-test/suite/innodb/t/alter_table_rebuild_missing_record.test
+++ b/mysql-test/suite/innodb/t/alter_table_rebuild_missing_record.test
@@ -240,6 +240,107 @@ SET GLOBAL innodb_limit_optimistic_insert_debug=0;
 DROP TABLE IF EXISTS t1;
 DROP PROCEDURE test.insert_records;
 --disconnect con1
+
+--echo #
+--echo # Bug#117436 "PCursor::move_to_next_block may skip records incorrectly".
+--echo #
+--echo # Execution of ALTER TABLE INPLACE might sometimes result in row
+--echo # loss when merge of pages in table's B-tree is happening concurrently.
+
+--echo #
+--echo # Rows of the table should be big enough to easily cause page split,
+--echo # OTOH they should be small enough to be easily merged back as well.
+--echo # In our case:
+--echo # 5 * 3300 > available space for rows in 16k page,
+--echo # 2 * 3300 < 8k which is merge threshold for 16k page.
+CREATE TABLE t1(id INT PRIMARY KEY, val VARCHAR(3300));
+INSERT INTO t1 VALUES (1, REPEAT('1', 3300));
+INSERT INTO t1 VALUES (2, REPEAT('2', 3300));
+INSERT INTO t1 VALUES (3, REPEAT('3', 3300));
+INSERT INTO t1 VALUES (4, REPEAT('4', 3300));
+INSERT INTO t1 VALUES (5, REPEAT('5', 3300));
+INSERT INTO t1 VALUES (7, REPEAT('7', 3300));
+
+--echo #
+--echo # At this point leaf pages of B-tree for our table look like:
+--echo #
+--echo # [1,2] - [3,4,5,7]
+
+--echo #
+--echo # The below insert will cause split of the second leaf page, so
+--echo # we end up with the following picture of leaf pages after it:
+--echo #
+--echo # [1,2] - [3,4] - [5,6,7]
+
+INSERT INTO t1 VALUES (6, REPEAT('6', 3300));
+
+--echo #
+--echo # Delete marks middle row in the third leaf page as deleted.
+--echo # Once purge is enabled, the row will be removed from the page
+--echo # for real, causing the contents of the third leaf page to
+--echo # be merged into the second:
+--echo #
+--echo # [1,2] - [3,4] - [5,*6,7] => [1,2] - [3,4,5,7]
+
+SET GLOBAL innodb_purge_stop_now = ON;
+DELETE FROM t1 WHERE id=6;
+
+--connect(con1, localhost, root,,)
+--echo # Make ALTER TABLE process all leaf pages as a single scan.
+SET GLOBAL DEBUG="+d,parallel_reader_force_single_range";
+--echo # Ensure that we always release latches when moving from
+--echo # page to page.
+SET GLOBAL DEBUG="+d,pcursor_move_to_next_block_release_latches";
+SET DEBUG_SYNC="pcursor_move_to_next_block_latches_released SIGNAL latches_released WAIT_FOR continue EXECUTE 2";
+
+--echo # Send ALTER TABLE t1 ENGINE=InnoDB, ALGORITHM=INPLACE
+--send ALTER TABLE t1 ENGINE=InnoDB, ALGORITHM=INPLACE
+
+--connection default
+--echo # Wait until we reach end of the first leaf page:
+--echo # [1,2 ] - [3, 4] - ...
+--echo # And signal ALTER TABLE to proceed to the next leaf page.
+SET DEBUG_SYNC="now WAIT_FOR latches_released";
+SET DEBUG_SYNC="now SIGNAL continue";
+
+--echo # Wait until we reach end of the second leaf page:
+--echo # [1,2] - [3,4 ] - [5,*6,7]
+SET DEBUG_SYNC="now WAIT_FOR latches_released";
+
+--echo # Unleash the purge and wait till it completes, so row 6 is
+--echo # removed for real and the third leaf page is merged into the
+--echo # second one.
+SET GLOBAL innodb_purge_run_now = ON;
+--source include/wait_innodb_all_purged.inc
+
+--echo #
+--echo # We should end up with:
+--echo # [1,2] - [3,4, 5,7]
+--echo #
+--echo # While before the fix we got:
+--echo # [1,2] - [3,4,5,7 ]
+--echo #
+--echo # Resume ALTER TABLE.
+SET DEBUG_SYNC="now SIGNAL continue";
+
+--connection con1
+--echo # Reap ALTER TABLE.
+--reap
+
+--echo # Check that all rows are present in table (except deleted row 6).
+--echo # Before the fix rows 5 and 7 were missing as well.
+SELECT id FROM t1;
+
+--echo # Clean up.
+SET GLOBAL DEBUG="-d,pcursor_move_to_next_block_release_latches";
+SET GLOBAL DEBUG="-d,parallel_reader_force_single_range";
+SET DEBUG_SYNC='RESET';
+
+--disconnect con1
+
+--connection default
+DROP TABLE t1;
+
 --disable_connect_log
 # Wait till all disconnects are completed
 --source include/wait_until_count_sessions.inc
diff --git a/storage/innobase/row/row0pread.cc b/storage/innobase/row/row0pread.cc
index 43dd5e5188b..d47b01c35a8 100644
--- a/storage/innobase/row/row0pread.cc
+++ b/storage/innobase/row/row0pread.cc
@@ -231,7 +231,7 @@ class PCursor {
   /** This method must be called after all records on a page are processed and
   cursor is positioned at supremum. Under this assumption, it stores the
-  position BTR_PCUR_AFTER the last user record on the page.
+  position of the last user record on the page.
 
   This method must be paired with restore_to_first_unprocessed() to restore
   to a record which comes right after the value of the stored last processed
   record @see restore_to_first_unprocessed for details. */
@@ -433,13 +433,24 @@ void PCursor::restore_to_last_processed_user_record() noexcept {
 void PCursor::save_previous_user_record_as_last_processed() noexcept {
   ut_a(m_pcur->is_after_last_on_page());
   ut_ad(m_read_level == 0);
+  /*
+    Instead of simply taking a savepoint for the cursor pointing to the
+    supremum, we do one step back. This is necessary to prevent a situation
+    where a concurrent merge from the next page, happening after we commit
+    the mini-transaction/release the latches, moves records over the
+    supremum to the current page. In that case the optimistic restore would
+    again result in a cursor pointing to the supremum, which means the
+    moved records would be incorrectly skipped by the scan.
+
+    So, effectively, we save the position of the last user record which we
+    have processed, i.e. the last key value we have processed.
+  */
+  m_pcur->move_to_prev_on_page();
   m_pcur->store_position(m_mtr);
-  ut_a(m_pcur->m_rel_pos == BTR_PCUR_AFTER);
   m_mtr->commit();
 }
 
 void PCursor::restore_to_first_unprocessed() noexcept {
-  ut_a(m_pcur->m_rel_pos == BTR_PCUR_AFTER);
   ut_ad(m_read_level == 0);
   m_mtr->start();
   m_mtr->set_log_mode(MTR_LOG_NO_REDO);
@@ -448,13 +459,21 @@ void PCursor::restore_to_first_unprocessed() noexcept {
   /* Restored cursor is positioned on the page at the level intended before */
   ut_ad(m_read_level == btr_page_get_level(m_pcur->get_page()));
 
-  /* The BTR_PCUR_IS_POSITIONED_OPTIMISTIC only happens in case of a successful
-  optimistic restoration in which the cursor points to a user record after
-  restoration. But, in save_previous_user_record_as_last_processed() the cursor
-  pointed to SUPREMUM before calling m_ptr->store_position(mtr), so it would
-  also point there if optimistic restoration succeeded, which is not a user
-  record. */
-  ut_ad(m_pcur->m_pos_state != BTR_PCUR_IS_POSITIONED_OPTIMISTIC);
+  /*
+    The cursor points to the last processed record or, if it has been purged
+    meanwhile, to its closest non-purged predecessor.
+    By moving to the successor of the saved record we position the cursor
+    either on the supremum record (which means we have restored the original
+    cursor position and can continue the switch to the next page as usual)
+    or on some user record which our scan has not processed yet (for
+    example, this record might have been moved from the next page due to a
+    page merge, or simply inserted into our page concurrently).
+
+    The latter case is detected by the caller via the
+    !is_after_last_on_page() check; instead of switching to the next page,
+    the caller then continues processing from the restored user record.
+  */
+  m_pcur->move_to_next_on_page();
 }
 
 bool PCursor::restore_to_largest_le_position_saved() noexcept {
@@ -1325,6 +1344,9 @@ dberr_t Parallel_reader::Scan_ctx::create_ranges(const Scan_range &scan_range,
     start = nullptr;
 
     page_cur_move_to_next(&page_cursor);
+
+    DBUG_EXECUTE_IF("parallel_reader_force_single_range",
+                    page_cur_set_after_last(block, &page_cursor););
   }
 
   savepoints.push_back(savepoint);
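Finally, here is a sketch of the caller-side contract that the new comment in restore_to_first_unprocessed() describes (an editor's illustration with a hypothetical CursorStub, not the actual Parallel_reader/PCursor code): after the two-step restore the scan may no longer assume it is at the supremum, and must check before switching pages.

// Editor's sketch of the caller-side check described above (hypothetical
// stub, NOT the actual Parallel_reader/PCursor code): after the two-step
// restore, the scan must verify that the cursor really came back to the
// supremum before it decides to switch to the next page.
#include <iostream>

struct CursorStub {
  bool on_supremum;  // where the two-step restore left the cursor
  bool is_after_last_on_page() const { return on_supremum; }
};

void resume_after_latch_release(const CursorStub &pcur) {
  if (!pcur.is_after_last_on_page()) {
    // A concurrent merge (or insert) put user records where the supremum
    // used to be: resume processing from the record under the cursor.
    std::cout << "resume from the restored user record\n";
  } else {
    // Nothing moved over the supremum: switch to the next page as before.
    std::cout << "switch to the next page\n";
  }
}

int main() {
  resume_after_latch_release({false});  // merge happened while unlatched
  resume_after_latch_release({true});   // page content unchanged
  return 0;
}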