MySQL Bugs: #117898: User Records may be skipped when Persistent Cursor restores Infimum and Supremum Record

Bug #117898	User Records may be skipped when Persistent Cursor restores Infimum and Supremum Record
Submitted:	7 Apr 8:26	Modified:	1 May 14:38
Reporter:	Xizhe Zhang (OCA)	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: InnoDB storage engine	Severity:	S3 (Non-critical)
Version:	8.0.41	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
When I was reading the code, I found some problems in the implementation of Persistent Cursor 'btr_pcur_t', and there were some subtle errors in the handling of Infimum Record and Supremum Record.

I will now review the implementation of the 'store_position' and 'restore_position' methods of 'btr_pcur_t'.

# store_position, set 'm_rel_pos':
-- BTR_PCUR_ON: The 'm_btr_cur' is at a User Record, store this Record to 'm_old_rec', and 'm_btr_cur' is still at this Record
-- BTR_PCUR_AFTER: The 'm_btr_cur' is at Supremum Record, store the Prev Record (User Record) to 'm_old_rec', but 'm_btr_cur' is still at Supremum Record
-- BTR_PCUR_BEFORE: The 'm_btr_cur' is at Infimum Record, store the Next Record (User Record) to 'm_old_rec', but 'm_btr_cur' is still at Infimum Record

# restore_position, process according to 'm_rel_pos':
-- BTR_PCUR_ON: The location to which it is restored depends only on whether 'm_old_rec' has been purged:
  -> Not purged: Restore to the User Record that is equal to 'm_old_rec', return  true
  -> Purged: Restore to the previous Record of 'm_old_rec' (Search Mode PAGE_CUR_LE), return false

-- BTR_PCUR_AFTER is expected to locate the Next Record of 'm_old_rec', but the actual implementation does not meet expectations:
  -> Optimistic Restore: It will directly restore to the original Supremum Record, which will cause the newly inserted Records between 'm_old_rec' and Supremum to be skipped. # Bug 1
  -> Pessimistic Restore: It will restore to the next Record of 'm_old_rec' (Search Mode PAGE_CUR_G). If this Record is a Supremum Record, it is fine. But if it is a User Record, When the 'restore_position' function finally calls 'store_position', 'm_rel_pos' will be changed to BTR_PCUR_ON instead of BTR_PCUR_AFTER. If it is a moves_up traversal, the cursor will be moved to next in 'sel_restore_position_for_mysql', and the processing of this Record will be skipped. # Bug 2

-- BTR_PCUR_BEFORE is expected to locate the Prev Record of 'm_old_rec', but the actual implementation does not meet expectations too:
  -> Optimistic Restore: It will directly restore to the original Infimum Record, which will cause the newly inserted Records between Infimum and 'm_old_rec' to be skipped. # Bug 1
  -> Pessimistic Restore: It will restore to the prev Record of 'm_old_rec' (Search Mode PAGE_CUR_L), and there is also the problem of changing 'm_rel_pos' to BTR_PCUR_ON, but because the processing of BTR_PCUR_ON and BTR_PCUR_BEFORE modes in 'sel_restore_position_for_mysql' is almost the same, there is no problem of skip Record processing.

In summary, when functions such as 'sel_restore_position_for_mysql' and 'row_sel_restore_pcur_pos', the expectations for restoring the positions of Infimum Record and Supremum Record are broken, which may cause some Record processing to be skipped. There are two bugs:
1. After Optimistic Restore, BTR_PCUR_AFTER and BTR_PCUR_BEFORE do not locate the next or previous record of 'm_old_rec', but directly use the Supremum Record or Infimum Record, which results in skipping the processing of the newly inserted Records in the middle.
2. After Pessimistic Restore, if a User Record is located, 'm_rel_pos' will be changed to BTR_PCUR_ON, which will cause an incorrect judgment when adjusting the position in 'sel_restore_position_for_mysql', and may skip the processing of a User Record.

How to repeat:
Bug 1 can be reproduced by traversing in reverse order (because 'move_backward_from_page' will store at the Infimum Record and then restore), Bug 2 only occurs on the Supremum Record. I didn't think of a good way to reproduce Bug 2, but during the debugging process of reproducing Bug 1, it can be found that 'm_rel_pos' is indeed changed to BTR_PCUR_ON.

I wrote a test case for Bug 1, but you need to add a Debug Sync Point to the code and then compile the Debug binary for testing.

diff --git a/storage/innobase/btr/btr0pcur.cc b/storage/innobase/btr/btr0pcur.cc
index 8745ead0124..557c33f0b88 100644
--- a/storage/innobase/btr/btr0pcur.cc
+++ b/storage/innobase/btr/btr0pcur.cc
@@ -35,6 +35,8 @@ this program; if not, write to the Free Software Foundation, Inc.,
 
 #include <stddef.h>
 
+#include "current_thd.h"
+#include "debug_sync.h"
 #include "rem0cmp.h"
 #include "trx0trx.h"
 #include "ut0byte.h"
@@ -375,6 +377,8 @@ void btr_pcur_t::move_backward_from_page(mtr_t *mtr) {
 
   mtr_commit(mtr);
 
+  DEBUG_SYNC(current_thd, "move_backward_from_page_sync");
+
   mtr_start(mtr);
 
   restore_position(latch_mode2, mtr, UT_LOCATION_HERE);

diff --git a/storage/innobase/include/btr0cur.ic b/storage/innobase/include/btr0cur.ic
index 360d3d02b29..4e871eef7fb 100644
--- a/storage/innobase/include/btr0cur.ic
+++ b/storage/innobase/include/btr0cur.ic
@@ -145,6 +145,9 @@ static inline bool btr_cur_can_delete_without_compress(
 
   page = btr_cur_get_page(cursor);
 
+  DBUG_EXECUTE_IF("innobase_stop_merge",
+                  if (strstr(cursor->index->table_name, "t1")) return true;);
+
   if ((page_get_data_size(page) - rec_size <
        BTR_CUR_PAGE_COMPRESS_LIMIT(cursor->index)) ||
       ((btr_page_get_next(page, mtr) == FIL_NULL) &&

The following is the content of the test case for Bug 1:

--source include/have_debug.inc
--source include/have_debug_sync.inc

--connect(con1,localhost,root)
--connect(con2,localhost,root)

--connection con1
SET transaction_isolation = "READ-UNCOMMITTED";
SET GLOBAL innodb_limit_optimistic_insert_debug = 4;
SET GLOBAL debug = "+d,innobase_stop_merge";
--sleep 5

CREATE TABLE t1 (c INTEGER NOT NULL, PRIMARY KEY (c)) ENGINE=INNODB;

--echo # Root Page: (1,3)
--echo # Leaf Page: (1,2), (3,6,8)
INSERT INTO t1 VALUES (1), (2), (3), (6), (8);

--echo # 1. Scan the table in reverse order
SELECT * FROM t1 ORDER BY c DESC;

--echo # 2. Scan the table while inserting a row, optimistic restore
--echo # Root Page: (1,3)
--echo # Leaf Page: (1,2), (6,8)
DELETE FROM t1 WHERE c = 3;
--source include/wait_innodb_all_purged.inc

SET debug_sync = "move_backward_from_page_sync SIGNAL continue_insert WAIT_FOR continue_scan";
send SELECT * FROM t1 ORDER BY c DESC;

--connection con2
SET debug_sync = "now WAIT_FOR continue_insert";
--echo # Root Page: (1,3)
--echo # Leaf Page: (1,2), (5,6,8)
INSERT INTO t1 VALUES (5);
SET debug_sync = "now SIGNAL continue_scan";

--connection con1
reap;

--echo # 3. Scan the table while inserting a row, pessimistic restore
SET debug_sync = "move_backward_from_page_sync SIGNAL continue_insert WAIT_FOR continue_scan";
send SELECT * FROM t1 ORDER BY c DESC;

--connection con2
SET debug_sync = "now WAIT_FOR continue_insert";
--echo # Root Page: (1,3,6)
--echo # Leaf Page: (1,2), (4,5), (6,7,8) // Page Split causes modify_clock+1, need to restore pessimisticly
INSERT INTO t1 VALUES (4);
INSERT INTO t1 VALUES (7);
SET debug_sync = "now SIGNAL continue_scan";

--connection con1
reap;

--echo # 4. Cleanup
SET GLOBAL innodb_limit_optimistic_insert_debug = 0;
SET GLOBAL debug = "-d,innobase_stop_merge";
DROP TABLE t1;

-------------------------------------------------------- Dividing Line

Check the result file you can find:
...
# 2. Scan the table while inserting a row, optimistic restore
...
c
8
6 // The 5 just inserted is not printed, Bug 1
2
1
# 3. Scan the table while inserting a row, pessimistic restore
...
c
8
6
5
4 // The 4 just inserted is printed, that's right
2
1

Hello Xizhe Zhang,

Thank you for the report and feedback.

regards,
Umesh

Hello, Verification Team, this is my patch. (*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: persistent_cusor_bugfix.diff (application/octet-stream, text), 12.88 KiB.

Let me explain the contents of the bugfix:

1. For Optimistic Restore, store_position will actually move the Cursor to 'm_old_rec', so that the Cursor position after optimistic restore is at 'm_old_rec'. BTR_PCUR_AFTER needs to move next once, and BTR_PCUR_BEFORE needs to move prev once, which can guarantee our expectations for these two states.
  -- At the same time, BTR_PCUR_IS_POSITIONED_OPTIMISTIC is currently a completely unused state (because the Cursor is at Infimum or Supremum when store_position is called, optimistic restore cannot be positioned on the User Record, so the judgment fails) and can be removed from the code.

2. For Pessimistic Restore, if a User Record is located, 'm_rel_pos' will be changed to BTR_PCUR_ON, which requires restoring 'm_rel_pos' back to BTR_PCUR_BEFORE or BTR_PCUR_AFTER.

Posted by developer:
 
Hello Xizhe Zhang,
this is an impressive analysis! Especially given the state of this part of the code.

Let me start, by confirming, that this code is very shaky, and arguably buggy, as evidenced by our recent efforts around
Bug#37318367 	Inplace ALTER on table wth spatial idx might cause lost rows if concurrent purge 
Bug#36846567 	Inplace ALTER TABLE might cause lost rows if concurrent purge 
Bug#35303494 	Data will be lost when the table structure is rebuilt. 
Bug#36261480 	alter table tablename engine=innodb cause data loss 
Bug#35168007 	InnoDB report Assertion failure at histogram sampler when transaction rollback 
Bug#36808088 	Inplace ALTER TABLE might fail with duplicate key error if concurrent insertions 
Also, I mostly agree with your description of what the code does, or would do, if executed.
Also, thanks to you, I've learned to my surprise that modify_clock is not being bumped when inserting a new record to a page.

Now, where I beg to differ is in assessment of the severity, and interpretation of what the code *should* do.
Given how shaky this code is, we have to abandon any hope of deriving meaning from names of constants or functions, and have to resort to analysing what the code actually does, and how the caller uses it, then see if it can lead to any data loss or wrong results.

For example, you wrote:
> BTR_PCUR_AFTER is expected to locate the Next Record of 'm_old_rec'
which sounds quite reasonable based on the name of the constant, and other clues. 
However, I would be more hesitant with judgement, until I see what exactly the caller of this code was trying to achieve and expect. One could imagine that the intent could be any of:
(a) restore the cursor to the user record which is the smallest record which is greater than the one in m_old_rec
(b) ... OR a supremum record which is right before it, if it is the first record on its page
(c) ... OR infimum record which is right before it, if it is the first record on its page
(d) something else
More importantly, we can't forget the context in which the btr_pcur_t was being stored: if at that time it pointed to a SUPREMUM, we might need to establish what was that signifying to the caller:
(A) that everything up to and including the user record preceeding the SUPREMUM was processed, and we'd wish to restart processing from the first record greater than that user record
(B) that everything up to the division point between this page and the next one (i.e. everything smaller than the min record value of the next page stored in the common ancestor of this page and the next page) was processed, and we'd wish to restart processing from the value greater or equal to this division point
(C) that everything up to the smallest user record on the next page was processed,...
(D) something else

Also, it's complicated by many other facts like:
- there are various wrapper functions around restore_position, which do additional corrections
- there are various wrapper functions an logic around store_position (for example direct modification of `pcur->m_rel_pos = BTR_PCUR_BEFORE;` in row_search_mvcc).
Often a bug in the inner layer, interacts with a bug in the wrapper, cancelling each other, so we have to thread carefully.
Fixing just one "bug", might actually introduce a data-loss or cause wrong results when taken as a whole.

Your report brings attention to several "fishy behaviours":
1. When m_rel_pos is BTR_PCUR_AFTER and cursor was pointing to SUPREMUM and btr_pcur_t::restore_position() succeeds with Optimistic Restore, the cursor will be again at the SUPREMUM, but (because we don't bump modify_clock on inserts) new user records could have been added between m_old_rec and SUPREMUM meanwhile
2. When m_rel_pos is BTR_PCUR_AFTER and cursor was pointing to SUPREMUM and btr_pcur_t::restore_position() fails with Optimistic Restore, the cursor might point to the user record which is the smallest record greater than m_old_rec. 
2.1. This alone is actually not so strange, but clearly different than behaviour in 1., so one has to wonder if both of them can be correct at the same time, given that which of the two results we will obtain depends on non-deterministic factors like the state of the BP.
2.2. Worse, you show that `sel_restore_position_for_mysql()` would react to such case by advancing the cursor one step forward, which can't be correct under any plausible interpretation, as it might skip over a record which existed even before the whole scenario has started

and somewhat symmetric:

3. When m_rel_pos is BTR_PCUR_BEFORE and cursor was on INFIMUM and btr_pcur_t::restore_position() succeeds with Optimistic Restore, the cursor will be again at the INFIMUM, but (because we don't bump modify_clock on inserts) new user records could have been added between INFIMUM and m_old_rec meanwhile
4. When m_rel_pos is BTR_PCUR_BEFORE and cursor was pointing to INFIMUM and btr_pcur_t::restore_position() fails with Optimistic Restore, the cursor might point to the user record which is the largest record smaller than m_old_rec. 
4.1. This alone is actually not so strange, but clearly different than behaviour in 3., so one has to wonder if both of them can be correct at the same time

I agree that this is what the code does.
Now let's asses:
- can it actually happen?
- is it bad?

You provide test case for "fishy behaviour 3.", so we know it happens, thanks.

Now, is it bad? 

I would claim it is very benign, because "not seeing a row which was not yet committed before your query has started" is not a data loss.
In particular, if that other transaction inserted the row slightly later, or in slightly different position (say, in the middle of the page, far too the right of the cursor), we would not consider it a bug that the scan has not reported it, would we?
It is quite expected and natural for READ UNCOMITTED and READ COMMITTED, which do not lock the gaps, that other transactions might insert rows into the scanned area, and even commit before we finish the query.
What would count as unexpected and wrong, would be if READ UNCOMITTED failed to report a row which was already there before the scan even started.
But, this bug report does not demonstrate such a thing, and I doubt it could happen.

One way to dismiss this test scenario as "not a bug" is to simply adopt an interpretation in which store_position() called at INFIMUM is meant to mean that 
"we have processed everything down to and including the position on the axis of possible values which is right after the previous page, i.e. 2+epsilon, and upon restoration we wish to continue from the largest user record which is smaller than 2+epsilon, or simply smaller or equal to 2"
OR
"we have processed everything down to and including the position on the axis of possible values which corresponds to the split point in the ancestor, i.e. 3, and upon restoration we wish to continue from the largest user record which is smaller than 3"
But, I guess your favored interpretation is:
"we have processed everything down to and including the last user record we saw, which was 5, so upon restoration we wish to continue from the largest user record which is smaller than 5"
which indeed is not satisfied, and therefore you see it as a bug.

Note however, that all 3 approaches seem to be fine conceptually, as basically we have a lot of freedom in interpration of "the gap" to the left or 5: we can claim that any position in left-side open range (2,5] is "the last processed position". (Or at least any in the range [3,5], as I could imagine some argumentation for why we can't claim to process anything which is to the left of the division point, without even trying to latch the left sibling and looking into its content)

This line of thinking, also makes me judge "4.1" as not a problem: yes we do a slightly different thing depending on BP state, but one could argue we simply use two valid (but different) choices of "the last processed position" in (2,5] range:
- for Optimistic Restoration, we can think of what's happening as if "the last processed position" was 3
- for Pessimistic Restoration, we can interpret it as if "the last processed position" was 5
and what follows is "attempt to restore it to a record which is to the left of the alst processed position":
- for Optimistic Restoration, we end up on INFIMUM
- for Pessimistic Restoration, we end up on whatever is now to the left of 5, could be the newly inserted 4 for example, or the 2 on the next page if nothing was inserted, etc.

At any rate I can't see how 3. or 4. can lead to "data loss" understood as "committed state is not visible to a transaction which started later".

I am much more worried with 1 and 2!

For 1. I am worried, because I know of at least one place in code which really assumed that no record can be inserted in between m_old_rec and SUPREMUM, (based on my wrong understanding of modify_clock):
PCursor::save_previous_user_record_as_last_processed() is being called when cursor points to SUPREMUM, and the corresponding 
PCursor::restore_to_first_unprocessed() is expected to put the cursor on the first not processed record. The hope was that in case of Optimistic Restoration to SUPREMUM, we will not miss any unprocessed record, but you've convinced me we will.
Thankfully, turns out that despite this misunderstanding (and consequently wrong code comments), this doesn't lead to any real problem, because:
- save_previous_user_record_as_last_processed() is used in context where we do a consistent read, so newly inserted records shouldn't be seen anyway due to MVCC
- save_previous_user_record_as_last_processed() is currently used 
	- for SPATIAL index DDLs in case where the thread holds MDL preventing INSERTS (and DELETEs, but, not Undo Log Purge, so physical deletions of records can still occur)
	- in move_to_next_block_at_leaf_level() used by PCursor::move_to_next_block(dict_index_t *index) used by Parallel_reader::Ctx::move_to_next_node(PCursor *pcursor) used by Parallel_reader::Ctx::traverse_recs(PCursor *pcursor, mtr_t *mtr) used by Parallel_reader::Ctx::traverse(). A context in which it is seems ok to skip rows which were inserted after the scan has started (most of the time we seem to be using MVCC, and for those occassions where we don't I don't see how we could rely on visiting "everything")

I've added an assert to the code to check for situations where we try to store_position when pointing to SUPREMUM, and in addition to PCursor::save_previous_user_record_as_last_processed() mentioned above we also do so:
- when reaching end of index. But this is not relevant much, as in such case we rarely continue the scan. But sometimes we do continue! This happens when the row_search_mvcc() has buffered several rows before reaching end of index, so DB_SUCCESS was returned to the iterator, the iterator made a few more rounds using up the buffered data, and meanwhile someone has inserted even more data to the table. This doesn't seem to be a problem though in practice. Either the scan was LOCKing, in which case it has places a lock before the last SUPREMUM preventing more inserts at the end of the table, or the scan was using MVCC, in which case those newly inserted rows aren't seen anyway, or the scan is using READ UNCOMITTED in which case we have not promissed anything. Again, I don't think we ever promissed anyone to see rows inserted after the scan has started.
- when aborting the query:
```
  if (trx_is_interrupted(trx)) {
    if (!spatial_search) {
      pcur->store_position(&mtr);
    }
    err = DB_INTERRUPTED;
    goto normal_return;
  }

  /*-------------------------------------------------------------*/
  /* PHASE 4: Look for matching records in a loop */
``` 
which is again probably not a big deal, as we will not continue to use this cursor anyway.
- in `IndexPurge::next()`, where hopefully no other thread is modifying the table being imported, so we don't have to worry about records being inserted or deleted

The "fishy behaviour 2.1." is exactly what we want in PCursor::restore_to_first_unprocessed(), and other places do not care.

The most worrying thing you've reported is "fishy behaviour 2.2.", because in theory it means a data-loss: skipping over a record which was there even before the scan has started.
Thankfully, it does not seem to occur in practice, because sel_restore_position_for_mysql() does not seem to be used in situations where cursor was at SUPREMUM, yet not at the end of the index.
I've verified that by using following patch:
```
diff --git a/storage/innobase/btr/btr0pcur.cc b/storage/innobase/btr/btr0pcur.cc
index 9bf34cea0276..0ade194db577 100644
--- a/storage/innobase/btr/btr0pcur.cc
+++ b/storage/innobase/btr/btr0pcur.cc
@@ -51,6 +51,7 @@ void btr_pcur_t::store_position(mtr_t *mtr) {
   auto rec = page_cur_get_rec(page_cursor);
   auto page = page_align(rec);
   auto offs = page_offset(rec);
+  m_was_on_supremum_not_end = false;

 #ifdef UNIV_DEBUG
   if (dict_index_is_spatial(index)) {
@@ -94,6 +95,7 @@ void btr_pcur_t::store_position(mtr_t *mtr) {
     rec = page_rec_get_prev(rec);

     m_rel_pos = BTR_PCUR_AFTER;
+    m_was_on_supremum_not_end = !is_after_last_in_tree(mtr);

   } else if (page_rec_is_infimum_low(offs)) {
     rec = page_rec_get_next(rec);
diff --git a/storage/innobase/include/btr0pcur.h b/storage/innobase/include/btr0pcur.h
index 24aa680248b4..b1dbaff1d142 100644
--- a/storage/innobase/include/btr0pcur.h
+++ b/storage/innobase/include/btr0pcur.h
@@ -469,6 +469,8 @@ struct btr_pcur_t {
   whether cursor was on, before, or after the old_rec record */
   btr_pcur_pos_t m_rel_pos{BTR_PCUR_UNSET};

+  bool m_was_on_supremum_not_end{false};
+
   /** buffer block when the position was stored */
   buf::Block_hint m_block_when_stored;

diff --git a/storage/innobase/row/row0sel.cc b/storage/innobase/row/row0sel.cc
index 5c99fd148b93..f670559dccca 100644
--- a/storage/innobase/row/row0sel.cc
+++ b/storage/innobase/row/row0sel.cc
@@ -3409,6 +3409,7 @@ static bool sel_restore_position_for_mysql(
     mtr_t *mtr)          /*!< in: mtr; CAUTION: may commit
                          mtr temporarily! */
 {
+  const auto was_on_supremum_not_end = pcur->m_was_on_supremum_not_end;
   auto success = pcur->restore_position(latch_mode, mtr, UT_LOCATION_HERE);

   *same_user_rec = success;
@@ -3431,6 +3432,7 @@ static bool sel_restore_position_for_mysql(
       ut_o(return (true));
     case BTR_PCUR_ON:
       if (!success && moves_up) {
+        ut_a(!was_on_supremum_not_end);
       next:
         pcur->move_to_next(mtr);
         return true;
@@ -3440,6 +3442,7 @@ static bool sel_restore_position_for_mysql(
     case BTR_PCUR_BEFORE_FIRST_IN_TREE:
       return true;
     case BTR_PCUR_AFTER:
+      ut_a(!was_on_supremum_not_end);
       /* positioned to record after pcur->old_rec. */
       pcur->m_pos_state = BTR_PCUR_IS_POSITIONED;
     prev:
```
and running the whole test suite and it seemed to hold.

As for your patch, I like your idea at hand-waving level.
"Something like this" will probably be needed.
But I am not sure about some of the details. 
I like the general idea that we should try to get to the point where having the m_rel_pos equal to BTR_PCUR_AFTER/BEFORE/ON before calling restore_position() means it will be restored to a record with corresponding relation to m_old_rec.
What I am not sure about is what should be the contract for the value of this field once the restore_position() returns.
In particular what's not clear to me is:
- what are the consequences of not keeping the original m_rel_pos when returning from restore_position()
- what are the consequences of returning false in case of optimistic restoration
- why do we have this code for moving to next record in case of BTR_PCUR_BEFORE && moves_up
- why do we have this code for moving to next record in case of BTR_PCUR_AFTER && !moves_up
The test coverage for some of these cases is very low.
For example here are coverage numbers for existing code:
```
3412   235362941 :   auto success = pcur->restore_position(latch_mode, mtr, UT_LOCATION_HERE);
3413             : 
3414   235362938 :   *same_user_rec = success;
3415             : 
3416   235362938 :   ut_ad(!success || pcur->m_rel_pos == BTR_PCUR_ON);
3417             : #ifdef UNIV_DEBUG
3418   235362938 :   if (pcur->m_pos_state == BTR_PCUR_IS_POSITIONED_OPTIMISTIC) {
3419     6885583 :     ut_ad(pcur->m_rel_pos == BTR_PCUR_BEFORE);
3420             :   } else {
3421   228477355 :     ut_ad(pcur->m_pos_state == BTR_PCUR_IS_POSITIONED);
3422   228477355 :     ut_ad((pcur->m_rel_pos == BTR_PCUR_ON) == pcur->is_on_user_rec());
3423             :   }
3424             : #endif /* UNIV_DEBUG */
3425             : 
3426             :   /* The position may need be adjusted for rel_pos and moves_up. */
3427             : 
3428   235362937 :   switch (pcur->m_rel_pos) {
3429           0 :     case BTR_PCUR_UNSET:
3430           0 :       ut_d(ut_error);
3431   227646204 :       ut_o(return (true));
3432   227646204 :     case BTR_PCUR_ON:
3433   227646204 :       if (!success && moves_up) {
3434        1311 :       next:
3435        1311 :         pcur->move_to_next(mtr);
3436        1311 :         return true;
3437             :       }
3438             :       return (!success);
3439             :     case BTR_PCUR_AFTER_LAST_IN_TREE:
3440             :     case BTR_PCUR_BEFORE_FIRST_IN_TREE:
3441             :       return true;
3442      830970 :     case BTR_PCUR_AFTER:
3443             :       /* positioned to record after pcur->old_rec. */
3444      830970 :       pcur->m_pos_state = BTR_PCUR_IS_POSITIONED;
3445     7716535 :     prev:
3446     7716535 :       if (pcur->is_on_user_rec() && !moves_up) {
3447          12 :         pcur->move_to_prev(mtr);
3448             :       }
3449             :       return true;
3450     6885735 :     case BTR_PCUR_BEFORE:
3451             :       /* For non optimistic restoration:
3452             :       The position is now set to the record before pcur->old_rec.
3453             : 
3454             :       For optimistic restoration:
3455             :       The position also needs to take the previous search_mode into
3456             :       consideration. */
3457             : 
3458     6885735 :       switch (pcur->m_pos_state) {
3459     6885583 :         case BTR_PCUR_IS_POSITIONED_OPTIMISTIC:
3460     6885583 :           pcur->m_pos_state = BTR_PCUR_IS_POSITIONED;
3461     6885583 :           if (pcur->m_search_mode == PAGE_CUR_GE) {
3462             :             /* Positioned during Greater or Equal search
3463             :             with BTR_PCUR_BEFORE. Optimistic restore to
3464             :             the same record. If scanning for lower then
3465             :             we must move to previous record.
3466             :             This can happen with:
3467             :             HANDLER READ idx a = (const);
3468             :             HANDLER READ idx PREV; */
3469     6885565 :             goto prev;
3470             :           }
3471             :           return true;
3472         152 :         case BTR_PCUR_IS_POSITIONED:
3473         152 :           if (moves_up && pcur->is_on_user_rec()) {
3474           0 :             goto next;
3475             :           }
3476             :           return true;
3477             :         case BTR_PCUR_WAS_POSITIONED:
3478             :         case BTR_PCUR_NOT_POSITIONED:
3479             :           break;
3480             :       }
3481             :   }
```

I am lowering the severity of the bug, because I don't see how it can lead to data-loss/wrong answer.
I am open to re-assesing it, if you can show a more severe scenario.

Posted by developer:

I think we might need to redesign restore_position().
I doubt the current approach where this function has no arguments can be rigurously and meaningfully defined.
Given that the cursor could have pointed either to a USER record or INFIMUM or SUPREMUM and that records could be inserted or deleted and pages could have been merged or split, the caller, IMO, has to hint and what do they really mean by "restoring" a cursor.
For that it might be helpful to look at what the caller is doing with the cursor:
Is the caller going UP or DOWN?
Is the caller planning to USE the record pointed by the cursor, or IGNORE it and move to the next one?
(If this later question seems confusing, consider that sometimes we save+restore the cursor before processing it, yet in other loops we rather do it after processing a record. To see if your loop is one kind or the other, ask yourself what will it do right after the call to restore_position.)

I think our current approach to store_position() is good enough and quite well defined.
Let me just describe how I understand the important parts of it.
The goal of store_position() is to describe in some useful way, where the cursor is currently. We have to do it in a language of BEFORE/ON/AFTER m_old_rec, where m_old_rec is a copy of a value of a record.
Making a copy is necessary as typically latches are released after store_position(), so the state of the index can change and the record pointed by the cursor can get freed.
It's natural what to do if the cursor points to a USER record: just copy it to m_old_rec, and remember the cusor was ON (==) the m_old_rec.
If the cursor was on INFIMUM, it's not very clear what we mean by "cursor position". We would like to express it in terms of record values, but INFIMUM is quite ambiguous - it is some value between the last USER record on previous page and the first USER record on this page, but various people might have various opinions on which one is it exactly (or if it should be modeled as range, should it be open, closed, and where does it start and end exactly).
Thankfully, we don't need to resolve this ambiguity to have a well defined behaviour of store_position().
We simply *define* the behaviour in this case to be: copy first USER record to m_old_rec, and remember the cursor was BEFORE it. This description of the state is truthful, and as we shall see useful enough.
By symmetry, we handle SUPREMUM by copying last USER record to m_old_rec and remembering the cursor was AFTER it.
(For now, let me ignore the "special" cases of empty index, which the current implementation of store_position() handles in particular way)
Also, we save the pointer to the block, the rec, and modification_clock, for Optimistic Restore.

Now, let us define restore_position(bool moves_up, bool will_use) primitive, which the caller can use to restore the cursor while moving UP or DOWN, when they plan to immediately USE or IGNORE the restored cursor position.
The following table describes, what I believe is the correct way of searching for a record, depending on the use case.
The way to read this table is that, for example:
"cursor was on INFIMUM BEFORE m_old_rec, caller moves UP and will USE the record, so we should find a record which is GE than m_old_rec"

cursor was INFIFMUM USER SUPREMUM
BEFORE ON AFTER m_old_rec

caller: we should find:
moves will
UP USE GE GE G
IGNORE L LE LE

DOWN USE L LE LE
IGNORE GE GE G
than m_old_rec

How did I arive at these particular values?
For each scenario, I was imagining 2x2 sub-scenarios:
{the record with m_old_rec was deleted or not} x {the page got merged with its siblings or not}
and asking myself: "Which record I wish the cursor to be positioned to in a given sub-scenario?"
(With the criteria being: make sure that very record which existed before store_position() will be processed exactly once)
Then I found a single comparison operator (GE,G,L,LE) which gives the right answer regardless of which of the 2x2 sub-scenarios happened.
This is because the code in restore_position(moves_up, will_use) only knows the scenario (m_rel_pos, moves_up, will_use), but doesn't know which sub-scenario occurred meanwhile, yet has to decide on search mode (GE,G,L,LE) for the dive.

There are some elegant "symmetries" and "anti-symmetries" in this table which makes me quite confident I am onto something.
These symmetries also make it possible to combine the two boolean arguments into one: moves_up XOR will_use.
But I am not sure what is the intuitive meaning of it (round_down?), and if that would really make it easier to think about.
This area proved very error-prone in the past, and we should aim for clarity - typically it involves describing intent/situation rather then solution/approach, thus I believe (bool moves_up, bool will_use) to be a better interface than (bool round_down).

Above table describes how the "Pesimistic Restore", i.e. one which dives into the B-tree, should behave.
What about "Optimistic Restore", i.e. when the page is still in the Buffer Pool, and modify_clock hasn't changed (no record was deleted from it, perhaps new were inserted)?

There are two approaches to this question.

A) We can insist on the behaviour which matches Pesimistic Restore exactly. This is easy for USER as the column has only GE and LE operators, which permit equality, and m_old_rec==rec, so we don't need to reposition anything.
For the case where cursor was on SUPREMUM, things get tricky. The modify_clock is not counting inserts, so there could be some new records inserted in between m_old_rec and SUPREMUM. If the operator is G, we should restore to smallest of them, not the SUPREMUM. There could be hundreds of such new records, and our rec pointed to the SUPREMUM, so we now need to iterate backwards, until we find the one equal to m_old_rec, and then move one step forward. This can be quite a lot of perhaps costly comparisons. An alternative would be to save not only the m_old_rec's *value*, but also *address*, so that we don't need this costly loop, and can jump right to the record equal to m_old_rec. One way to achieve this technically in code, is to adjust the cursor position by moving it back one step in store_position(). The LE case is similar, just a bit simpler as we don't have to move forward one step: the m_old_rec is equal to itself so satisfies LE.
As for INFIMUM, the story is similar, just in reverse.

B) We could observe that any usage of cursor needs to somehow handle the records which were inserted "in parallel", and typically we just ignore any such insert which happens "behind" (w.r.t. to moving direction) the cursor. In some cases we try to prevent it from happening by placing data locks on gaps, in others we just don't care. AFAICT there is no place in our code right now, which could fail to produce correct results, if we simply ignore the records inserted "behind" a cursor which was placed at SUPREMUM. Therfore, one - IMO valid - approach to Optimisitic Restore is to just keep the cursor position as it was before store_position()

I have no strong opinion. I kinda like the brutal simplicity of "Optimistic Restore" in B, as well as the elegance of predictable behaviour of A.

Above approach assumes there's no "post-processing" done by the callers of store_position() and restore_position(moves_up, will_use).
In our current codebase this is not true.
There are callers who manually modify m_rel_pos after store_position().
There are callers, who try do some post-processing after restore_position() based on things like m_rec_pos, "success" or "optimism" of the operation.
As we saw, this is very error-prone.
I think getting rid of that would be a good thing.

There's one elephant in the room though: not all search modes are supported for cursors at non-leaf level.
```
/* Currently, PAGE_CUR_LE is the only search mode used for searches
ending to upper levels */

ut_ad(level == 0 || mode == PAGE_CUR_LE || RTREE_SEARCH_MODE(mode));
```
Thankfully, there are not many usages of persistent cursors at non-leaf levels - AFAIU there's just one: HISTOGRAM sampler.
This particular use case moves UP, will IGNORE the cursor, and calls store_position() only when the cursor is on SUPREMUM.
(Moreover, it doesn't really care too much about skipping or processing something twice, as it is probabilistic sampler anyway)
This means we only need LE mode, which is good news.

If need arises, we could try to "emulate" GE and G modes, by first using LE and then adjusting the position.
But this is a bit tricky, as we may need to move to the next page, or accept that the cursor points to SUPREMUM instead of a USER record with a given property.
For example when we need to find the samllest record which is Greater than 7, we might start by searching for the largest record Lower-or-Equal to 7, which could, be
a) USER record <7, which implies 7 is deleted
b) USER record 7
c) INFIMUM (right before smallest record which is Greater than 7, or index is empty), (which AFAIU implies 7 is deleted, but I must check my notes)
Then step one record forward within the page, which could end up respectively on
a) the smallest USER record which is Greater than 7 or SUPREMUM right before it
b) the smallest USER record which is Greater than 7 or SUPREMUM right before it
c) the smallest USER record which is Greater than 7 or SUPREMUM (if index is empty)

The "emulation" for GE is similar, except we don't take step forward in (b).

Note that ending up on SUPREMUM in case of G or GE, is already a typical behaviour for leaf-level cursors, and typically handled without any problem - no matter if the caller plans to USE or IGNORE the cursor, there's typically not much one can "do" to "use" a SUPRERMUM record, so the caller will probably move to the next one anyway.

I don't think there's an easy way to emulate L, because LE can sometimes land on INFIMUM, in which case we would have to move to previous sibling, which is very tricky business due to latching order (deadlock avoidance).

Anyway, for now, all that we need for non-leaf levels seems to be LE, and we can simply assert that nothing else is used.

There's a different perspective on all that, in which the claim from the reporter makes perfect sense:
The restore_position() function already has a well specified contract in its doxygen:
```
 /** Restores the stored position of a persistent cursor bufferfixing
 the page and obtaining the specified latches. If the cursor position
 was saved when the
 (1) cursor was positioned on a user record: this function restores
 the position to the last record LESS OR EQUAL to the stored record;
 (2) cursor was positioned on a page infimum record: restores the
 position to the last record LESS than the user record which was the
 successor of the page infimum;
 (3) cursor was positioned on the page supremum: restores to the first
 record GREATER than the user record which was the predecessor of the
 supremum.
 (4) cursor was positioned before the first or after the last in an
 empty tree: restores to before first or after the last in the tree.
 @param[in]        latch_mode  BTR_SEARCH_LEAF, ...
 @param[in,out]  mtr                 Mini-transaction
 @param[in]        location            Location where called.
 @return true if the cursor position was stored when it was on a user
         record and it can be restored on a user record whose ordering
         fields are identical to the ones of the original user record */
```
I find this contract not very useful, because when this function returns `false` it is not clear to the caller, if the record which the cursor points to now is to be processed or not, because that might depend on whether or not at the moment of saving the position the cursor pointed to SUPREUM or USER record. If it was on SUPREMUM (and Pessimistic Restoration occurred) then the cursor points to the first record greater than the saved predecessor, so it points to something which the caller certainly has not seen nor processed yet. OTOH if it was on a USER record, then it now points one position before it (false being returned means we couldn't find a row exactly equal to the saved one) which is either a record the caller already has seen&processed, or some newly inserted record which the caller should ignore. Either way, the caller shouldn't process this record.
Thus, the only way to use this API is to either:
(a) assert success, as many callers apparently do - as in many places in our code we are sure that at the moment of saving the cursor it was pointing to a USER record *and* that nobody could have removed it since then, because we hold some form of data lock on it
(b) have to (re)implement a boilerplate switch(m_rel_pos) which tries to recall what was the situation when saving, to make appropriate adjustments, which as the reporter has noticed, doesn't really work correctly

Still, this is the contract we've promised, and this bug report clearly shows that:
1. we violate this contract in case of Optimistic Restore
2. the logic in callers assumes that m_rel_pos is not modified by restore_position, even though this doesn't seem to be *clearly* stated in the contract (it is somewhat gestured at in the doxygen for m_rel_pos) 

In this light, I am sympathetic to the idea of at least making sure the code in restore_position() satisfies the contract from its own doxygen.
Such change would hopefully be a bit smaller in scope, and hopefully easier to test/convince ourselves of not introducing regressions, as basically the change would simply ensure that some of edge cases behave more in line with the typical cases which the code already knows how to handle.
I think the contributed patch is indeed a good effort in this direction, but I have not verified it rigorously yet.

One thing I am not sure yet is why exactly restore_position() calls store_position(mtr). The code comment says:
```
  /* We have to store new position information, modify_clock etc.,
  to the cursor because it can now be on a different page, the record
  under it may have been removed, etc. */
```
but it is not clear at all to me why any of that is important. Where exactly do we need the m_old_stored, m_old_rec, m_block_when_stored or m_modify_clock to be updated?
Thus perhaps instead of the proposed
```
+  auto old_rel_pos = m_rel_pos;
   store_position(mtr);
+  m_rel_pos = old_rel_pos;
```
we should rather consider removing this call to store_position(mtr) entirely? 

Anyway, I now think that the way forward with this mess is to first apply this contribution patch, to make the behaviour match the spec, then, as a separate effort, try to come up with a less error-prone API.

Once again, thanks for great work!

Hi Jakub, the call to store_position() at the end of a pessimistic restore_position() appears necessary primarily to update the m_rel_pos member. 

During a pessimistic restore, the B-tree search might land the cursor in a position relative to the original m_old_rec that differs from the state indicated by m_rel_pos before the restore (e.g., finding a user record for infimum record and thus changing the state from BTR_PCUR_BEFORE to BTR_PCUR_ON). Since calling functions like sel_restore_position_for_mysql inspect m_rel_pos after the restore to make further adjustments, store_position() ensures m_rel_pos accurately reflects the actual outcome of the restoration attempt, preventing incorrect behavior based on potentially stale state information. I do agree with you that the comments 'We have to store new position information, modify_clock etc.' seems to make no sense.

Hello Yichang SONG,

> Hi Jakub, the call to store_position() at the end of a pessimistic restore_position() appears necessary primarily to update the m_rel_pos member. 

Note that the proposed contribution patch does precisely the opposite: in case of Pessimistic Restore updates all the other fields, but keeps the m_rel_pos as it was. 

I am worried that this is not a correct thing to do, because:
1. calling store_position() from restore_position() seems to only make sense under assumption that some callers expect the restore_position() to leave the cursor in "m_old_stored=true" state, so that they can call restore_position() again
2. but under the very same assumption, "desynchronizing" `m_rel_pos` with `m_old_rec` is a pretty bad idea, because if the caller indeed calls restore_position() twice, they will end up on a different record than they wanted

So, I am now pondering if a better patch would be something where we keep the original:
```
  /* We have to store new position information, modify_clock etc.,
  to the cursor because it can now be on a different page, the record
  under it may have been removed, etc. */

  store_position(mtr);

  return (false);
```
and instead fix the logic in `sel_restore_position_for_mysql` to not assume that `m_rel_pos` is not modified by the call to `restore_position`, and instead save the old value of this field first in a local variable.

There's one more issue with the contribution patch which makes me worried: it causes the store_position() to have a side effect of nudging the cursor if it is on INFIMUM or SUPREMUM, which is a bit risky when combined with the fact we call store_position() at the end of restore_position().

It's complicated.

Posted by developer:
 
I've tried to apply the contribution patch to mysql-trunk 
It seems to fail multiple tests (some of them non-deterministically), for example:
binlog.binlog_grant_alter_user encryption.alter_rename_table i_innodb.innodb_bug14621190_debug_sync i_main.acl-bug23295423  innodb.innodb_64k innodb.innodb_multi_update innodb.instant_ddl_import_charset_drop_enum_compact innodb.instant_ddl_import_charset_drop_enum_dynamic innodb.instant_ddl_import_charset_drop_enum_redundant innodb_gis.alter_table_rebuild_missing_record innodb_gis.rtree_multi_pk main.alter_table main.alter_table_myisam main.delete main.fix_priv_tables main.foreign_key_debug main.gis main.information_schema_statistics main.sp main.subquery_sj_all_bka_nobnl main.system_tables_myisam_lctn_1 main.type_enum parts.partition-dml-1-1-innodb  rpl.rpl_innodb_bug28430 
  sys_vars.updatable_views_with_limit_func x.admin_list_objects x.admin_list_objects_docpath x.admin_no_backslash_escapes x.collection_compatibility x.delete_sql_o

Some are crashing, some are reporting same row multiple times:
```
[ 75%] parts.partition-dml-1-1-innodb           w1  [ fail ]
        Test ended at 2025-04-29 13:01:55

CURRENT_TEST: parts.partition-dml-1-1-innodb
--- /export/home/tmp/jlopusza/mysql/mysql-test/suite/parts/r/partition-dml-1-1-innodb.result    2025-04-11 11:55:36.434319779 +0300
+++ /export/home/tmp/jlopusza/mysql-debug/mysql-test/var/1/log/partition-dml-1-1-innodb.reject  2025-04-29 14:01:54.512039678 +0300
@@ -179,6 +179,7 @@
 21     subp3-upd
 22     subp4-upd
 23     subp5-upd
+23     subp5-upd
 24     subp3-upd
 3      subp3-upd
 4      subp4-upd

mysqltest: Result content mismatch
```

The patch did not apply cleanly as we have recently changed some methods of PCursor, but I doubt this is because of this.

Hello Jakub, thank you for your thorough reply! I've been on vacation in the last few days, sorry for replying so late.

Actually, I wonder as much as you why 'restore_position' has to call 'store_position' at the end. But since I'm not sure if I can remove the call here, and the 'sel_restore_position_for_mysql' assume 'm_rel_pos' is not modified, so I keep the 'm_rel_pos' unchanged in the 'restore_position'.

This patch does result in 'restore_position' being called twice and then the cursor is misplaced, so a more elegant approach might be to consider whether we can remove the call to 'store_position'.

After discussing internally with Kuba, changing status to Sev3.

regards,
Umesh