Bug #118983 PCursor::move_to_next_block is not functioning as expected
Submitted: 10 Sep 6:59 Modified: 11 Sep 13:03
Reporter: Ruyi Zhang (OCA) Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: InnoDB storage engine Severity:S3 (Non-critical)
Version:8.0.43 OS:Any
Assigned to: CPU Architecture:Any
Tags: histogram, Parallel-Scan

[10 Sep 6:59] Ruyi Zhang
Description:
Recently, we have been using Parallel Scan (`Parallel_reader::add_scan`) to scan pages at level=1, in order to implement certain logic for prefetching certain pages. 

However, we noticed that when scanning pages via PCursor, PCursor::move_to_next_block does not seem to function as expected, especially when scanning levels other than 0.

These unexpected issues are mainly as follows:

1. We noticed that both `move_to_next_block_at_leaf_level` and `move_to_next_block_at_level_one` internally check for lock waits on the current index. When a wait is detected, they release all locks in `mtr_t` and save the cursor. 

For `move_to_next_block_at_leaf_level`, since `index->lock` is not held either externally or internally, the checking logic appears redundant. 

For `move_to_next_block_at_level_one`, although it holds `index->lock` in S-mode externally, releasing the locks managed by `mtr_t` internally fails to release the lock on `index` as expected, making the checking logic equally ineffective.

2. In `move_to_next_block_at_level_one`, if a lock wait is detected on the index or the lock for the next_block cannot be immediately acquired due to lock ordering reasons, it will save the cursor and restore it after releasing the latch. 

If the cursor cannot be optimistically restored, it will restore the cursor via `btr_cur_search_to_nth_level`. 

The problem is that `btr_cur_search_to_nth_level` always prioritizes matching AHI and ignores the specified level. 

This causes an incorrect restore of the cursor at level=0 when attempting to restore it at level=1.

3. In `move_to_next_block_at_level_one`, the `next_block` is accessed using the no-wait locking method to avoid deadlocks caused by violating the lock ordering. 

However, this improvement still seems to have issues: although `next_block` is accessed in `RW_NO_LATCH` mode, if `next_block` is in the `IO_READ` state, an S-lock will still be introduced in `buf_wait_for_read`. 

If `next_block` has been X-locked by another thread performing SMO and is accessing the current block, a deadlock may still occur.

==========================================================

Additionally, I'd like to ask a question: Since `move_to_next_block_at_level_one` has used the `NO_WAIT` mode to avoid issues caused by lock ordering, is the S-mode `index->lock` held by the outer layer still necessary?

How to repeat:
For 2, this MTR can reproduce the issue

@@ -573,7 +576,9 @@ dberr_t PCursor::move_to_next_block_at_level_one(dict_index_t *index) {
   ut_ad(m_read_level != 0);
   dberr_t err = DB_SUCCESS;
 
-  if (rw_lock_get_waiters(dict_index_get_lock(index))) {
+  if (DBUG_EVALUATE_IF("pcursor_move_to_next_block_one_level_release_latches",
+                       true, false) ||
+      rw_lock_get_waiters(dict_index_get_lock(index))) {
     /* There are waiters on the index tree lock. Store and restore
     the cursor position, and yield so that scanning a large table
     will not starve other threads. */
@@ -583,6 +588,8 @@ dberr_t PCursor::move_to_next_block_at_level_one(dict_index_t *index) {
 
     save_level_one_progress();
 
+    DEBUG_SYNC_C("pcursor_move_to_next_block_level_one_latches_released");
+
     /* Yield so that another thread can proceed. */
     std::this_thread::yield();
 

```
# This test file verifies the error where AHI always returns leaf nodes
# regardless of the search level when move_to_next_block_at_level_one restore cursor pessimistically.

connect  con1, localhost, root,,;

--connection default
create table a(id varbinary(3000) primary key, val varbinary(300));
insert into a values (repeat('a', 3000), repeat('x', 300));
insert into a values (repeat('b', 3000), repeat('x', 300));
insert into a values (repeat('c', 3000), repeat('x', 300));
insert into a values (repeat('d', 3000), repeat('x', 300));

insert into a values (repeat('e', 3000), repeat('x', 300));
insert into a values (repeat('f', 3000), repeat('x', 300));
insert into a values (repeat('g', 3000), repeat('x', 300));
insert into a values (repeat('h', 3000), repeat('x', 300));
insert into a values (repeat('i', 3000), repeat('x', 300));
insert into a values (repeat('j', 3000), repeat('x', 300));
insert into a values (repeat('k', 3000), repeat('x', 300));
insert into a values (repeat('l', 3000), repeat('x', 300));
insert into a values (repeat('m', 3000), repeat('x', 300));
insert into a values (repeat('n', 3000), repeat('x', 300));

begin;
insert into a values (repeat('o', 3000), repeat('x', 300));
insert into a values (repeat('p', 3000), repeat('x', 300));
insert into a values (repeat('q', 3000), repeat('x', 300));
insert into a values (repeat('r', 3000), repeat('x', 300));
insert into a values (repeat('s', 3000), repeat('x', 300));

# The BTree is:
#                                           Root(4)recs=[a,g]
#                                             /             \
#                        Level=1 Page(11)recs=[a,c]      Page(12)recs=[g,k,o,s]
#                                                I(cursor with level=1)
#                        Level=0 ......

# Make the key key=repeat('c', 3000) cached by AHI
--disable_query_log
let $i = 0;
while ($i < 200) {
  --disable_query_log
  select length(id) from a where id=repeat('d',3000);
  inc $i;
}
--enable_query_log

--connection con1
# move_to_next_block_at_level_one will save the cursor at key=repeat('c', 3000) and restore it after the subsequent rollback is completed.
# However, the rollback prevents the cursor from being optimistically restored.
# When restoring via btr_cur_search_to_nth_level, the AHI is incorrectly used, causing the level to change from non-zero to zero.

SET GLOBAL DEBUG="+d,pcursor_move_to_next_block_one_level_release_latches";
SET DEBUG_SYNC="pcursor_move_to_next_block_level_one_latches_released SIGNAL latch_released WAIT_FOR continue";
send ANALYZE TABLE a UPDATE HISTOGRAM ON val;

--connection default
SET DEBUG_SYNC="now WAIT_FOR latch_released";
rollback;
# After Rollback, The BTree is:
#                                           Root(4)recs=[a]
#                                            /           \
#                        Level=0 ......

SET DEBUG_SYNC="now SIGNAL continue";

--connection con1
reap;
SET GLOBAL DEBUG="-d,pcursor_move_to_next_block_one_level_release_latches";
drop table a;
```

Suggested fix:
For 2:
`btr_cur_search_to_nth_level` retrieves AHI **only when level=0**.
[10 Sep 7:19] Ruyi Zhang
Update the reproducible patch

diff --git a/storage/innobase/row/row0pread.cc b/storage/innobase/row/row0pread.cc
index 43dd5e5188b..82dd77f19b5 100644
--- a/storage/innobase/row/row0pread.cc
+++ b/storage/innobase/row/row0pread.cc
@@ -381,6 +381,9 @@ dberr_t PCursor::restore_level_one_progress() noexcept {
 
   restore_to_largest_le_position_saved();
 
+  ut_ad(btr_page_get_level(buf_block_get_frame(get_page_cursor()->block)) ==
+        m_read_level);
+
   /* Move to the next record on the page */
   page_cur_move_to_next(get_page_cursor());
 
@@ -573,7 +576,9 @@ dberr_t PCursor::move_to_next_block_at_level_one(dict_index_t *index) {
   ut_ad(m_read_level != 0);
   dberr_t err = DB_SUCCESS;
 
-  if (rw_lock_get_waiters(dict_index_get_lock(index))) {
+  if (DBUG_EVALUATE_IF("pcursor_move_to_next_block_one_level_release_latches",
+                       true, false) ||
+      rw_lock_get_waiters(dict_index_get_lock(index))) {
     /* There are waiters on the index tree lock. Store and restore
     the cursor position, and yield so that scanning a large table
     will not starve other threads. */
@@ -583,6 +588,8 @@ dberr_t PCursor::move_to_next_block_at_level_one(dict_index_t *index) {
 
     save_level_one_progress();
 
+    DEBUG_SYNC_C("pcursor_move_to_next_block_level_one_latches_released");
+
     /* Yield so that another thread can proceed. */
[11 Sep 13:03] MySQL Verification Team
Thanks for the report and test case.