Bug #114809 inconsistent page no in btr_pcur_t's move_to_next_page
Submitted: 29 Apr 7:35 Modified: 29 Apr 10:20
Reporter: biao li Email Updates:
Status: Can't repeat Impact on me:
None 
Category:MySQL Server: InnoDB storage engine Severity:S3 (Non-critical)
Version:8.0.32 OS:Any
Assigned to: CPU Architecture:Any

[29 Apr 7:35] biao li
Description:
InnoDB abort during query : SELECT * FROM tt_91_p PARTITION (p2) WHERE i1 = 439071

2024-03-17T04:02:20.688490Z 80 [ERROR] [MY-013183] [InnoDB] Assertion failure: btr0pcur.cc:336:btr_page_get_prev(next_page, mtr) == get_block()->page.id.page_no() thread 140103323744000
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to http://bugs.mysql.com.
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/8.0/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.
2024-03-17T04:02:20Z UTC - mysqld got signal 6 ;
Most likely, you have hit a bug, but this error can also be caused by malfunctioning hardware.
BuildID[sha1]=3b222c127b490c4851419d93b262a041f58a652e
Thread pointer: 0x7f6cca0b5800
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 7f6c58d7d2d0 thread_stack 0x100000
/data/percona/sql/bin/mysqld(my_print_stacktrace(unsigned char const*, unsigned long)+0x3d) [0x5605054e8d7d]
/data/percona/sql/bin/mysqld(print_fatal_signal(int)+0x3b3) [0x56050412b333]
/data/percona/sql/bin/mysqld(my_server_abort()+0x66) [0x56050412b436]
/data/percona/sql/bin/mysqld(my_abort()+0xa) [0x5605054e184a]
/data/percona/sql/bin/mysqld(ut_dbg_assertion_failed(char const*, char const*, unsigned long)+0x323) [0x56050588a763]
/data/percona/sql/bin/mysqld(btr_pcur_t::move_to_next_page(mtr_t*)+0x1d0) [0x5605058e7590]
/data/percona/sql/bin/mysqld(row_search_mvcc(unsigned char*, page_cur_mode_t, row_prebuilt_t*, unsigned long, unsigned long)+0x13bf) [0x5605057e558f]
/data/percona/sql/bin/mysqld(ha_innobase::general_fetch(unsigned char*, unsigned int, unsigned int)+0x298) [0x560505632448]
/data/percona/sql/bin/mysqld(non-virtual thunk to ha_innopart::rnd_next_in_part(unsigned int, unsigned char*)+0xef) [0x5605056421cf]
/data/percona/sql/bin/mysqld(Partition_helper::ph_rnd_next(unsigned char*)+0x3b) [0x560503e6f6cb]
/data/percona/sql/bin/mysqld(handler::ha_rnd_next(unsigned char*)+0x62) [0x560504262252]
/data/percona/sql/bin/mysqld(TableScanIterator::Read()+0xd1) [0x5605043e83d1]
/data/percona/sql/bin/mysqld(FilterIterator::Read()+0x14) [0x560504587694]
/data/percona/sql/bin/mysqld(Query_expression::ExecuteIteratorQuery(THD*)+0x2b3) [0x56050407c473]
/data/percona/sql/bin/mysqld(Query_expression::execute(THD*)+0x2c) [0x56050407c7fc]
/data/percona/sql/bin/mysqld(Sql_cmd_dml::execute(THD*)+0x25f) [0x560503ffc8af]
/data/percona/sql/bin/mysqld(mysql_execute_command(THD*, bool)+0xce6) [0x560503f8cf36]
/data/percona/sql/bin/mysqld(dispatch_sql_command(THD*, Parser_state*)+0x440) [0x560503f91f50]
/data/percona/sql/bin/mysqld(dispatch_command(THD*, COM_DATA const*, enum_server_command)+0xf9b) [0x560503f934cb]
/data/percona/sql/bin/mysqld(do_command(THD*)+0x207) [0x560503f95367]
/data/percona/sql/bin/mysqld(threadpool_process_request(THD*)+0xcf) [0x560505f1ea8f]
/data/percona/sql/bin/mysqld(+0x3267795) [0x560505f20795]
/data/percona/sql/bin/mysqld(+0x31ae895) [0x560505e67895]
/usr/lib64/libpthread.so.0(+0x8f4b) [0x7f6cdd8fbf4b]
/usr/lib64/libc.so.6(clone+0x3f) [0x7f6cdd3e3b8f]
 
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (7f6c3c59a7b8): SELECT * FROM tt_91_p PARTITION (p2) WHERE i1 = 439071
Connection ID (thread ID): 80
Status: NOT_KILLED

How to repeat:
Unable to reproduce stably. maybe related to compression page.

Suggested fix:
When I debug the coredump file, found some strange things:

#8  0x000056050588a763 in ut_dbg_assertion_failed (expr=expr@entry=0x560506dfee98 "btr_page_get_prev(next_page, mtr) == get_block()->page.id.page_no()", file=file@entry=0x560506dfee38 "/ci/workspace/hwsql/storage/innobase/btr/btr0pcur.cc", line=line@entry=336)
    at /ci/workspace/hwsql/storage/innobase/ut/ut0dbg.cc:99
#9  0x00005605058e7590 in btr_pcur_t::move_to_next_page (this=this@entry=0x7f6c170fd2c8, mtr=mtr@entry=0x7f6c58d7af60) at /ci/workspace/hwsql/storage/innobase/btr/btr0pcur.cc:336
#10 0x00005605057e558f in btr_pcur_t::move_to_next (mtr=0x7f6c58d7af60, this=0x7f6c170fd2c8) at /ci/workspace/hwsql/storage/innobase/include/btr0pcur.h:870
#11 btr_pcur_t::move_to_next (mtr=0x7f6c58d7af60, this=0x7f6c170fd2c8) at /ci/workspace/hwsql/storage/innobase/include/btr0pcur.h:859
#12 row_search_mvcc (buf=buf@entry=0x7f6c25645138 "\360^I\004", mode=PAGE_CUR_G, mode@entry=PAGE_CUR_UNSUPP, prebuilt=0x7f6c170fd088, match_mode=match_mode@entry=0, direction=direction@entry=1) at /ci/workspace/hwsql/storage/innobase/row/row0sel.cc:5831
#13 0x0000560505632448 in ha_innobase::general_fetch (this=0x7f6c718b4038, buf=0x7f6c25645138 "\360^I\004", direction=1, match_mode=0) at /ci/workspace/hwsql/storage/innobase/handler/ha_innodb.cc:10721
#14 0x00005605056421cf in non-virtual thunk to ha_innopart::rnd_next_in_part(unsigned int, unsigned char*) () at /ci/workspace/hwsql/storage/innobase/que/que0que.cc:1076
#15 0x0000560503e6f6cb in Partition_helper::ph_rnd_next (this=0x7f6c718b58f0, buf=0x7f6c25645138 "\360^I\004") at /ci/workspace/hwsql/sql/partitioning/partition_handler.cc:1550
#16 0x0000560504262252 in handler::ha_rnd_next (this=0x7f6c718b4038, buf=0x7f6c25645138 "\360^I\004") at /ci/workspace/hwsql/sql/handler.cc:2993
#17 0x00005605043e83d1 in TableScanIterator::Read (this=0x7f6c33fb85e8) at /ci/workspace/hwsql/sql/iterators/row_iterator.h:245
#18 0x0000560504587694 in FilterIterator::Read (this=0x7f6c33fb8630) at /opt/hw/gcc-10.3/include/c++/10/bits/unique_ptr.h:421
#19 0x000056050407c473 in Query_expression::ExecuteIteratorQuery (this=0x7f6c3c59a828, thd=0x7f6cca0b5800) at /opt/hw/gcc-10.3/include/c++/10/bits/unique_ptr.h:421
#20 0x000056050407c7fc in Query_expression::execute (this=<optimized out>, thd=0x7f6cca0b5800) at /ci/workspace/hwsql/sql/sql_union.cc:1823
#21 0x0000560503ffc8af in Sql_cmd_dml::execute (this=0x7f6c3c59dc80, thd=0x7f6cca0b5800) at /ci/workspace/hwsql/sql/sql_select.cc:583
#22 0x0000560503f8cf36 in mysql_execute_command (thd=0x7f6cca0b5800, first_level=<optimized out>) at /ci/workspace/hwsql/sql/sql_parse.cc:5104
#23 0x0000560503f91f50 in dispatch_sql_command (thd=0x7f6cca0b5800, parser_state=<optimized out>) at /ci/workspace/hwsql/sql/sql_parse.cc:5756
#24 0x0000560503f934cb in dispatch_command (thd=0x7f6cca0b5800, com_data=<optimized out>, command=COM_QUERY) at /ci/workspace/hwsql/sql/sql_parse.cc:2251
#25 0x0000560503f95367 in do_command (thd=thd@entry=0x7f6cca0b5800) at /ci/workspace/hwsql/sql/sql_parse.cc:1475

in frame 9, we can see, the next block is valid, but the current block is a compressed block.

(gdb) p *next_block
$69 = {page = {id = {m_space = 2345, m_page_no = 13}, size = {m_physical = 16384, m_logical = 16384, m_is_compressed = 0}, buf_fix_count = {<std::atomic<unsigned int>> = {<std::__atomic_base<unsigned int>> = {static _S_alignment = 4, _M_i = 1}, ...}}

current block is a compressed block.

(gdb) p *m_btr_cur->page_cur->block
$73 = {page = {id = {m_space = 51, m_page_no = 75}, size = {m_physical = 2048, m_logical = 16384, m_is_compressed = 1}, buf_fix_count = {<std::atomic<unsigned int>> = {<std::__atomic_base<unsigned int>> = {static _S_alignment = 4, _M_i = 1}, ...}

but from the m_block_when_stored, we can see current block should be (2345, 14)

(gdb) p *this
$75 = {m_btr_cur = {index = 0x7f6c9a87f488, page_cur = {index = 0x0, rec = 0x7f6cbca90070 "supremum\354\201\003", offsets = 0x0, block = 0x7f6cb8f4b100}, purge_node = 0x0, left_block = 0x0, thr = 0x7f6c170fda88, flag = BTR_CUR_BINARY, tree_height = 2, up_match = 0,
    up_bytes = 2, low_match = 0, low_bytes = 2, ahi = {prefix_info = {n_bytes = 0, n_fields = 0, left_side = false}, ahi_hash_value = 0}, path_arr = 0x0, rtr_info = 0x0, m_own_rtr_info = false, m_fetch_mode = Page_fetch::NORMAL}, m_latch_mode = 1, m_old_stored = false,
  m_old_rec = 0x7f6c8db41236 "\200\004I^\200\001\344lY", m_old_n_fields = 2, m_rel_pos = BTR_PCUR_ON, m_block_when_stored = {m_block = 0x7f6cb8f4b100, m_page_id = {m_space = 2345, m_page_no = 14}}, m_modify_clock = 150, m_pos_state = BTR_PCUR_IS_POSITIONED,
  m_search_mode = PAGE_CUR_G, m_trx_if_known = 0x0, m_old_rec_buf = 0x7f6c8db41230 "", m_buf_size = 14, m_read_level = 0, import_ctx = 0x0}

I get the page (2345, 14) from buffer pool, it's a valid page, but it's buf_fix_count is zero.

(gdb) p *(buf_block_t *) 0x7f6cb8fa3a60
$77 = {page = {id = {m_space = 2345, m_page_no = 14}, size = {m_physical = 16384, m_logical = 16384, m_is_compressed = 0}, buf_fix_count = {<std::atomic<unsigned int>> = {<std::__atomic_base<unsigned int>> = {static _S_alignment = 4, _M_i = 0}, ...}

space_id in page(2345, 14)
(gdb) mach_read_from_4 0x7f6cbd9f8022
2345
(gdb) mach_read_from_4 0x7f6cbd9f8004
14
next page no in page(2345, 14)
(gdb) mach_read_from_4 0x7f6cbd9f800c
13

Continue see content in mtr's stack

(gdb) p  *(mtr_memo_slot_t *)(mtr->m_impl.m_memo.m_first_block.m_data)
$63 = {object = 0x7f6cb8f4b100, type = 1}
(gdb) p  *(mtr_memo_slot_t *)(mtr->m_impl.m_memo.m_first_block.m_data+16)
$64 = {object = 0x7f6cb8fec270, type = 1}
(gdb) p (buf_block_t *)0x7f6cb8f4b100
$65 = (buf_block_t *) 0x7f6cb8f4b100
(gdb) p *(buf_block_t *)0x7f6cb8f4b100
$66 = {page = {id = {m_space = 51, m_page_no = 75}, size = {m_physical = 2048, m_logical = 16384, m_is_compressed = 1}, buf_fix_count = {<std::atomic<unsigned int>> = {<std::__atomic_base<unsigned int>> = {static _S_alignment = 4, _M_i = 1}, ...}
(gdb) p *(buf_block_t *)0x7f6cb8fec270
$67 = {page = {id = {m_space = 2345, m_page_no = 13}, size = {m_physical = 16384, m_logical = 16384, m_is_compressed = 0}, buf_fix_count = {<std::atomic<unsigned int>> = {<std::__atomic_base<unsigned int>> = {static _S_alignment = 4, _M_i = 1},...}

we can see the mtr fixed an unexpected page. Now, I'm a little suspect that compressed page corrupted the other page in the buffer pool.
[29 Apr 7:46] biao li
the page on disk is not corrupted, when restart mysqld and execute the same query, it's ok.
[29 Apr 10:20] MySQL Verification Team
Hi Mr. Li,

Thank you so much for the bug report.

However, this is a forum for the bug with repeatable test cases. These test cases should consist of the set of SQL statements that always lead to the problem that is reported.

You have not supplied us with such a test case.

However, we get many reports like yours and they are all caused by the problems that are unrelated to our products.

Simply, commodity hardware has many transient errors , especially in RAM, so you get crashes in different places in code. These crashes are mostly caused by some RAM module changing a value for a single bit, for a very, very short period of time.

These errors are transient and due to that fact, they can NOT be diagnosed with hardware checking tools.

That is why we suggested that all production installation use ECC RAM, with 2 bits checking and 1 bit correcting.

Can't repeat.