| Bug #90587 | InnoDB: Assertion failure in thread 47255404480256 in file buf0lru.cc line 2213 | ||
|---|---|---|---|
| Submitted: | 24 Apr 2018 2:37 | Modified: | 10 Feb 2020 13:05 |
| Reporter: | raolh rao | Email Updates: | |
| Status: | Can't repeat | Impact on me: | |
| Category: | MySQL Server: InnoDB storage engine | Severity: | S3 (Non-critical) |
| Version: | 5.7.21 | OS: | Any |
| Assigned to: | CPU Architecture: | Any | |
[24 Apr 2018 15:32]
MySQL Verification Team
Hi, Thank you for your bug report. We do have several additional questions for you .... * This looks more like an assert then a simple crash. Can you check it out * Could you try to repeat this with debug binary as we are missing lot's of data ?? * Do you use data encryption for InnoDB tables ?? * Do you use compression for the same table ?? * Could you be more specific as to how could that race condition occur ??? * This could be simple glitch in RAM, since thread number is likely impossible to reach. What was the uptime that would make server reach such a huge number ... * Are you using our binaries ??? * Could you locate the DELETE that caused this ??? * If you could repeat this with debug binary and core dumping enabled, we could find out how could this occur ....... Many thanks in advance ......
[25 May 2018 1:00]
Bugs System
No feedback was provided for this bug for over a month, so it is being suspended automatically. If you are able to provide the information that was originally requested, please do so and change the status of the bug back to "Open".
[10 Dec 2019 15:35]
Shu Lin
I hit the same issue on MySQL 8.0.18. This is the commit:
commit 91a17cedb1ee880fe7915fb14cfd74c04e8d6588
Author: mysql-builder@oracle.com <>
Date: Fri Sep 20 10:09:56 2019 +0200
The assert ut_a(bpage->buf_fix_count == 0) fails:
1983 static bool buf_LRU_block_remove_hashed(buf_page_t *bpage, bool zip,
1984 bool ignore_content) {
1985 const buf_page_t *hashed_bpage;
1986 buf_pool_t *buf_pool = buf_pool_from_bpage(bpage);
1987 rw_lock_t *hash_lock;
1988
1989 ut_ad(mutex_own(&buf_pool->LRU_list_mutex));
1990 ut_ad(mutex_own(buf_page_get_mutex(bpage)));
1991
1992 hash_lock = buf_page_hash_lock_get(buf_pool, bpage->id);
1993
1994 ut_ad(rw_lock_own(hash_lock, RW_LOCK_X));
1995
1996 ut_a(buf_page_get_io_fix(bpage) == BUF_IO_NONE);
1997 ut_a(bpage->buf_fix_count == 0);
The code path is the following:
buf_read_page_low->buf_page_io_complete->buf_read_page_handle_error->buf_LRU_free_one_page->buf_LRU_block_remove_hashed.
The assert fails because there is another thread wants to read the same page and put a fix_count on the page
3453 dberr_t Buf_fetch_normal::get(buf_block_t *&block) {
3454 /* Keep this path as simple as possible. */
3455 for (;;) {
3456 /* Lookup the page in the page hash. If it doesn't exist in the
3457 buffer pool then try and read it in from disk. */
3458
3459 block = lookup();
3460
3461 if (block != nullptr) {
3462 buf_block_fix(block);
3463
3464 /* Now safe to release page_hash S lock. */
3465 rw_lock_s_unlock(m_hash_lock);
3466 break;
3467 }
3468
3469 /* Page not in buf_pool: needs to be read from file */
3470 read_page();
3471 }
How to repeat:
Short of encouting an actual page corruption, you will need to use debugger to force the code into buf_page_io_complete->buf_read_page_handle_error
When thread A finishes reading a page and enters buf_page_io_complete(), use debugger to force "is_corrupted" to true. Suspend thread A
5182 /* From version 3.23.38 up we store the page checksum
5183 to the 4 first bytes of the page end lsn field */
5184 bool is_corrupted;
5185 {
5186 BlockReporter reporter =
5187 BlockReporter(true, frame, bpage->size,
5188 fsp_is_checksum_disabled(bpage->id.space()));
5189 is_corrupted = reporter.is_corrupted();
5190 }
5191
5192 if (compressed_page || is_corrupted) {
Let thread B read the same page, and suspect thread B after it adds the fix_count in Buf_fetch_normal::get()
Let thread A continue, it will hit the assert failure.
[10 Dec 2019 16:21]
MySQL Verification Team
We can not accept changing values of variables in the runtime through debugger. That way you can prove anything.
[10 Dec 2019 17:02]
Shu Lin
If you don't want to force the value in debugger, then you can instead try getting into buf_page_io_complete->buf_read_page_handle_error() code path using other methods. For example, intentionly corrupt the page on disk.
[11 Dec 2019 13:39]
MySQL Verification Team
Hi, Still no good enough. Intentionally corrupting the page would only prove that you can force MySQL server to assert if you edit the file in hex editor or similar. We need a fully repeatable test case, which entails a set of SQL statements that always lead to the reported behaviour.
[8 Feb 2020 16:43]
MySQL Verification Team
if you encounter this assertion, can anybody remember if they had large-pages enabled for MySQL ?

Description: When server crash happened, functions call stack is like this: /u01/mysql/bin/mysqld[0x7733bd] /u01/mysql/bin/mysqld[0x113c302] /u01/mysql/bin/mysqld(_Z21buf_LRU_free_one_pageP10buf_page_t+0x17b)[0x113ebab] /u01/mysql/bin/mysqld[0x1145576] /u01/mysql/bin/mysqld(_Z13buf_read_pageRK9page_id_tRK11page_size_tP5trx_t+0x2d3)[0x11476c3] /u01/mysql/bin/mysqld(_Z16buf_page_get_genRK9page_id_tRK11page_size_tmP11buf_block_tmPKcmP5mtr_tb+0x3fb)[0x111997b] /u01/mysql/bin/mysqld[0x10ff8ec] /u01/mysql/bin/mysqld(_Z32btr_copy_externally_stored_fieldPmPKhRK11page_size_tmP16mem_block_info_t+0x197)[0x10ffe47] /u01/mysql/bin/mysqld[0x10509a2] /u01/mysql/bin/mysqld[0x1053ba8] /u01/mysql/bin/mysqld(_Z15row_search_mvccPh15page_cur_mode_tP14row_prebuilt_tmm+0x1d2b)[0x105674b] /u01/mysql/bin/mysqld(_ZN11ha_innobase10index_readEPhPKhj16ha_rkey_function+0x29f)[0xf5302f] /u01/mysql/bin/mysqld(_ZN11ha_innobase11index_firstEPh+0x30)[0xf44d60] /u01/mysql/bin/mysqld(_ZN7handler14ha_index_firstEPh+0x164)[0x7f6264] /u01/mysql/bin/mysqld(_ZN7handler16read_range_firstEPK12st_key_rangeS2_bb+0xbc)[0x7fa4ac] /u01/mysql/bin/mysqld(_ZN7handler21multi_range_read_nextEPPc+0x99)[0x7efb19] /u01/mysql/bin/mysqld(_ZN18QUICK_RANGE_SELECT8get_nextEv+0x4b)[0xdc216b] /u01/mysql/bin/mysqld[0xbfa26a] /u01/mysql/bin/mysqld(_ZN14Sql_cmd_delete12mysql_deleteEP3THDy+0x1065)[0xdfb895] /u01/mysql/bin/mysqld(_ZN14Sql_cmd_delete7executeEP3THD+0xbf)[0xdfc03f] /u01/mysql/bin/mysqld(_Z21mysql_execute_commandP3THDb+0xa71)[0xc8a9d1] /u01/mysql/bin/mysqld(_Z11mysql_parseP3THDP12Parser_state+0x395)[0xc907c5] /u01/mysql/bin/mysqld(_Z16dispatch_commandP3THDPK8COM_DATA19enum_server_command+0x92a)[0xc911aa] /u01/mysql/bin/mysqld(_Z10do_commandP3THD+0x177)[0xc92ad7] /u01/mysql/bin/mysqld(_Z26threadpool_process_requestP3THD+0xcf)[0xd3525f] /u01/mysql/bin/mysqld[0xd4627e] /u01/mysql/bin/mysqld(pfs_spawn_thread+0x1b1)[0xed5581] /lib64/libpthread.so.0(+0x7dc5)[0x2af8fdcbadc5] /lib64/libc.so.6(clone+0x6d)[0x2af8ff021ced] which means that when trying to flush a page out of buffer pool, the page is still accessed by a thread (buf_fix_count > 0). I think that is a race condition on bpage->buf_fix_count,in this case one functions call stack is like this: buf_read_page_handle_error -> buf_LRU_free_one_page -> buf_LRU_block_remove_hashed “buf_read_page_handle_error” first assert condition "ut_ad(bpage->buf_fix_count == 0)," , then "buf_LRU_block_remove_hashed" assert again "ut_a(bpage->buf_fix_count == 0);", but I can't find out any lock protection can stop other thread to access the same page. e.g. during buf_read_page_handle_error's assert, it holds bufferpool mutex and page mutex, but it release page mutex before enter function buf_LRU_free_one_page. while other thread could be in this functions call stack buf_page_get_gen -> buf_pool_watch_set -> buf_block_fix when calling buf_pool_watch_set in buf_page_get_gen, it hold hash mutex and page mutex, but don't hold bufferpool lock. So, once a thread in buf_read_page_handle_error release page mutex before entering buf_LRU_free_one_page (only hold buffer pool mutex, and no hash lock and page lock), another thread could access same page in buf_page_get_gen->buf_pool_watch_set ( hold hash mutex and page mutex, and no buffer pool mutex), and increase page's buf_fix_count , which leading assert fail in buf_LRU_block_remove_hashed's "ut_a(bpage->buf_fix_count == 0);" of other thread. That is one case I can figure out, maybe there are some other cases. I find page's buf_fix_count is changed atomic, but I wonder what can promise that when a page is flushed out from buffer pool, it cann't be accessing by other thread at the same time. How to repeat: Hard to repeat Suggested fix: No