Description:
This bug is hard to reproduce. I was not able to reproduce it in a development environment; it happened several times in production on version 5.7. However, the code and core dump analysis show the problem very clearly.
Additionally, I see that the same can happen on 5.6 and 8.0, as the surrounding code and execution path are very similar.
How to repeat:
Detailed problem description:
The IO subsystem allows several IO read requests to be buffered, and once the upper layer finishes buffering, the whole batch can be submitted to the kernel. This is used by linear read-ahead of pages.
Buffering is done in AIO::linux_dispatch().
Buffering stage:
1. buf0rea.cc, buf_read_ahead_linear(): the linear read-ahead initiated by InnoDB calls buf_read_page_low() with 'should_buffer' == true. If we are inside an ibuf routine, ibuf_mode = BUF_READ_IBUF_PAGES_ONLY.
2. buf0rea.cc, buf_read_page_low() calls _fil_io() with should_buffer == true
3. fil0fil.cc, _fil_io() calls os_aio() with should_buffer == true and mode = OS_AIO_IBUF;
4. os0file.h, os_aio() -> pfs_os_aio_func(). This is a macro.
5. os0file.ic, pfs_os_aio_func() calls os_aio_func()
6. os0file.cc, os_aio_func() reserves an IO slot (AIO::select_slot_array(mode = OS_AIO_IBUF), then array->reserve_slot()).
7. Calling AIO::select_slot_array() with mode == OS_AIO_IBUF returns AIO::s_ibuf (not AIO::s_reads); then array->linux_dispatch() is called with the reserved slot and should_buffer == true.
8. The request is buffered in m_pending and m_count[0]; the IO is NOT submitted to the kernel.
9. The above repeats for all reads that are to be buffered.
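The buffering path above can be illustrated with a minimal standalone C++ sketch (a simplified model only; the class and member names merely mirror the ones in the report, and the real InnoDB code is more involved):

// Simplified, self-contained model of the buffering path (not InnoDB code).
#include <cstddef>
#include <vector>

struct Slot { int fd; void* buf; std::size_t len; };

enum aio_mode { OS_AIO_NORMAL, OS_AIO_IBUF };

class AIO {
 public:
  // Mirrors AIO::select_slot_array(): IBUF mode selects s_ibuf, not s_reads.
  static AIO* select_slot_array(aio_mode mode) {
    return (mode == OS_AIO_IBUF) ? s_ibuf : s_reads;
  }

  // Mirrors AIO::linux_dispatch(): with should_buffer == true the request is
  // only remembered in m_pending; nothing is handed to the kernel here.
  void linux_dispatch(const Slot& slot, bool should_buffer) {
    if (should_buffer) {
      m_pending.push_back(slot);  // buffered, io_submit() is NOT called
      return;
    }
    submit_to_kernel(slot);       // non-buffered path submits immediately
  }

  std::vector<Slot> m_pending;    // buffered requests waiting for batch submit

  static AIO* s_reads;            // array used for ordinary reads
  static AIO* s_ibuf;             // array used for change-buffer (ibuf) reads

 private:
  void submit_to_kernel(const Slot&) { /* io_submit() in the real code */ }
};

AIO* AIO::s_reads = new AIO();
AIO* AIO::s_ibuf  = new AIO();

// Buffering stage as seen from the linear read-ahead when in IBUF mode:
void buffer_one_read_ahead_page(const Slot& slot) {
  AIO* array = AIO::select_slot_array(OS_AIO_IBUF);   // returns s_ibuf
  array->linux_dispatch(slot, /*should_buffer=*/true); // only buffered
}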
Submitting IO batch to be processed:
1. buf0rea.cc, when buf_read_ahead_linear() finishes buffering IO reads, it calls os_aio_dispatch_read_array_submit(), which resolves to AIO::os_aio_dispatch_read_array_submit_low(true).
2. os0file.cc, AIO::os_aio_dispatch_read_array_submit_low() iterates over the AIO::s_reads array and submits the buffered IOs.
3. When all are submitted, m_pending and m_count are cleared.
But we have buffered our IOs in AIO::s_ibuf (see above), so they are still pending and never submitted to the kernel.
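Continuing the same simplified model, the submit stage then looks roughly like this (the _model suffix marks it as a sketch, not the real function); it shows why requests buffered in s_ibuf are left behind:

// Continuation of the simplified model above (not the actual InnoDB code):
// the batch-submit routine flushes only s_reads, mirroring the behaviour of
// AIO::os_aio_dispatch_read_array_submit_low().
void os_aio_dispatch_read_array_submit_low_model() {
  AIO* array = AIO::s_reads;              // <-- only s_reads is considered
  for (const Slot& slot : array->m_pending) {
    (void)slot;                           // io_submit() the buffered request here
  }
  array->m_pending.clear();               // m_pending / m_count cleared

  // Requests buffered via OS_AIO_IBUF remain in AIO::s_ibuf->m_pending and
  // are never submitted; their pages stay io-fixed as BUF_IO_READ.
}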
In parallel with the linear read-ahead, the following happened, which caused the server to get stuck:
Connection 1:
1. A user connection executes an INSERT that causes a secondary index page split.
2. The insert into the index goes to the InnoDB change buffer.
3. After an optimistic insert attempt, we fall back to a pessimistic insert (btr_cur_pessimistic_insert()).
4. The pessimistic insert causes a new file page allocation for the insert buffer (btr_page_alloc_for_ibuf()).
5. The new page is pointed to by the first node of PAGE_BTR_IBUF_FREE_LIST. This page is either already in memory or read from the file; what is important is that this operation succeeds.
6. We need to mark the 2nd node of PAGE_BTR_IBUF_FREE_LIST as now being the 1st one, because we removed the 1st node from the list during allocation (flst_remove()).
7. The 2nd node is in a different file page than the 1st node was, so we need to load the 2nd node (call to fut_get_ptr()).
8. We are in buf_page_get_gen(). The 2nd node's page is not in memory, so we need to load it from the file.
9. We initiate a synchronous page read (call to buf_read_page()). srv_random_read_ahead == 0, so we do not start an async random read-ahead alongside it.
10. We call buf_wait_for_read(), where we expect the page transfer to have finished (it was synchronous), but the page's io_fix is still BUF_IO_READ and block->lock is locked, so we wait on it. The wait never completes, because the pending read for this page is one of the IOs that was buffered in AIO::s_ibuf and never submitted to the kernel (see the sketch after this list).
11. As a result, we are stuck while holding ibuf_mutex.
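Why step 10 never returns can be shown with a small standalone sketch (hypothetical names; this is not buf_wait_for_read() itself, only a model of its waiting behaviour):

// Standalone sketch (hypothetical names): the synchronous reader waits for the
// io_fix state to be cleared by IO completion, which requires the read to have
// been submitted to the kernel in the first place.
#include <atomic>
#include <chrono>
#include <thread>

enum io_fix_t { BUF_IO_NONE, BUF_IO_READ };

struct block_t {
  std::atomic<io_fix_t> io_fix{BUF_IO_READ};  // set when the read was initiated
};

void wait_for_read(block_t& block) {
  // The real buf_wait_for_read() blocks until the IO completion handler resets
  // io_fix. If the read was only buffered in s_ibuf and never submitted, no
  // completion ever arrives, so this loop (and the s-lock wait on block->lock)
  // never ends.
  while (block.io_fix.load() == BUF_IO_READ) {
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
  }
}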
Connection 2:
query: SHOW ENGINE INNODB STATUS
holds: srv_innodb_monitor_mutex, srv_printf_innodb_monitor() (waiter: Connection 3)
waits: ibuf_mutex, ibuf_print() (holder: Connection 1)
Connection 3:
query: show status like 'Uptime'
holds: LOCK_status, PFS_status_variable_cache::do_materialize_all() (waiter: Connection 4)
waits: srv_innodb_monitor_mutex, srv_export_innodb_status() (holder: Connection 2)
Connection 4 (all user connections):
waiting in THD::release_resources()
waits: LOCK_status (owned by Connection 3)
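The resulting wait chain can be reproduced in isolation with three plain mutexes (a standalone sketch, not server code; the mutex names only mirror the ones above, and Connection 1's never-ending IO wait is what keeps ibuf_mutex held forever):

// Standalone sketch of the wait chain (illustrative only).
#include <mutex>

std::mutex ibuf_mutex;                // held by Connection 1, stuck on the lost IO
std::mutex srv_innodb_monitor_mutex;  // taken by SHOW ENGINE INNODB STATUS
std::mutex LOCK_status;               // taken when materializing status variables

void connection_2_show_engine_innodb_status() {
  std::lock_guard<std::mutex> monitor(srv_innodb_monitor_mutex);
  std::lock_guard<std::mutex> ibuf(ibuf_mutex);   // blocks: Connection 1 never releases it
}

void connection_3_show_status_uptime() {
  std::lock_guard<std::mutex> status(LOCK_status);
  std::lock_guard<std::mutex> monitor(srv_innodb_monitor_mutex);  // blocks behind Connection 2
}

void connection_4_disconnect() {
  std::lock_guard<std::mutex> status(LOCK_status);  // blocks behind Connection 3
}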
Suggested fix:
Fix the dispatching of buffered IO transfers so that buffered slots from s_ibuf are submitted as well.
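One possible shape of such a fix, expressed against the simplified model above (a sketch under that assumption, not a patch for the real code), is to flush the buffered slots of both arrays on submit:

// Sketch of the suggested fix against the simplified model above:
// submit buffered slots from both s_reads and s_ibuf.
void os_aio_dispatch_read_array_submit_low_fixed() {
  AIO* arrays[] = { AIO::s_reads, AIO::s_ibuf };
  for (AIO* array : arrays) {
    for (const Slot& slot : array->m_pending) {
      (void)slot;             // io_submit() the buffered request in the real code
    }
    array->m_pending.clear(); // clear m_pending / m_count for this array
  }
}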