MySQL Bugs: #97136: racing condition over buf_chunk_map

Bug #97136	racing condition over buf_chunk_map_reg during buffer pool resizing
Submitted:	8 Oct 2019 4:56	Modified:	8 Oct 2019 22:40
Reporter:	Chen Fu	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: InnoDB storage engine	Severity:	S3 (Non-critical)
Version:	8.0	OS:	Any
Assigned to:		CPU Architecture:	Any
Tags:	buffer pool

Description:

buf_chunk_map_reg is a global variable, pointing to a chunk map to facilitate reverse lookup: from inside a page frame to its block descriptor. When a consistent non-locking scan thread encounters a corrupted REDO log, this map is queried to assist error logging.
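To make the reverse lookup concrete, here is a minimal sketch of the idea, not the InnoDB source. `Block`, `Chunk`, `ChunkMap`, and `block_from_ptr` are hypothetical, simplified names used only for illustration. The sketch assumes the map is keyed by each chunk's first frame address, so an ordered-map search finds the chunk that contains a given pointer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>

// Hypothetical stand-in for a block descriptor.
struct Block {
  std::size_t index;
};

// Hypothetical stand-in for a buffer pool chunk.
struct Chunk {
  const std::uint8_t *frames;  // start of the chunk's frame memory
  std::size_t n_blocks;        // number of page frames in the chunk
  std::size_t page_size;       // size of one page frame in bytes
  Block *blocks;               // descriptors, one per frame
};

// Assumption: the map is keyed by the first frame address of each chunk.
using ChunkMap = std::map<const std::uint8_t *, Chunk *>;

// Returns the descriptor of the block whose frame contains ptr,
// or nullptr if ptr does not fall inside any registered chunk.
inline Block *block_from_ptr(const ChunkMap &map, const std::uint8_t *ptr) {
  assert(ptr != nullptr);
  auto it = map.upper_bound(ptr);  // first chunk that starts after ptr
  if (it == map.begin()) {
    return nullptr;  // map is empty, or ptr precedes every chunk
  }
  --it;  // last chunk that starts at or before ptr
  const Chunk *chunk = it->second;
  const std::uintptr_t offset = reinterpret_cast<std::uintptr_t>(ptr) -
                                reinterpret_cast<std::uintptr_t>(chunk->frames);
  const std::size_t idx = static_cast<std::size_t>(offset / chunk->page_size);
  if (idx >= chunk->n_blocks) {
    return nullptr;  // ptr lies past the end of this chunk
  }
  return &chunk->blocks[idx];
}
```

In this sketch an empty map makes every lookup return nullptr, which corresponds to the failed reverse query described below.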

During buffer pool resizing, this map is deleted after locking down all buffer pool instances. However, no page latch is obtained. Right after deleting the map, buf_chunk_map_reg points to a newly created empty map.

At the same time, there may be a thread reading a REDO log that subsequently traverses this map. This race condition may cause the scanning thread to follow either a dangling pointer, which may cause a core dump, or an empty map, in which case the reverse query fails.

How to repeat:
Bug discovered via code review. Triggering it requires intricate timing between the resizing thread and the scanning thread, so it is difficult to reproduce.

Suggested fix:
During buffer pool resizing, if we are increasing its size, allocating new chunks does not have to happen inside the critical section. Likewise, if we are shrinking the buffer pool, after the withdraw target is met, the chunk memory can be released without locking.

If the above operations are moved before we lock down all buffer pool instances, a parallel chunk map can be created and populated while we are allocating or deallocating chunks. When we have finished populating this parallel chunk map, we can atomically replace the original global chunk map with the new one. Deletion of the old map should be delayed until it is safe to do so.
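One possible shape for this swap-and-defer scheme is sketched below. It is an illustration only, not a patch against InnoDB. `SimpleChunkMap` and `ChunkMapRegistry` are hypothetical names, and the technique shown is a mutex-guarded `std::shared_ptr` swap with reference-counted reclamation, chosen for portability rather than because the server uses it. The replacement map is fully populated before it is published, and the old map is freed only when the last reader drops its snapshot.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <memory>
#include <mutex>
#include <utility>

// Hypothetical simplified map: chunk start address -> chunk id.
using SimpleChunkMap = std::map<std::uintptr_t, int>;

// Hypothetical holder for the currently published chunk map.
class ChunkMapRegistry {
 public:
  ChunkMapRegistry() : current_(std::make_shared<SimpleChunkMap>()) {}

  // Readers take a snapshot. The map they receive stays alive for as
  // long as they hold the returned pointer, even if a resize publishes
  // a replacement in the meantime.
  std::shared_ptr<const SimpleChunkMap> snapshot() const {
    std::lock_guard<std::mutex> guard(mutex_);
    return current_;
  }

  // The resizing thread builds `next` outside any critical section and
  // publishes it here. The lock is held only for the pointer swap.
  void publish(std::shared_ptr<const SimpleChunkMap> next) {
    assert(next != nullptr);
    std::shared_ptr<const SimpleChunkMap> old;
    {
      std::lock_guard<std::mutex> guard(mutex_);
      old = std::move(current_);
      current_ = std::move(next);
    }
    // `old` is released here, outside the lock. The old map is destroyed
    // only once no reader still holds a snapshot of it.
  }

 private:
  mutable std::mutex mutex_;
  std::shared_ptr<const SimpleChunkMap> current_;
};
```

A reader would call `snapshot()` once and perform its lookup on the returned map. It therefore never observes a half-populated map, and never follows a pointer into a map that has already been deleted.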

Hi Mr. Fu,

Thank you for your bug report.

However, your report is quite unclear.

First of all, we do not see which 8.0 release you are using. If you are not using the latest, take a look at the code of 8.0.18, which should be out relatively soon.

Second, a page latch does not need to be taken, since the entire buffer pool is locked.

Third, the thread that scans the REDO log takes a lock, on which the buffer pool locking has to wait.

Fourth, you claim that you experienced a crash during such operations, but we do not see any evidence of it, nor do we see a printout from any assert.

Please, provide us with all required feedback.

I made a mistake. When disabling the AHI, btr_search_x_lock_all waits for all AHI search threads to finish. Since buf_chunk_map_reg is only used by the AHI search threads, there will not be any overlap.

Thank you for your feedback.