MySQL Bugs: #117448: When a large number of undo logs accumulate, reading the undo logs during the undo purge process may become a bottleneck

Bug #117448	When a large number of undo logs accumulate, reading the undo logs during the undo purge process may become a bottleneck
Submitted:	12 Feb 9:22	Modified:	12 Feb 14:16
Reporter:	jie xu	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: InnoDB storage engine	Severity:	S5 (Performance)
Version:	8.0	OS:	Any
Assigned to:		CPU Architecture:	Any
Tags:	undo log purge

Description:
In our production environments, we sometimes encounter the following situation: A particularly old transaction may cause a large accumulation of undo logs, and the length of the history list might reach hundreds of millions.

At this point, when this old transaction is killed, the undo log starts to purge, but the speed of the undo log purge may not be very fast. Further investigation reveals that this is because there may be multiple threads purging undo records, but only the purge coordinator thread is responsible for reading undo log records.

In this scenario, the undo logs that need to be purged are mostly no longer in the buffer pool and must be read from disk. Since only the purge coordinator thread reads the undo log records, this can easily become a bottleneck.

This situation becomes even more severe when the instance is deployed on cloud storage, as the latency of reading data from cloud storage is higher, exacerbating the problem.

How to repeat:
First, start a read-write transaction A, update a few rows of data, and then leave it open without committing.

Next, using a separate set of connections with autocommit=on, aggressively update and delete data while monitoring the size of the undo log space. Once the space occupied by the undo logs becomes significantly larger than the buffer pool (for example, when the undo log size is already 10 times that of the buffer pool).stop the updates and deletions, and commit transaction A at this point.

Now, observe the behavior of the undo purge process. At this stage, it is easy to notice that the speed at which a single purge coordinator thread reads the undo logs falls far short of the system's I/O capacity limit.

Suggested fix:
We can attempt to modify the process of reading undo log records for purging into a multi-threaded form. Currently, the purge process works as follows: First, the purge coordinator thread reads a batch of undo log records from transactions in the history list in a globally ordered manner. It then distributes these undo log records to the purge worker threads. After the purge worker threads complete the purge, the purge coordinator thread reads the next batch of undo log records and continues the process.

Our idea is to change it as follows: The purge coordinator thread fetches the oldest rollback segment (the "oldest" refers to the smallest trx_no) from the purge_queue each time, and then assigns it to the purge worker threads sequentially. After a worker thread receives this rollback segment, it reads a batch of undo log records in the order of the history list of that rollback segment. Then, the worker thread puts the rollback segment back into the purge queue and proceeds to purge the batch of undo log records on its own. In this way, the process of reading undo log records would be handled by multiple threads.

One potential issue with this approach is that the process of purging undo log records may no longer follow the global order of the history list. However, after reviewing the code, we believe this should not cause any problems. One possible outcome is that undo log records from the same table might be assigned to different purge worker threads, which could potentially increase lock contention.

Hi Mr. xu,

Thank you for your performance improvement request.

We find it reasonable.

Verified as reported.