MySQL Bugs: #113187: replica SQL thread purge applied relay logs blocked commit on source node

Bug #113187	replica SQL thread purge applied relay logs blocked commit on source node
Submitted:	22 Nov 2023 16:24	Modified:	26 Nov 2023 6:13
Reporter:	Vitale Jiang	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: Replication	Severity:	S5 (Performance)
Version:	8.0 and above	OS:	Any
Assigned to:		CPU Architecture:	Any
Tags:	performance hang, replication, semisync

Description:
We have received some complaints from time-critical biz developers, they found performance impact while committing transaction, after inquiry into commit time cost, we noticed that sessions were blocked on waiting for semi-sync ACK, at that time binlog rotated just hundred milliseconds ago.

Then we found while SQL thread purge relay logs, IO thread was waiting for log_space_lock:
    queue_event 
        -> write_buffer
        -> after_write_to_relay_log
        -> harvest_bytes_written (mysql_mutex_lock(&rli->log_space_lock)) <<<

And we also found replica SQL thread require log_space_lock before purging applied relay logs, which happened after read EOF of relay log:
    read_next_event 
        -> move_to_next_log 
        -> purge_applied_logs 
        -> purge_logs (mysql_mutex_lock(&m_rli->log_space_lock)) <<<
        -> purge_index_entry 
        -> mysql_file_delete (cost 100+ms) <<<

How to repeat:
1. Configure semisync replication between two nodes: 
Source --- semisync ---> Replica

2. Prepare sysbench tables (source: 127.0.0.1:3306)
yum install -y sysbench
sysbench --mysql-host=127.0.0.1--mysql-port=3306 --mysql-user=sysbench --mysql-password=sysbench --mysql-db=sysbench --table_size=1000000 --tables=100 --time=900     --threads=32 /usr/share/sysbench/oltp_write_only.lua prepare

3. Sysbench source node
sysbench --mysql-host=127.0.0.1--mysql-port=3306 --mysql-user=sysbench --mysql-password=sysbench --mysql-db=sysbench --table_size=1000000 --tables=100 --time=900 --threads=32 --report-interval=1 --percentile=100 /usr/share/sysbench/oltp_update_index.lua run

4. Watch the latency we would see max latency surged once relay log purged on replica.

Suggested fix:
The problem is SQL thread holding log_space_lock until all applied relay logs been deleted, which cost around 100ms each file even on SSD, if we just merely acquire log_space_lock during increasing log_space_total, IO thread won't be blocked so long.

Just merely acquire log_space_lock during increasing log_space_total instead of deleting relay log

Attachment: patch.txt (text/plain), 1.97 KiB.

Thank you very much for your patch contribution, we appreciate it!

In order for us to continue the process of reviewing your contribution to MySQL, please send us a signed copy of the Oracle Contributor Agreement (OCA) as outlined in https://oca.opensource.oracle.com

Signing an OCA needs to be done only once and it's valid for all other Oracle governed Open Source projects as well.

Getting a signed/approved OCA on file will help us facilitate your contribution - this one, and others in the future.  

Please let me know, if you have any questions.

Thank you for your interest in MySQL.

Thanks, I've submitted OCA.

This bug has a new patch found in the following report:

https://bugs.mysql.com/bug.php?id=115159

As a workaround, maybe we can set relay_log_space_limit = 0 when startup, since it's a read only variable.