Bug #113187 replica SQL thread purge applied relay logs blocked commit on source node
Submitted: 22 Nov 2023 16:24 Modified: 26 Nov 2023 6:13
Reporter: Vitale Jiang Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: Replication Severity:S5 (Performance)
Version:8.0 and above OS:Any
Assigned to: CPU Architecture:Any
Tags: performance hang, replication, semisync

[22 Nov 2023 16:24] Vitale Jiang
Description:
We have received some complaints from time-critical biz developers, they found performance impact while committing transaction, after inquiry into commit time cost, we noticed that sessions were blocked on waiting for semi-sync ACK, at that time binlog rotated just hundred milliseconds ago.

Then we found while SQL thread purge relay logs, IO thread was waiting for log_space_lock:
    queue_event 
        -> write_buffer
        -> after_write_to_relay_log
        -> harvest_bytes_written (mysql_mutex_lock(&rli->log_space_lock)) <<<

And we also found replica SQL thread require log_space_lock before purging applied relay logs, which happened after read EOF of relay log:
    read_next_event 
        -> move_to_next_log 
        -> purge_applied_logs 
        -> purge_logs (mysql_mutex_lock(&m_rli->log_space_lock)) <<<
        -> purge_index_entry 
        -> mysql_file_delete (cost 100+ms) <<<

How to repeat:
1. Configure semisync replication between two nodes: 
Source --- semisync ---> Replica

2. Prepare sysbench tables (source: 127.0.0.1:3306)
yum install -y sysbench
sysbench --mysql-host=127.0.0.1--mysql-port=3306 --mysql-user=sysbench --mysql-password=sysbench --mysql-db=sysbench --table_size=1000000 --tables=100 --time=900     --threads=32 /usr/share/sysbench/oltp_write_only.lua prepare

3. Sysbench source node
sysbench --mysql-host=127.0.0.1--mysql-port=3306 --mysql-user=sysbench --mysql-password=sysbench --mysql-db=sysbench --table_size=1000000 --tables=100 --time=900 --threads=32 --report-interval=1 --percentile=100 /usr/share/sysbench/oltp_update_index.lua run

4. Watch the latency we would see max latency surged once relay log purged on replica.

Suggested fix:
The problem is SQL thread holding log_space_lock until all applied relay logs been deleted, which cost around 100ms each file even on SSD, if we just merely acquire log_space_lock during increasing log_space_total, IO thread won't be blocked so long.
[22 Nov 2023 16:31] Vitale Jiang
Just merely acquire log_space_lock during increasing log_space_total instead of deleting relay log

Attachment: patch.txt (text/plain), 1.97 KiB.

[23 Nov 2023 21:38] MySQL Verification Team
Thank you very much for your patch contribution, we appreciate it!

In order for us to continue the process of reviewing your contribution to MySQL, please send us a signed copy of the Oracle Contributor Agreement (OCA) as outlined in https://oca.opensource.oracle.com

Signing an OCA needs to be done only once and it's valid for all other Oracle governed Open Source projects as well.

Getting a signed/approved OCA on file will help us facilitate your contribution - this one, and others in the future.  

Please let me know, if you have any questions.

Thank you for your interest in MySQL.
[25 Nov 2023 15:07] Vitale Jiang
Thanks, I've submitted OCA.
[29 May 2024 10:08] MySQL Verification Team
This bug has a new patch found in the following report:

https://bugs.mysql.com/bug.php?id=115159
[6 Jun 2024 3:40] Vitale Jiang
As a workaround, maybe we can set relay_log_space_limit = 0 when startup, since it's a read only variable.