Bug #103943 Semi-Replication Rotate And Purge Relaylog Leads To Increased Response
Submitted: 8 Jun 2021 11:42 Modified: 24 Aug 2021 1:50
Reporter: chao gao Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: Replication Severity:S1 (Critical)
Version:5.7.33 OS:CentOS (7.6)
Assigned to: CPU Architecture:x86 (64)

[8 Jun 2021 11:42] chao gao
Description:
replication type:M--->S(semi-sync)
version:5.7.25-5.7.33 
max_binlog_size:1G
slave_parallel_type:logical_clock
relay_log_purge:ON
long_query_time:0.1
repl_semi_sync_master_wait_point:after_sync
repl_semi_sync_master_wait_for_slave_count:1
repl_semi_sync_master_wait_no_slave:on
repl_semi_sync_master_timeout=999999999
repl_semi_sync_master_enabled=on
repl_semi_sync_slave_enabled=on

wait_time(master wait slave ack ) will be greater than 200ms when the master rotate binlog.

How to repeat:
1.modify code
file:semisync_master.cc
wait_time = getWaitTime(start_ts);
sql_print_information("ack time is %llu", wait_time);  /*add by gaochao*/
make && make install

2.run sysbench until master rotate binlog,Observe the sysbench response and error log、slow log
[8 Jun 2021 11:49] chao gao
maybe this reason?
Replication: The value returned by a SHOW SLAVE STATUS statement for the total combined size of all existing relay log files (Relay_Log_Space) could become much larger than the actual disk space used by the relay log files. The I/O thread did not lock the variable while it updated the value, so the SQL thread could automatically delete a relay log file and write a reduced value before the I/O thread finished updating the value. The I/O thread then wrote its original size calculation, ignoring the SQL thread's update and so adding back the space for the deleted file. The Relay_Log_Space value is now locked during updates to prevent concurrent updates and ensure an accurate calculation. (Bug #26997096, Bug #87832)

5.7.24:rli->relay_log.harvest_bytes_written(&rli->log_space_total);
5.7.25:rli->relay_log.harvest_bytes_written(rli, true/*need_log_space_lock=true*/);

when i modify 5.7.25 true to false the Response is ok!!!
[9 Jun 2021 13:21] MySQL Verification Team
Hi,

With 5.7.24 you don't have this problem? only with 5.7.25+ ?

thanks
[10 Jun 2021 1:13] chao gao
YES,5.7.24 is ok.

This step is debug mysql5.7.32:
1.sql thread get the log_space_lock
......
2.mysql_file_delete(key_file_binlog, log_info.log_file_name, MYF(0)) function delete the relay_log file
......
3.sql thread release the log_space_lock

io thread can't get the log_space_lock from 1 to 3 and can't response ack to master
[10 Jun 2021 1:15] chao gao
5.7.24 is ok,this problem from 5.7.25 to newest
[10 Jun 2021 1:17] chao gao
I suggest release lock first then delete the relay_log file
[21 Jun 2021 5:55] MySQL Verification Team
Hi,

I think this is not a bug and that this is done intentionally but I will verify this behavior and give a dev team a chance to properly explain what they changed and why, or fix it, if it is really a bug.

Thanks for the report
[23 Jun 2021 6:10] chao gao
Thanks for your reply and look forward to your final verification result.
[21 Aug 2021 1:44] chao gao
&rli->log_space_lock is used to control the io/sql thread to update the relay_log_space_total,I think this lock holding range is too large, the holding lock can be divided into two stages, so as to avoid the useless lock when deleting the relaylog physical file:

source code(binlog.cc):
// rli->relay_log.purge_logs call MYSQL_BIN_LOG::purge_index_entry
1)
mysql_mutex_lock(&rli->log_space_lock);
rli->relay_log.purge_logs(to_purge_if_included, included,
                          false/*need_lock_index=false*/,
                          false/*need_update_threads=false*/,
                          &rli->log_space_total, true);
// Tell the I/O thread to take the relay_log_space_limit into account
rli->ignore_log_space_limit= 0;
mysql_mutex_unlock(&rli->log_space_lock);
2)
int MYSQL_BIN_LOG::purge_index_entry(THD *thd, ulonglong *decrease_log_space, bool need_lock_index)
{
......
mysql_file_delete(key_file_binlog, log_info.log_file_name, MYF(0))
......
  if (decrease_log_space)
    *decrease_log_space-= s.st_size;         // relay_log_space_total
......
}

suggest fix:
1)
rli->relay_log.purge_logs(to_purge_if_included, included,
                          false/*need_lock_index=false*/,
                          false/*need_update_threads=false*/,
                          &rli->log_space_total, true);
// Tell the I/O thread to take the relay_log_space_limit into account
mysql_mutex_lock(&rli->log_space_lock);
rli->ignore_log_space_limit= 0;
mysql_mutex_unlock(&rli->log_space_lock);

2)
int MYSQL_BIN_LOG::purge_index_entry(THD *thd, ulonglong *decrease_log_space, bool need_lock_index)
{
......
mysql_file_delete(key_file_binlog, log_info.log_file_name, MYF(0))   // Avoid holding locks,1G relaylog and this log in os cache delete will use about 100~200ms,if delete multiple relaylogs, this will be fatal in a semi-synchronous scenario
......
mysql_mutex_lock(&rli->log_space_lock);
  if (decrease_log_space)
    *decrease_log_space-= s.st_size;     // only hold the lock in relay_log_space_total update
mysql_mutex_unlock(&rli->log_space_lock);
......
}
[21 Aug 2021 1:48] chao gao
I have signed the OCA, but I didn’t get a reply
[24 Aug 2021 1:50] chao gao
Rotate Binlog will leads to Relaylog Rotate and purged,The reason of this problem is the long lock time during relaylog purged.