Bug #119091 Purge Relaylog Leads To Semi-sync Increased Response even set relay_log_space_limit=0
Submitted: 29 Sep 9:56 Modified: 11 Oct 8:51
Reporter: karry zhang (OCA)
Status: Duplicate
Category: MySQL Server: Replication Severity: S3 (Non-critical)
Version: OS: Any
Assigned to: MySQL Verification Team CPU Architecture: Any

[29 Sep 9:56] karry zhang
Description:
I've noticed jitter in semi-synchronous replication. After troubleshooting, I found that it is caused by relay log purging. I also found the following earlier bug report:

https://bugs.mysql.com/bug.php?id=103943

The MySQL Verification Team previously considered this a non-bug. However, I still believe it is a bug: even when relay_log_space_limit is set to 0, meaning the relay log space is unlimited, the IO thread still has to wait for the SQL thread to finish purging the relay log, and that wait is what causes the semi-synchronous replication jitter.

How to repeat:
You can follow the method given in https://bugs.mysql.com/bug.php?id=103943

Suggested fix:
The most time-consuming function is mysql_file_delete. I recommend that this operation not hold log_space_lock.
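
To illustrate the suggestion, here is a minimal standalone sketch, not the actual server code. RelayLogSpace and purge_relay_log are hypothetical names, std::mutex stands in for log_space_lock, and std::remove stands in for mysql_file_delete. The space accounting is updated under the lock, and the slow file deletion runs after the lock is released, so a thread that only needs the lock is not blocked by file system latency.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <mutex>
#include <string>

// Hypothetical stand-in for the relay log bookkeeping; not the real Relay_log_info.
struct RelayLogSpace {
  std::mutex log_space_lock;          // models log_space_lock
  std::uint64_t log_space_total = 0;  // models the Relay_Log_Space counter
};

// Purge one relay log file. Only the counter update happens under the lock;
// the file deletion runs after the lock has been released.
// Returns true if the file was removed.
bool purge_relay_log(RelayLogSpace &space, const std::string &path,
                     std::uint64_t file_size) {
  std::uint64_t released = 0;
  {
    std::lock_guard<std::mutex> guard(space.log_space_lock);
    released = std::min(file_size, space.log_space_total);
    space.log_space_total -= released;
  }  // lock released before touching the file system

  if (std::remove(path.c_str()) == 0) return true;

  // Deletion failed: put the space back so the counter stays accurate.
  std::lock_guard<std::mutex> guard(space.log_space_lock);
  space.log_space_total += released;
  return false;
}
```

The trade-off in this sketch is that the counter briefly runs ahead of the file system while the delete is in flight. Whether that is acceptable in the server is for the developers to judge.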
[7 Oct 0:56] MySQL Verification Team
This looks like a duplicate of Bug #103943

Bug #103943 is verified; it is not marked as "not a bug".
[11 Oct 8:51] karry zhang
I noticed the following comment from the MySQL Verification Team at https://bugs.mysql.com/bug.php?id=103943:

I think this is not a bug and that this has been done intentionally, but I will verify this behavior and give the dev team a chance to properly explain what they changed and why, or fix it if it truly is a bug.

The official fix has yet to be released.

I filed this bug to analyze the problem from a new perspective. We set relay_log_space_limit to 0 so that the space occupied by the relay log is ignored, yet deleting the relay log still degrades semi-synchronous replication. This behavior is unacceptable. I noticed in the code that some code paths check whether relay_log_space_limit is 0 and skip the wait in that case, for example:
if (rli->log_space_limit &&
    rli->log_space_limit < rli->log_space_total &&
    !rli->ignore_log_space_limit)
  if (wait_for_relay_log_space(rli)) {
    LogErr(
        ERROR_LEVEL,
        ER_RPL_REPLICA_IO_THREAD_ABORTED_WAITING_FOR_RELAY_LOG_SPACE);
    goto err;
  }
However, the problem still exists. Since "show slave status" displays Relay_Log_Space, I think it is reasonable to decouple file deletion from the Relay_Log_Space accounting.
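
For reference, the guard quoted above can be restated as a standalone predicate. This is only a sketch: SpaceState and io_thread_waits_for_space are hypothetical names, and the fields mirror the ones in the snippet. With a limit of 0 the predicate is always false, so this particular wait is skipped and the remaining stall has to come from another code path.

```cpp
#include <cstdint>

// Hypothetical plain struct mirroring the three fields used by the quoted guard.
struct SpaceState {
  std::uint64_t log_space_limit;  // 0 means "unlimited"
  std::uint64_t log_space_total;  // space currently used by relay logs
  bool ignore_log_space_limit;    // temporary override flag
};

// True only when a non-zero limit is exceeded and the limit is not being ignored.
bool io_thread_waits_for_space(const SpaceState &s) {
  return s.log_space_limit != 0 &&
         s.log_space_limit < s.log_space_total &&
         !s.ignore_log_space_limit;
}
```

For example, io_thread_waits_for_space({0, 1000, false}) is false, while io_thread_waits_for_space({100, 1000, false}) is true.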

This problem still exists in the latest 8.0 release and in later versions. I hope it can be fixed officially in a future release.