Bug #113115 semi-sync slave exit checksum error and master exit with net_flush error
Submitted: 17 Nov 2023 2:20 Modified: 27 Dec 2023 6:35
Reporter: phoenix Zhang (OCA)
Status: Can't repeat
Category: MySQL Server: Replication Severity: S3 (Non-critical)
Version: 5.7.21 OS: SUSE
Assigned to: MySQL Verification Team CPU Architecture: Any

[17 Nov 2023 2:20] phoenix Zhang
We have one system deployed with 1 master + 3 semi-sync slaves + 1 async slave + 1 external binlog tool that uses the BINLOG_DUMP command.

At one moment, all 3 semi-sync slave nodes exited with a checksum error. After the slave errors, their binlog_dump threads on the master node printed a semi-sync ack magic number error and a net_flush error, and then exited. The binlog_dump thread serving the external binlog tool also exited at that moment. However, the async slave kept running normally.

For the detailed log, please see the attached file: log.txt

As the log shows, all semi-sync slaves report the same checksum error. After the checksum error on a slave and the exit of its IO thread, the corresponding binlog_dump thread on the master prints "Read semi-sync reply magic number error" and "Semi-sync master failed on net_flush() before waiting for slave reply", and then exits.
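
For illustration, here is a minimal Python sketch of the kind of check that sits behind a "reply magic number error". It assumes the semi-sync ack packet starts with a one-byte magic number (0xEF), followed by an 8-byte little-endian binlog position and then the binlog file name. The constant names, the helper functions, and the file name mysql-bin.000042 are hypothetical and only model the idea; this is not the plugin's actual code.

```python
import struct

PACKET_MAGIC_NUM = 0xEF   # assumption: mirrors the magic byte used by the semi-sync plugin
BINLOG_POS_OFFSET = 1     # assumption: 8-byte little-endian position follows the magic byte
BINLOG_NAME_OFFSET = 9    # assumption: the binlog file name fills the rest of the packet


def build_reply(log_file, log_pos):
    """Build a toy semi-sync ack: magic byte + position + file name."""
    return bytes([PACKET_MAGIC_NUM]) + struct.pack("<Q", log_pos) + log_file.encode("ascii")


def parse_reply(packet):
    """Parse a toy ack, rejecting packets that are too short or have the wrong magic byte."""
    if len(packet) < BINLOG_NAME_OFFSET:
        raise ValueError("reply length error")
    if packet[0] != PACKET_MAGIC_NUM:
        raise ValueError("reply magic number error")
    (log_pos,) = struct.unpack_from("<Q", packet, BINLOG_POS_OFFSET)
    log_file = packet[BINLOG_NAME_OFFSET:].decode("ascii")
    return log_file, log_pos
```

In this model, any bytes that arrive on the connection in place of a well-formed ack (for example, leftovers of a session that the slave has already abandoned) fail the first-byte check, which would be consistent with the master-side error appearing only after the slave IO thread had stopped.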

All the errors happened at the same timestamp, 2023-11-03T11:46:03. We did not find any exception or error log at the OS or network level.

How to repeat:
It is an online production database; the problem happened once and cannot be reproduced in a local test environment.
[17 Nov 2023 2:21] phoenix Zhang
The detailed log info

Attachment: log.txt (text/plain), 9.78 KiB.

[17 Nov 2023 23:21] MySQL Verification Team

Thank you for the report, but I have issues reproducing this. 5.7.21 is rather old; 5.7.44 is the current 5.7 release, so please upgrade.
[27 Dec 2023 6:35] phoenix Zhang
I found a similar bug: https://bugs.mysql.com/bug.php?id=84752

Bug#84752 reports the error: bogus data in log event.

From the source code, Log_event::read_log_event executes "data_len= uint4korr(buf + EVENT_LEN_OFFSET)". If data_len is less than 19, it reports the bogus data error. Otherwise, if data_len is still incorrect but neither too small nor too large, it may produce a checksum error instead.
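
To make the reasoning above concrete, here is a minimal Python sketch of how the same kind of corruption (a wrong event length field) can surface as different errors. It assumes the v4 event header layout (19 bytes, with the 4-byte little-endian event length at offset 9) and a CRC32 trailer over everything before it. The functions make_event and classify, the max_len limit, and the returned strings are hypothetical simplifications, not the server's actual code.

```python
import struct
import zlib

EVENT_LEN_OFFSET = 9   # offset of the 4-byte event length in the v4 header
MIN_HEADER_LEN = 19    # minimal header length the reporter refers to
CHECKSUM_LEN = 4       # CRC32 trailer appended to each event


def make_event(payload):
    """Build a toy event: 19-byte header + payload + CRC32 trailer."""
    total_len = MIN_HEADER_LEN + len(payload) + CHECKSUM_LEN
    # Fields: timestamp, type_code, server_id, event_length, next_position, flags
    header = struct.pack("<IBIIIH", 0, 2, 1, total_len, 0, 0)
    body = header + payload
    return body + struct.pack("<I", zlib.crc32(body) & 0xFFFFFFFF)


def classify(stream, max_len=1 << 20):
    """Classify the first event in a byte stream, using the length field as read."""
    (data_len,) = struct.unpack_from("<I", stream, EVENT_LEN_OFFSET)
    if data_len < MIN_HEADER_LEN:
        return "bogus data"        # length smaller than a header can be
    if data_len > max_len:
        return "too large"         # hypothetical upper bound
    if data_len > len(stream):
        return "truncated"
    event = stream[:data_len]
    (stored,) = struct.unpack_from("<I", event, data_len - CHECKSUM_LEN)
    computed = zlib.crc32(event[:-CHECKSUM_LEN]) & 0xFFFFFFFF
    return "ok" if stored == computed else "checksum error"
```

In this model, overwriting the length field with 5 yields "bogus data", while overwriting it with a plausible but wrong value makes the reader take the trailer from the wrong offset and yields "checksum error".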

In bug#84752, your development team says the bug was fixed by Bug #22252394, whose release note is:

With sync_binlog=1 set, if the binary log was rotated during a commit before the binary log end position was updated, replication stopped on the slave because the server attempted to use the old binary log end position with the new binary log file. The server now compares the binary log file name with the active binary log file when updating the binary log end position, so that the issue does not occur.

However, I am not sure that the fix for Bug #22252394 also fixes bug#84752. Bug #22252394 would produce "unknown error reading log event on the master" or "binlog truncated in the middle of event", because the server cannot read data when it uses the end_pos of the previous file. But in bug#84752, the event header is read successfully and the buffer contents are incorrect.
Also, in bug#84752 the "bogus data in log_event" error happens at pos=211359709, which is about 200 MB into the file, so it is unlikely to have happened during a binlog rotation.
[27 Dec 2023 8:04] MySQL Verification Team

Yes, but I cannot reproduce this with 5.7.44, so whether it was fixed by the fix for the bug you found or by some other fix is not really relevant, as long as it is fixed. Can you reproduce it with 5.7.44 or not? It really makes no difference whether or not you can reproduce it with 5.7.21.