Bug #101056 Ack_receiver may lost the slave semi-sync ack due to net timeout
Submitted: 5 Oct 2020 17:35 Modified: 16 Jun 2022 14:52
Reporter: dennis GAO (OCA) Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Replication Severity:S2 (Serious)
Version:8.0.19 OS:Any
Assigned to: CPU Architecture:Any
Tags: Contribution

[5 Oct 2020 17:35] dennis GAO
Description:
For semi-sync master, the Ack_receiver will loop all semi-sync slaves to receive slave ack response packet and invoke reportReplyPacket.

But if the Ack_receiver read one slave ack packet partial with net timeout, Ack_receiver will not be able to get this ack packet correctly any more.

The reason is:
1. Before invoke my_net_read to read client ack packet, "net_clear(&net, false);" is invoked, which will clear the previous partial read packet info.
2. All slaves share one net.buff, so the next slave must do net_clear before read its ack packet
3. my_net_read does not designed to handle packet partial received situation. For each invoke, it will always use net->buff to receive the packet header and then use net->buff to receive the packet body again. Each receiving store the net info from the begin of the net->buff, which means the net->buff is reset by each net_read_raw_loop.

So if partial read packet, the received part will always be abandoned, so the rest part will not be able to receive correctly.
For semi-sync, the Ack_receiver may get error "packet out of order" or "magic number error" or "larger than max_allowed_packet", if the ack packet partial received.

How to repeat:
Run the test case in the attached patch.
The master server will not get enough ack response from slaves due to ack packet partial received, and the semi-sync timeout will happen.

Suggested fix:
In order to fix this bug:
1. we should use different net buffer for each semi-sync slave
2. store the partial read info into the slave, and use this info to reset the NET object before invoking my_net_read

Add the following attributes into the struct Slave:

  unsigned char *buff, *buff_end, *write_pos, *read_pos;
  unsigned char net_buff[REPLY_MESSAGE_MAX_LENGTH];
  bool has_read_head;
  ulong body_len_nead;
  ulong has_read_len;
  uchar pkt_nr;
  uchar compress_pkt_nr;

Above attributes will provide the separate net buffer for each slave, and kept the partial packet info.

And for struct NET, we add the following attributes:

  bool need_handle_partial_read;
  bool partial_read_head_once;
  bool partial_read_body_once;
  bool has_read_head;
  ulong body_len_nead;
  ulong has_read_len;

So each time, we use the slave's kept info to reset NET before invoke my_net_read, and after packet handling , we store the current NET attributes into the related slave.

The my_net_read will calculate the rest part of packet from the info of NET, and return the correct packet for reportReplyPacket.
[5 Oct 2020 17:38] dennis GAO
adding the patch as contribution

(*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: fix_semi_sync_partial_read_bug-v1.diff (text/x-patch), 18.28 KiB.

[6 Oct 2020 5:31] MySQL Verification Team
Hello dennis GAO,

Thank you for the report and contribution.

regards,
Umesh
[28 Apr 2022 12:58] MySQL Verification Team
===============================================
Posted by developer in internal bug
===============================================

Hi,

We would like to know exactly why a timeout occurs during partial read.

The test provided in the BugReport artificially simulate the partial read,
but
could you please provide exact scenario that can cause partial read?

This will help us to understand the real root cause and provide the solution
accurately.

Thank you.
[28 Apr 2022 15:19] dennis GAO
Hi,

This bug is found in our customer test environment.
High stress test with Gigabit network.
Normally one transaction only cost 60ms, but We find some long time commit in some second. Then we find the following error message in mysql log:

2020-05-04T23:30:04.908982+08:00 0 [ERROR] /app/dbcluster/xxx/mysql/bin/mysqld: Got timeout reading communication packets
2020-05-04T23:30:04.959892+08:00 0 [ERROR] Read semi-sync reply magic number error
2020-05-04T23:30:04.959919+08:00 0 [ERROR] /app/dbcluster/xxx/mysql/bin/mysqld: Got packets out of order
2020-05-04T23:30:04.959939+08:00 0 [ERROR] /app/dbcluster/xxx/mysql/bin/mysqld: Got packets out of order
2020-05-04T23:30:04.959946+08:00 0 [ERROR] /app/dbcluster/xxx/mysql/bin/mysqld: Got packets out of order
2020-05-04T23:30:04.959953+08:00 0 [ERROR] /app/dbcluster/xxx/mysql/bin/mysqld: Got a packet bigger than 'max_allowed_packet' bytes
2020-05-04T23:30:04.959961+08:00 0 [ERROR] /app/dbcluster/xxx/mysql/bin/mysqld: Got packets out of order

The max_allowed_packet is set big enough, and if we close semi-sync, above error and long time transaction will become much better.

For more analysis, we find the network may have some problems:
1. ping time for slave is much higher than master
2. find some IcmpInErrors
3. find some rx dropped
4. By using tcpdump, we find some incomplete packet
[28 Apr 2022 16:00] MySQL Verification Team
Thank you for the details.
[4 May 2022 5:11] Justin Jose
Posted by developer:
 
Hi,

Thank you for the details, very much appreciated.

We would need a reproducible scenario to confirm the root cause.
Could you please provide reproducible scenario/steps that caused the partial read?

Also, could you please provide below details
1. What is the topology shape, specifically, how many semi-sync replicas are there per source and intermediate source
2. Is compression enabled for communication

Thank you!
[10 May 2022 8:06] dennis GAO
Hi,

For reproducible scenario, I suggest to do a high press test(such as benchmarksql with 1024 concurrency) with 100Mb/s internet interface. 

The replica topo is m1->s1, m1->s2, m1->s3. m1 has three semi-sync slaves.
The compression is disabled.

But it may be hard to reproduce, cause it depends on some network behavior.

IMO, if the slave mysql send a semi-sync ack response, there is no guarantee the whole ack response will be in one TCP segment.
 
It may be divided into two TCP segments. If the first TCP segment received by master mysql, but the second TCP segment is blocked by the network congestion, the time interval between the 2 TCP segments sent to master may exceed 1ms.
So the my_net_read will read part bytes of the ACK packet, and return packet_error caused by wait timeout(exceed 1ms), leaving the rest bytes of ACK packet alone. So in the new next round, my_net_read will start from the rest bytes of the last ACK packet.
[11 May 2022 12:00] Justin Jose
Posted by developer:
 
Thank you very much for the details.
[27 May 2022 20:30] Matt Lord
I think this is the same underlying issue reported here: https://bugs.mysql.com/bug.php?id=104015
[16 Jun 2022 14:52] Margaret Fisher
Posted by developer:
 
Changelog entry added for MySQL 8.0.30:

MySQL’s semisynchronous replication did not respect the value of the net_read_timeout system variable and forced a read timeout of one millisecond. This could result in the function experiencing partial reads of acknowledgment messages and packets arriving out of order, while other connections in the MySQL system were functioning correctly. The value of the net_read_timeout system variable is now applied to connections for semisynchronous replication.