Bug #101056 | Ack_receiver may lost the slave semi-sync ack due to net timeout | ||
---|---|---|---|
Submitted: | 5 Oct 2020 17:35 | Modified: | 16 Jun 2022 14:52 |
Reporter: | dennis GAO (OCA) | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Server: Replication | Severity: | S2 (Serious) |
Version: | 8.0.19 | OS: | Any |
Assigned to: | CPU Architecture: | Any | |
Tags: | Contribution |
[5 Oct 2020 17:35]
dennis GAO
[5 Oct 2020 17:38]
dennis GAO
adding the patch as contribution (*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.
Contribution: fix_semi_sync_partial_read_bug-v1.diff (text/x-patch), 18.28 KiB.
[6 Oct 2020 5:31]
MySQL Verification Team
Hello dennis GAO, Thank you for the report and contribution. regards, Umesh
[28 Apr 2022 12:58]
MySQL Verification Team
=============================================== Posted by developer in internal bug =============================================== Hi, We would like to know exactly why a timeout occurs during partial read. The test provided in the BugReport artificially simulate the partial read, but could you please provide exact scenario that can cause partial read? This will help us to understand the real root cause and provide the solution accurately. Thank you.
[28 Apr 2022 15:19]
dennis GAO
Hi, This bug is found in our customer test environment. High stress test with Gigabit network. Normally one transaction only cost 60ms, but We find some long time commit in some second. Then we find the following error message in mysql log: 2020-05-04T23:30:04.908982+08:00 0 [ERROR] /app/dbcluster/xxx/mysql/bin/mysqld: Got timeout reading communication packets 2020-05-04T23:30:04.959892+08:00 0 [ERROR] Read semi-sync reply magic number error 2020-05-04T23:30:04.959919+08:00 0 [ERROR] /app/dbcluster/xxx/mysql/bin/mysqld: Got packets out of order 2020-05-04T23:30:04.959939+08:00 0 [ERROR] /app/dbcluster/xxx/mysql/bin/mysqld: Got packets out of order 2020-05-04T23:30:04.959946+08:00 0 [ERROR] /app/dbcluster/xxx/mysql/bin/mysqld: Got packets out of order 2020-05-04T23:30:04.959953+08:00 0 [ERROR] /app/dbcluster/xxx/mysql/bin/mysqld: Got a packet bigger than 'max_allowed_packet' bytes 2020-05-04T23:30:04.959961+08:00 0 [ERROR] /app/dbcluster/xxx/mysql/bin/mysqld: Got packets out of order The max_allowed_packet is set big enough, and if we close semi-sync, above error and long time transaction will become much better. For more analysis, we find the network may have some problems: 1. ping time for slave is much higher than master 2. find some IcmpInErrors 3. find some rx dropped 4. By using tcpdump, we find some incomplete packet
[28 Apr 2022 16:00]
MySQL Verification Team
Thank you for the details.
[4 May 2022 5:11]
Justin Jose
Posted by developer: Hi, Thank you for the details, very much appreciated. We would need a reproducible scenario to confirm the root cause. Could you please provide reproducible scenario/steps that caused the partial read? Also, could you please provide below details 1. What is the topology shape, specifically, how many semi-sync replicas are there per source and intermediate source 2. Is compression enabled for communication Thank you!
[10 May 2022 8:06]
dennis GAO
Hi, For reproducible scenario, I suggest to do a high press test(such as benchmarksql with 1024 concurrency) with 100Mb/s internet interface. The replica topo is m1->s1, m1->s2, m1->s3. m1 has three semi-sync slaves. The compression is disabled. But it may be hard to reproduce, cause it depends on some network behavior. IMO, if the slave mysql send a semi-sync ack response, there is no guarantee the whole ack response will be in one TCP segment. It may be divided into two TCP segments. If the first TCP segment received by master mysql, but the second TCP segment is blocked by the network congestion, the time interval between the 2 TCP segments sent to master may exceed 1ms. So the my_net_read will read part bytes of the ACK packet, and return packet_error caused by wait timeout(exceed 1ms), leaving the rest bytes of ACK packet alone. So in the new next round, my_net_read will start from the rest bytes of the last ACK packet.
[11 May 2022 12:00]
Justin Jose
Posted by developer: Thank you very much for the details.
[27 May 2022 20:30]
Matt Lord
I think this is the same underlying issue reported here: https://bugs.mysql.com/bug.php?id=104015
[16 Jun 2022 14:52]
Margaret Fisher
Posted by developer: Changelog entry added for MySQL 8.0.30: MySQL’s semisynchronous replication did not respect the value of the net_read_timeout system variable and forced a read timeout of one millisecond. This could result in the function experiencing partial reads of acknowledgment messages and packets arriving out of order, while other connections in the MySQL system were functioning correctly. The value of the net_read_timeout system variable is now applied to connections for semisynchronous replication.