Bug #74607 | slave io_thread may get stuck when using GTID and low slave_net_timeouts | | |
---|---|---|---
Submitted: | 28 Oct 2014 16:25 | Modified: | 20 Mar 2015 17:42 |
Reporter: | Santosh Praneeth Banda | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Server: Replication | Severity: | S2 (Serious) |
Version: | 5.6.21 | OS: | Any |
Assigned to: | | CPU Architecture: | Any
[28 Oct 2014 16:25]
Santosh Praneeth Banda
[6 Nov 2014 14:00]
MySQL Verification Team
Hello Santosh,

Thank you for the bug report and the steps. I observed both behaviors a) and b) on my end.

Thanks,
Umesh
[6 Nov 2014 14:09]
MySQL Verification Team
Worklog details:
Attachment: 74607_steps.txt (text/plain), 11.80 KiB.
[20 Mar 2015 17:42]
David Moss
Thanks for your feedback. This has been fixed in upcoming versions, and the following text was added to the 5.6.24 and 5.7.7 changelogs:

When gtid_mode=ON and slave_net_timeout was set to a low value, the slave I/O thread could appear to hang. This was due to the slave heartbeat not being sent regularly enough when the dump thread found many events that could be skipped. The fix ensures that the heartbeat is sent correctly in such a situation.
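For context, here is a minimal sketch of why that silence makes the slave I/O thread appear to hang and then reconnect. It assumes only documented replication behavior (the heartbeat period defaults to slave_net_timeout / 2); read_event_with_timeout() is a hypothetical stand-in, not a real server function:

```cpp
// Sketch: the slave I/O thread expects an event or a heartbeat within
// slave_net_timeout seconds; total silence is treated as a dead master.
#include <chrono>
#include <cstdio>
#include <optional>

struct Event {};

// Hypothetical stand-in for the network read the I/O thread performs.
static std::optional<Event> read_event_with_timeout(std::chrono::seconds) {
  return std::nullopt;  // simulate a dump thread that sends nothing at all
}

int main() {
  const std::chrono::seconds slave_net_timeout{10};
  // On a healthy connection, heartbeats arrive every slave_net_timeout / 2
  // seconds (the MASTER_HEARTBEAT_PERIOD default), so this read should
  // never time out.
  if (auto ev = read_event_with_timeout(slave_net_timeout)) {
    (void)ev;  // queue the event into the relay log ...
  } else {
    // No traffic within the timeout: the I/O thread assumes the master is
    // gone and reconnects. With this bug, a dump thread busy skipping GTID
    // groups produces exactly this silence, so the slave reconnects
    // repeatedly and appears stuck.
    std::puts("no event or heartbeat within slave_net_timeout; reconnecting");
  }
}
```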
[27 Apr 2015 12:39]
Laurynas Biveinis
commit 9ab03d0d41b25b86978b7a0aaf12f4a77c96dc27
Author: Venkatesh Duggirala <venkatesh.duggirala@oracle.com>
Date: Mon Feb 16 17:28:50 2015 +0530

Bug#19975697 5.6: SLAVE IO_THREAD MAY GET STUCK WHEN USING GTID AND LOW SLAVE_NET_TIMEOUTS

Problem: When GTID is enabled, the dump thread does not check whether a heartbeat event is due while it is scanning through the binary log files and skipping GTID groups that are already present on the slave.

Analysis: The dump thread sends a heartbeat event to the slave if there have been no events to send for "heartbeat_period" seconds, to keep the connection between master and slave active. But when the dump thread is scanning a binary log file and finds many GTID groups (events) that need to be skipped, it neither tracks this time period nor considers sending a heartbeat event to the slave. There are two problems with the existing code in this scenario:

Problem 1: If the dump thread spends a long time skipping groups (when many need to be skipped) without sending any heartbeat event, the slave concludes that the master is dead and tries to re-establish the connection.

Problem 2: The dump thread has two while loops to process events on the master side: a) an outer loop that processes the binary log files one by one, and b) an inner loop that processes the events of one file one by one. The outer loop checks the 'thd->killed' flag to detect whether the dump thread was killed between files and, if so, exits the loop. But the inner loop has no such check, so it ends up processing the entire binary log file (which can take a long time if the file is huge). This work is unnecessary if the dump thread has been killed, for example because it was detected as a zombie thread by a new dump request from the slave.

Fix:
1) The dump thread now checks whether it is time to send a heartbeat event before skipping an event. If so, it sends one heartbeat event to the slave.
2) The inner loop also checks the thd->killed flag to avoid unnecessary work.
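The following condensed sketch shows the shape of the two fixes described in the commit message. All names, the THD struct, and the loop bounds are illustrative stand-ins, not the actual server code:

```cpp
// Sketch of the fixed dump-thread loops: heartbeat check before skipping
// an event (Fix 1) and a killed-flag check in the inner loop (Fix 2).
#include <cstdio>
#include <ctime>

struct THD { bool killed = false; };  // stand-in for the server's THD

// Stubs standing in for real binlog processing.
static bool event_is_in_skipped_gtid_group() { return true; }
static void send_event_to_slave() {}
static void send_heartbeat_event() { std::puts("heartbeat sent"); }

static void dump_one_binlog(THD *thd, double heartbeat_period,
                            std::time_t *last_sent) {
  const int events_in_file = 100000;  // pretend the file holds many events
  // Fix 2: the inner (per-event) loop now also honours thd->killed, so a
  // zombie dump thread stops mid-file instead of scanning a huge binlog
  // to the end.
  for (int i = 0; i < events_in_file && !thd->killed; ++i) {
    if (event_is_in_skipped_gtid_group()) {
      // Fix 1: even when an event is skipped, check whether a heartbeat
      // is overdue and send one so the slave knows the master is alive.
      std::time_t now = std::time(nullptr);
      if (std::difftime(now, *last_sent) >= heartbeat_period) {
        send_heartbeat_event();
        *last_sent = now;
      }
      continue;  // do not send the skipped event itself
    }
    send_event_to_slave();
    *last_sent = std::time(nullptr);
  }
}

int main() {
  THD thd;
  std::time_t last_sent = std::time(nullptr);
  // The outer loop (one iteration per binlog file) already checked
  // thd->killed before the fix.
  while (!thd.killed) {
    dump_one_binlog(&thd, /*heartbeat_period=*/1.0, &last_sent);
    break;  // one file is enough for the sketch
  }
}
```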
[14 Sep 2015 15:43]
Jon Stephens
See also BUG#78389.