| Bug #74607 | slave io_thread may get stuck when using GTID and low slave_net_timeouts | | |
|---|---|---|---|
| Submitted: | 28 Oct 2014 16:25 | Modified: | 20 Mar 2015 17:42 |
| Reporter: | Santosh Praneeth Banda | Email Updates: | |
| Status: | Closed | Impact on me: | |
| Category: | MySQL Server: Replication | Severity: | S2 (Serious) |
| Version: | 5.6.21 | OS: | Any |
| Assigned to: | | CPU Architecture: | Any |
[28 Oct 2014 16:25]
Santosh Praneeth Banda
[6 Nov 2014 14:00]
MySQL Verification Team
Hello Santosh, thank you for the bug report and the steps. I observed both the a) and b) behaviors at my end. Thanks, Umesh
[6 Nov 2014 14:09]
MySQL Verification Team
Worklog details:
Attachment: 74607_steps.txt (text/plain), 11.80 KiB.
[20 Mar 2015 17:42]
David Moss
Thanks for your feedback. This has been fixed in upcoming versions, and the following text was added to the 5.6.24 and 5.7.7 changelogs: When gtid_mode=ON and slave_net_timeout was set to a low value, the slave I/O thread could appear to hang. This was due to the slave heartbeat not being sent regularly enough when the dump thread found many events that could be skipped. The fix ensures that the heartbeat is sent correctly in such a situation.
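For illustration, the timing rule behind this changelog entry can be sketched in a few lines of C++: the slave I/O thread waits at most slave_net_timeout for either an event or a heartbeat and treats the connection as dead if neither arrives. The sketch below is not MySQL code; fake_master_read, the 5-second timeout and the delays are invented for it. (By default the replication heartbeat period is half of slave_net_timeout, so a master that skips events for longer than the timeout without heartbeating trips exactly this rule.)

```cpp
#include <algorithm>
#include <chrono>
#include <iostream>
#include <optional>
#include <thread>

using namespace std::chrono;

enum class Packet { Event, Heartbeat };

// Pretend master: the first reply arrives quickly, the second takes longer than
// the timeout, standing in for a dump thread that is busy skipping GTID groups
// and sends nothing in the meantime.
std::optional<Packet> fake_master_read(seconds slave_net_timeout, int call) {
  const seconds master_delay = (call == 0) ? seconds{1} : seconds{7};
  std::this_thread::sleep_for(std::min(master_delay, slave_net_timeout));
  if (master_delay > slave_net_timeout)
    return std::nullopt;                       // nothing arrived before the deadline
  return Packet::Heartbeat;
}

int main() {
  const seconds slave_net_timeout{5};          // a deliberately low value, as in this report
  for (int call = 0; call < 2; ++call) {
    if (fake_master_read(slave_net_timeout, call)) {
      std::cout << "packet received; connection considered alive\n";
    } else {
      std::cout << "no event or heartbeat within slave_net_timeout; "
                   "the I/O thread would give up and reconnect here\n";
    }
  }
}
```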
[27 Apr 2015 12:39]
Laurynas Biveinis
commit 9ab03d0d41b25b86978b7a0aaf12f4a77c96dc27
Author: Venkatesh Duggirala <venkatesh.duggirala@oracle.com>
Date: Mon Feb 16 17:28:50 2015 +0530
Bug#19975697 5.6: SLAVE IO_THREAD MAY GET STUCK WHEN USING
GTID AND LOW SLAVE_NET_TIMEOUTS
Problem: When GTID is enabled, the dump thread does not check whether a
heartbeat event is due while it scans the binary log files and skips GTID
groups that are already present on the Slave.
Analysis: The dump thread sends a heartbeat event to the Slave when it has had
no events to send for "heartbeat_period" seconds, in order to keep the
connection between Master and Slave alive. But while the dump thread is
scanning a binary log file and finds many GTID groups (events) that need to be
skipped, it never checks this time period and never considers sending a
heartbeat event to the Slave.
There are two problems with the existing code in this scenario:
Problem 1: If the dump thread spends a long time skipping groups (because many
groups need to be skipped) without sending any heartbeat event, the Slave
concludes that the Master is dead and tries to re-establish the connection.
Problem 2: The dump thread has two while loops to process the events on the
Master side:
a) outer loop: processes the binary log files one by one
b) inner loop: processes the events of one file one by one
The outer loop checks the 'thd->killed' flag to see whether the dump thread
was killed while moving between files and, if so, exits the loop. The inner
loop has no such check, so it ends up processing the whole binary log file
(which takes a long time if the file is huge). That work is unnecessary if the
dump thread has been killed for some reason (for example, it may have been
detected as a zombie thread by a new dump request from the Slave).
Fix:
1) The dump thread now checks whether it is time to send a heartbeat event
before skipping an event and, if so, sends one heartbeat event to the Slave.
2) The inner loop also checks the thd->killed flag to avoid unnecessary work.
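The fix lends itself to a short illustration. Below is a minimal, self-contained C++ sketch of the dump-thread control flow after the change; it is not the actual server source, and DumpThread, Event, send_heartbeat and the other names are invented for this sketch.

```cpp
#include <atomic>
#include <chrono>
#include <vector>

struct Event { bool skippable; };                // stands in for a binlog event / GTID group

using Clock = std::chrono::steady_clock;

struct DumpThread {
  std::atomic<bool> killed{false};               // mirrors thd->killed
  std::chrono::seconds heartbeat_period{30};     // "heartbeat_period" from the analysis above
  Clock::time_point last_sent = Clock::now();

  void send(const Event&) { last_sent = Clock::now(); }   // pretend to send an event
  void send_heartbeat()   { last_sent = Clock::now(); }   // pretend to send a heartbeat

  // Outer loop: one binary log file after another.
  void dump(const std::vector<std::vector<Event>>& binlog_files) {
    for (const auto& file : binlog_files) {
      if (killed) return;                        // before the fix, only this loop checked the flag
      // Inner loop: the events of one file.
      for (const auto& ev : file) {
        if (killed) return;                      // fix 2: bail out of the inner loop as well
        if (ev.skippable) {
          // fix 1: even while skipping, keep the connection alive.
          if (Clock::now() - last_sent >= heartbeat_period)
            send_heartbeat();
          continue;                              // GTID group already on the Slave: skip it
        }
        send(ev);                                // a group the Slave still needs
      }
    }
  }
};

int main() {
  DumpThread dt;
  dt.dump({{Event{true}, Event{false}, Event{true}}});    // one file with three events
}
```

Checking the heartbeat deadline inside the skip path is what keeps the connection alive even when no real event is sent for a long stretch, and the extra killed check stops a zombie dump thread from scanning an entire file for nothing.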
[14 Sep 2015 15:43]
Jon Stephens
See also BUG#78389.
