Bug #74607 slave io_thread may get stuck when using GTID and low slave_net_timeouts
Submitted: 28 Oct 2014 16:25  Modified: 20 Mar 2015 17:42
Reporter: Santosh Praneeth Banda  Status: Closed
Category: MySQL Server: Replication  Severity: S2 (Serious)
Version: 5.6.21  OS: Any
CPU Architecture: Any

[28 Oct 2014 16:25] Santosh Praneeth Banda
Description:
Note that the following steps reproduce two bugs:

a. slave io_thread stuck without actually progressing
b. very high CPU usage on the master due to a huge number of zombie dump threads running (contention on semi-sync plugin locks)

1. use GTID auto positioning protocol
2. use a 1 GB binlog file size; stop the slave at a high binlog position (>800 MB)
3. set slave_net_timeout to 1 second on the slave
4. free the page cache on the master so binlog file reads are slow. On Linux you can use this bash loop: while true; do echo 3 > /proc/sys/vm/drop_caches; done
5. start slave io_thread
6. run "show slave status" on the slave to verify that the slave io_thread is not progressing
7. check "show processlist" on the master to see that the number of zombie dump threads is increasing

How to repeat:
see description

Suggested fix:
1. send heartbeat events to the slave when a GTID group is skipped, so that the slave io_thread doesn't hit the read timeout

2. Check the thd->killed status in the inner loop of mysql_binlog_send() so that zombie dump threads exit immediately. Note that the outer loop already has a check for thd->killed.
[6 Nov 2014 14:00] MySQL Verification Team
Hello Santosh,

Thank you for the bug report and steps.
I observed both a) and b) behavior at my end.

Thanks,
Umesh
[6 Nov 2014 14:09] MySQL Verification Team
Worklog details:

Attachment: 74607_steps.txt (text/plain), 11.80 KiB.

[20 Mar 2015 17:42] David Moss
Thanks for your feedback, this has been fixed in upcoming versions and the following text was added to the 5.6.24 and 5.7.7 changelogs:

When gtid_mode=ON and slave_net_timeout was set to a low value, the slave I/O thread could appear to hang. This was due to the slave heartbeat not being sent regularly enough when the dump thread found many events that could be skipped. The fix ensures that the heartbeat is sent correctly in such a situation.
[27 Apr 2015 12:39] Laurynas Biveinis
commit 9ab03d0d41b25b86978b7a0aaf12f4a77c96dc27
Author: Venkatesh Duggirala <venkatesh.duggirala@oracle.com>
Date:   Mon Feb 16 17:28:50 2015 +0530

    Bug#19975697 5.6: SLAVE IO_THREAD MAY GET STUCK WHEN USING
    GTID AND LOW SLAVE_NET_TIMEOUTS
    
    Problem: When GTID is enabled, the dump thread does not check
    whether a heartbeat event is needed while it scans through the
    binary log files and skips GTID groups that are already present
    on the slave.
    
    Analysis: The dump thread sends a heartbeat event to the slave if
    there are no events to send for "heartbeat_period" seconds, to
    keep the connection between master and slave active. But while
    the dump thread scans a binary log file and finds many GTID
    groups (events) that need to be skipped, it neither checks this
    time period nor sends a heartbeat event to the slave.
    
    There are two problems with the existing code in this
    scenario:
    
    Problem 1: If the dump thread spends a long time skipping groups
    (many groups that need to be skipped) without sending any
    heartbeat event, the slave assumes the master is dead and tries
    to reestablish the connection.
    Problem 2: The dump thread has two while loops to process events
    on the master side:
          a) outer loop: processes the binary log files one by one
          b) inner loop: processes the events of one file one by one
    
    The outer loop checks the 'thd->killed' flag to see whether the
    dump thread was killed between files; if so, it exits the loop.
    
    But the inner loop has no such check, so it processes the full
    binary log file (which takes a long time if the file is huge).
    This is unnecessary work if the dump thread has been killed for
    some reason (one reason being that it was detected as a zombie
    thread by a newer dump request from the slave).
    
    Fix:
    1) The dump thread now checks whether it is time to send a
    heartbeat event before skipping an event; if so, it sends one
    heartbeat event to the slave.
    2) The inner loop also checks the thd->killed flag to avoid
    unnecessary work.
[14 Sep 2015 15:43] Jon Stephens
See also BUG#78389.