Bug #78674 Slave does not always reconnect properly after a master disconnect
Submitted: 1 Oct 2015 15:37
Modified: 14 Feb 2019 15:15
Reporter: Simon Mudd (OCA)
Status: Can't repeat
Category: MySQL Server: Replication
Severity: S3 (Non-critical)
Version: 5.6.25
OS: Red Hat (CentOS 6)
Assigned to: MySQL Verification Team
CPU Architecture: Any

[1 Oct 2015 15:37] Simon Mudd
Description:
After a failed connection to a master, I would expect the slave to time out and then try again. Sometimes this does not happen.

How to repeat:
root@myhost [(none)]> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Reconnecting after a failed master event read
                  Master_Host: some-master.example.com
                  Master_User: some-user
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: binlog.029802
          Read_Master_Log_Pos: 42551433
               Relay_Log_File: relaylog.000009
                Relay_Log_Pos: 42551593
        Relay_Master_Log_File: binlog.029802
             Slave_IO_Running: Connecting    <===========
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 42551433
              Relay_Log_Space: 42553234
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 187225014
                  Master_UUID: 02956150-acc7-11e4-aed7-e4115ba829ae
             Master_Info_File: mysql.slave_master_info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
           Master_Retry_Count: 86400
                  Master_Bind:
      Last_IO_Error_Timestamp:
     Last_SQL_Error_Timestamp:
               Master_SSL_Crl:
           Master_SSL_Crlpath:
           Retrieved_Gtid_Set:
            Executed_Gtid_Set:
                Auto_Position: 0
1 row in set (0.00 sec)
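
The symptom in the output above is a combination of three fields: Slave_IO_Running is "Connecting", Slave_IO_State starts with "Reconnecting", and Last_IO_Errno is 0. A minimal sketch of checking for that combination from the vertical output is below. The function names parse_slave_status and looks_stuck are hypothetical, and a single sample cannot tell a stuck thread from a normal reconnect in progress, so the check would need to stay true across several samples before it means anything.

```python
def parse_slave_status(text):
    """Parse the vertical output of SHOW SLAVE STATUS into a dict of strings."""
    status = {}
    for line in text.splitlines():
        # Skip the row separator, the prompt and the "1 row in set" footer.
        if ":" not in line or line.lstrip().startswith("*"):
            continue
        # Split on the first colon only; values may contain further colons.
        key, _, value = line.partition(":")
        status[key.strip()] = value.strip()
    return status


def looks_stuck(status):
    """Return True if the fields match the symptom described in this report.

    Assumption: this is only a point-in-time check. A healthy slave also
    passes through this state briefly while it reconnects.
    """
    return (
        status.get("Slave_IO_Running") == "Connecting"
        and status.get("Slave_IO_State", "").startswith("Reconnecting")
        and status.get("Last_IO_Errno") == "0"
    )
```

A monitoring script could call looks_stuck on each poll and raise an alert only after the result has been True for longer than a few Connect_Retry intervals.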

I have a heartbeat mechanism that periodically injects an event into a table, so I can check on the slave how far behind that event is.

This shows:

$ show_replication_status
Server:            some-server
State:             Reconnecting after a failed master event read
Master:            some-user@some-master.example.com:3306
Slave IO Running:  Connecting
Slave SQL Running: Yes
Replication Delay: NULL
Heartbeat Delay:   server_id 187225014: 916.94 (15m 16s behind master)

So the disconnect happened 15 minutes ago, yet the I/O thread is still stuck in the state "Reconnecting after a failed master event read".
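
For readers unfamiliar with this kind of heartbeat check, here is a rough sketch of the arithmetic behind the "Heartbeat Delay" line. It is not the actual show_replication_status tool. The function names are hypothetical, and it assumes the heartbeat row stores a Unix timestamp written on the master and that both clocks are synchronised.

```python
def heartbeat_delay(now_ts, last_heartbeat_ts):
    """Seconds between now and the newest heartbeat row seen on the slave.

    Assumption: both values are Unix timestamps from synchronised clocks.
    Small negative values caused by clock skew are clamped to zero.
    """
    return max(0.0, now_ts - last_heartbeat_ts)


def format_delay(seconds):
    """Render a delay in the 'Xm Ys' style, dropping fractional seconds."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes}m {secs}s"


# The delay reported above, 916.94 seconds, renders as "15m 16s".
print(format_delay(916.94))
```

Unlike Seconds_Behind_Master, which is NULL while the I/O thread is not connected, a delay computed this way keeps growing while the slave is stuck, which is what exposes the problem here.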

Note: running "stop slave; start slave;" resolves the problem.

I've seen this behaviour on several servers.

Suggested fix:
The MySQL logs show no evidence of repeated attempts to connect to the host, even though the master is up (other slaves are talking to it).

If a reconnection attempt fails, it should time out after a short period, and a new connection attempt should be made.
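
The behaviour asked for above can be sketched as a retry loop in which every attempt has its own bounded timeout. This is only an illustration of the expected behaviour, not MySQL's implementation. The function name connect_with_retry and the 10 second connect_timeout are assumptions; the retry interval of 60 and the retry count of 86400 mirror Connect_Retry and Master_Retry_Count from the status output above.

```python
import socket
import time


def connect_with_retry(host, port, connect_timeout=10.0, retry_interval=60.0,
                       max_retries=86400, connect=socket.create_connection,
                       sleep=time.sleep):
    """Try to connect, bounding each attempt and sleeping between attempts.

    The connect and sleep parameters exist so the loop can be exercised
    without a real network; by default they use the standard library.
    """
    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            # A hung attempt cannot block forever: it fails after
            # connect_timeout seconds and the loop moves on.
            return connect((host, port), timeout=connect_timeout)
        except OSError as exc:  # includes timeouts and refused connections
            last_error = exc
            if attempt < max_retries:
                sleep(retry_interval)
    raise ConnectionError(f"gave up after {max_retries} attempts") from last_error
```

The point of the sketch is that no single attempt can wait indefinitely, so a master that becomes reachable again is picked up within roughly one timeout plus one retry interval.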
[12 Feb 2019 21:54] MySQL Verification Team
Hi Simon,

I have personally seen this live more than once: the slave stuck in the "Connecting" state. Is this what you are seeing, or does it go from "Connecting" to just "No"?

To debug this we would need a way to reproduce it on demand, but I was never able to. I have seen it a number of times between two data centers. Can you shed some light on the infrastructure where you see this:
 - LAN? or ?
 - same switch or ?
 - high transaction volume?
 - big transactions?
 - kernel parameters?
 - does it also happen with SSL? (I was never able to reproduce it with SSL enabled)

thanks
Bogdan
[14 Feb 2019 8:23] Simon Mudd
Given that I reported this 4 years ago and no longer run MySQL 5.6, I'm not going to be able to reproduce this now.

I don't remember seeing it frequently and I haven't seen it for some time. So maybe it's worthwhile closing this as "can not repeat" and re-opening if the same circumstances are seen on a newer version (5.7 or 8.0).

That's probably the best way forward.
[14 Feb 2019 15:15] MySQL Verification Team
Hi Simon,

I don't remember seeing a fix for this in any of the releases since then, so I hoped you might remember some of the details :D so that I could put the latest versions to the test.

I'll close this now as "can't repeat"; if anyone hits it again, they should reopen it :)

all best
bogdan