| Bug #78674 | Slave does not always reconnect properly after a master disconnect | | |
|---|---|---|---|
| Submitted: | 1 Oct 2015 15:37 | Modified: | 14 Feb 2019 15:15 |
| Reporter: | Simon Mudd (OCA) | Email Updates: | |
| Status: | Can't repeat | Impact on me: | |
| Category: | MySQL Server: Replication | Severity: | S3 (Non-critical) |
| Version: | 5.6.25 | OS: | Red Hat (CentOS 6) |
| Assigned to: | MySQL Verification Team | CPU Architecture: | Any |
[12 Feb 2019 21:54]
MySQL Verification Team
Hi Simon, I have seen this personally, live, more than once: the slave stuck in the "Connecting" state. Is this what you are seeing, or does it go from "Connecting" to just "No"? In order to debug this we'd need a way to reproduce it on demand, but I was never able to; I've seen it a number of times between two data centers. Can you shed a bit of light on the infrastructure where you see this?
- LAN, or something else?
- Same switch, or not?
- High transaction volume?
- Big transactions?
- Kernel parameters?
- Does it happen with SSL as well? (I was never able to reproduce it with SSL enabled.)
thanks Bogdan
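For anyone attempting the on-demand reproduction Bogdan asks about, one way to at least exercise the slave's reconnect path is to kill the dump thread on the master, after first shrinking the slave's retry window. This is a sketch, not from the original report: the retry values are illustrative, and whether the I/O thread then actually sticks in "Connecting" appears to be timing-dependent, per the discussion above.

```sql
-- On the slave: shrink the retry window so reconnect behaviour is quick to
-- observe. (The report shows Connect_Retry=60 and Master_Retry_Count=86400;
-- these small values are purely for testing.)
STOP SLAVE;
CHANGE MASTER TO MASTER_CONNECT_RETRY = 5, MASTER_RETRY_COUNT = 10;
START SLAVE;

-- On the master: kill the dump thread serving this slave, which the slave
-- observes as a failed master event read.
SELECT id, host, command
  FROM information_schema.processlist
 WHERE command LIKE 'Binlog Dump%';
KILL 12345;  -- substitute the id returned above

-- Back on the slave: check whether Slave_IO_Running returns to Yes or
-- remains stuck in Connecting, as described in this bug.
SHOW SLAVE STATUS\G
```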
[14 Feb 2019 8:23]
Simon Mudd
Given that I reported this 4 years ago and I no longer run MySQL 5.6, I'm not going to be able to reproduce this now. I don't remember seeing it frequently and I haven't seen it for some time. So maybe it's worthwhile closing this as "can't repeat" and re-opening if the same circumstances are seen on a newer version (5.7 or 8.0). That's probably the best way forward?
[14 Feb 2019 15:15]
MySQL Verification Team
Hi Simon, I don't remember seeing a fix for this in any of the releases since back then, so I hoped you might remember some of the details :D so that I could put the latest versions to the test. I'll close this now as "can't repeat", and if anyone hits it again they should reopen :) all best bogdan

Description:
After a failed connection to a master, I would expect the slave to time out and then try again. Sometimes this doesn't happen as one would expect.

How to repeat:
```
root@myhost [(none)]> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Reconnecting after a failed master event read
                  Master_Host: some-master.example.com
                  Master_User: some-user
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: binlog.029802
          Read_Master_Log_Pos: 42551433
               Relay_Log_File: relaylog.000009
                Relay_Log_Pos: 42551593
        Relay_Master_Log_File: binlog.029802
             Slave_IO_Running: Connecting  <===========
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 42551433
              Relay_Log_Space: 42553234
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 187225014
                  Master_UUID: 02956150-acc7-11e4-aed7-e4115ba829ae
             Master_Info_File: mysql.slave_master_info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
           Master_Retry_Count: 86400
                  Master_Bind:
      Last_IO_Error_Timestamp:
     Last_SQL_Error_Timestamp:
               Master_SSL_Crl:
           Master_SSL_Crlpath:
           Retrieved_Gtid_Set:
            Executed_Gtid_Set:
                Auto_Position: 0
1 row in set (0.00 sec)
```
I have a heartbeat mechanism: a periodic event is injected into a table on the master, and on the slave I can check how far behind that event is (a minimal sketch of such a heartbeat follows at the end of this report). This shows:
```
$ show_replication_status
Server: some-server
State: Reconnecting after a failed master event read
Master: some-user@some-master.example.com:3306
Slave IO Running: Connecting
Slave SQL Running: Yes
Replication Delay: NULL
Heartbeat Delay: server_id 187225014: 916.94 (15m 16s behind master)
```
So the point is that this happened 15 minutes ago, yet the I/O thread is still stuck in the state "Reconnecting after a failed master event read".

Note: `stop slave; start slave;` resolves the problem. I've seen this behaviour on several servers.

Suggested fix:
In the mysql logs there is no evidence of a continual attempt to connect to the host, and the master is up (other slaves are talking to it). If the connection reattempt fails, it should time out after some (short) period and a new connection attempt should be made.
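For context, a heartbeat of the kind the description mentions can be built in plain SQL. This is a minimal sketch, not the reporter's actual tooling: the table and event names are hypothetical, it assumes `event_scheduler=ON` on the master, MySQL 5.6.4 or later for fractional-second timestamps, and reasonably synchronised clocks between master and slave.

```sql
-- On the master: a one-row-per-server heartbeat table, refreshed every
-- second by an event (hypothetical names; requires event_scheduler=ON).
CREATE TABLE IF NOT EXISTS heartbeat (
  server_id INT UNSIGNED NOT NULL PRIMARY KEY,
  ts        TIMESTAMP(3) NOT NULL
);

CREATE EVENT IF NOT EXISTS heartbeat_tick
  ON SCHEDULE EVERY 1 SECOND
  DO REPLACE INTO heartbeat (server_id, ts) VALUES (@@server_id, NOW(3));

-- On the slave: seconds since the last heartbeat replicated through.
-- Unlike Seconds_Behind_Master (NULL in the status output above), this
-- keeps growing while the I/O thread is wedged in "Connecting", which is
-- how the 15-minute lag was spotted.
SELECT server_id,
       TIMESTAMPDIFF(MICROSECOND, ts, NOW(3)) / 1e6 AS heartbeat_delay_s
  FROM heartbeat;
```

As for the workaround: since only the I/O thread is stuck, `STOP SLAVE IO_THREAD; START SLAVE IO_THREAD;` may be sufficient, though the report only confirms the full `stop slave; start slave;`.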