MySQL Bugs: #27142: Slaves incorrectly report sync status when losing TCP connectivity from master

Bug #27142	Slaves incorrectly report sync status when losing TCP connectivity from master
Submitted:	14 Mar 2007 17:08	Modified:	4 Sep 2007 15:34
Reporter:	Guillaume Lefranc
Status:	Duplicate
Category:	MySQL Server: Replication	Severity:	S2 (Serious)
Version:	5.0.32-log	OS:	Linux (Ubuntu 6.06 LTS x86_64)
Assigned to:	Assigned Account	CPU Architecture:	Any

Description:
When losing network connectivity between a master and a slave (for example, with a switch outage), the slave reports incorrectly synchronized status (seconds behind master = 0).
When SHOW SLAVE STATUS is run repeatedly, the log position values do not advance.
Resuming network connectivity does not update the connection and the slaves won't resume replication.
A possible workaround is to issue STOP SLAVE and then START SLAVE on every slave, after which the replication status returns to normal.

How to repeat:
In a switched networking environment, simulate a loss of connectivity between master and slave (power down the switch?)

Suggested fix:
One of our network engineers suggests that the issue may be related to TCP stack handling, since the connection to the master might stay in the ESTABLISHED state indefinitely, especially if you don't use Keep-Alive connections. The only way to shut down the connection is then to issue a STOP SLAVE command.
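
To illustrate what "Keep-Alive" means at the socket level, here is a minimal Python sketch that enables TCP keep-alive probing on a client socket. It is not taken from the MySQL source and does not describe how mysqld opens its replication connection. The function name and the idle/interval/probes values are assumptions chosen for the example, and the per-socket tuning options are platform-specific (they are available on Linux).

```python
import socket


def make_keepalive_socket(idle=60, interval=10, probes=5):
    """Return a TCP socket with keep-alive probing enabled.

    idle, interval and probes are illustrative values, not MySQL defaults.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Ask the kernel to probe an idle connection instead of trusting it forever.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The tuning options below do not exist on every platform, so each one
    # is applied only if this Python build exposes it.
    for name, value in (("TCP_KEEPIDLE", idle),
                        ("TCP_KEEPINTVL", interval),
                        ("TCP_KEEPCNT", probes)):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
    return sock
```

With probing enabled, a peer that has silently disappeared is eventually detected by the kernel and the connection is reported as broken, rather than staying in the ESTABLISHED state.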

Thank you for the bug report.

mysql> show slave status\G
*************************** 1. row ***************************
             Slave_IO_State: Waiting for master to send event
                Master_Host: 192.168.0.33
                Master_User: miguel
                Master_Port: 3306
              Connect_Retry: 60
            Master_Log_File: binlog.000003
        Read_Master_Log_Pos: 541
             Relay_Log_File: skybr-relay-bin.000002
              Relay_Log_Pos: 439
      Relay_Master_Log_File: binlog.000003
           Slave_IO_Running: Yes
          Slave_SQL_Running: Yes
            Replicate_Do_DB:
        Replicate_Ignore_DB:
         Replicate_Do_Table:
     Replicate_Ignore_Table:
    Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
                 Last_Errno: 0
                 Last_Error:
               Skip_Counter: 0
        Exec_Master_Log_Pos: 541
            Relay_Log_Space: 439
            Until_Condition: None
             Until_Log_File:
              Until_Log_Pos: 0
         Master_SSL_Allowed: No
         Master_SSL_CA_File:
         Master_SSL_CA_Path:
            Master_SSL_Cert:
          Master_SSL_Cipher:
             Master_SSL_Key:
      Seconds_Behind_Master: 0
1 row in set (0.00 sec)

mysql> stop slave
    -> ;
Query OK, 0 rows affected (0.01 sec)

mysql> start slave;
Query OK, 0 rows affected (0.00 sec)

mysql> show slave status\G
*************************** 1. row ***************************
             Slave_IO_State: Connecting to master
                Master_Host: 192.168.0.33
                Master_User: miguel
                Master_Port: 3306
              Connect_Retry: 60
            Master_Log_File: binlog.000003
        Read_Master_Log_Pos: 631
             Relay_Log_File: skybr-relay-bin.000002
              Relay_Log_Pos: 529
      Relay_Master_Log_File: binlog.000003
           Slave_IO_Running: No
          Slave_SQL_Running: Yes
            Replicate_Do_DB:
        Replicate_Ignore_DB:
         Replicate_Do_Table:
     Replicate_Ignore_Table:
    Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
                 Last_Errno: 0
                 Last_Error:
               Skip_Counter: 0
        Exec_Master_Log_Pos: 631
            Relay_Log_Space: 529
            Until_Condition: None
             Until_Log_File:
              Until_Log_Pos: 0
         Master_SSL_Allowed: No
         Master_SSL_CA_File:
         Master_SSL_CA_Path:
            Master_SSL_Cert:
          Master_SSL_Cipher:
             Master_SSL_Key:
      Seconds_Behind_Master: NULL
1 row in set (0.00 sec)

mysql> show slave status\G
*************************** 1. row ***************************
             Slave_IO_State: Waiting for master to send event
                Master_Host: 192.168.0.33
                Master_User: miguel
                Master_Port: 3306
              Connect_Retry: 60
            Master_Log_File: binlog.000003
        Read_Master_Log_Pos: 631
             Relay_Log_File: skybr-relay-bin.000003
              Relay_Log_Pos: 232
      Relay_Master_Log_File: binlog.000003
           Slave_IO_Running: Yes
          Slave_SQL_Running: Yes
            Replicate_Do_DB:
        Replicate_Ignore_DB:
         Replicate_Do_Table:
     Replicate_Ignore_Table:
    Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
                 Last_Errno: 0
                 Last_Error:
               Skip_Counter: 0
        Exec_Master_Log_Pos: 631
            Relay_Log_Space: 232
            Until_Condition: None
             Until_Log_File:
              Until_Log_Pos: 0
         Master_SSL_Allowed: No
         Master_SSL_CA_File:
         Master_SSL_CA_Path:
            Master_SSL_Cert:
          Master_SSL_Cipher:
             Master_SSL_Key:
      Seconds_Behind_Master: 0
1 row in set (0.00 sec)

Thank you for the report!

There are two issues. 
Wrt "slave reports incorrectly synchronized status (seconds behind master = 0)"
there is Bug #29309 Incorrect "Seconds_Behind_Master" value.

Wrt "Resuming network connectivity does not update the connection and the slaves won't resume replication" I think that's not a bug.
If the slave receives nothing from the master for longer than slave_net_timeout,
it disconnects and tries to reconnect. If reconnecting fails, the status should be as reported:

  Slave_IO_Running: No

Stopping the slave's I/O thread, and thereafter resetting the slave's status, was done on purpose to alert the user, e.g. via monitoring tools.
Yes, indeed, the user has to invoke `start slave' manually.
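
Because a dropped link can leave Slave_IO_Running at Yes and Seconds_Behind_Master at 0, a monitoring tool may want a check that does not rely on those two fields alone. The following is a rough Python sketch of one such heuristic, assuming the tool has already fetched two consecutive SHOW SLAVE STATUS rows from the slave and the File and Position values from SHOW MASTER STATUS on the master. The function names are hypothetical, and how the rows are fetched is left out.

```python
def read_coords(status):
    """Binary log coordinates the slave I/O thread has read up to."""
    return (status["Master_Log_File"], int(status["Read_Master_Log_Pos"]))


def io_thread_looks_stalled(previous, current, master_file, master_pos):
    """Heuristic: True if the I/O thread claims to run but makes no progress.

    previous and current are dicts built from two consecutive
    SHOW SLAVE STATUS rows; master_file and master_pos come from
    SHOW MASTER STATUS (File, Position) on the master.
    """
    if current.get("Slave_IO_Running") != "Yes":
        # A stopped thread is already visible in the status output.
        return False
    unchanged = read_coords(previous) == read_coords(current)
    behind = read_coords(current) != (master_file, int(master_pos))
    return unchanged and behind


# Sample values borrowed from the transcripts above, for illustration only.
earlier = {"Slave_IO_Running": "Yes",
           "Master_Log_File": "binlog.000003",
           "Read_Master_Log_Pos": 541}
later = dict(earlier)
print(io_thread_looks_stalled(earlier, later, "binlog.000003", 631))
```

A single positive result is not proof of a stall, since a slave that is simply catching up slowly can look the same between two closely spaced polls; a result that persists over several polling intervals is a stronger signal.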

Since the first issue is the only one to handle, I am setting the status to Duplicate (Bug #29309).