Bug #69300 Slave IO_THREAD Loses State When Connection is Killed on Master
Submitted: 22 May 2013 3:26    Modified: 11 Sep 2013 4:11
Reporter: Jervin R
Status: Closed
Category: MySQL Server: Replication
Severity: S3 (Non-critical)
Version: 5.5.31
OS: Any
Assigned to:
CPU Architecture: Any

[22 May 2013 3:26] Jervin R
Description:
When the IO_THREAD's connection is killed on the master, the slave somehow loses state; that is, it does not seem to maintain a persistent connection to the master and instead times out and reconnects based on slave_net_timeout.

How to repeat:
Set up a simple master-slave pair. For testing, set slave_net_timeout on the slave to 10 seconds or shorter, then kill the corresponding IO_THREAD connection (the Binlog Dump thread) on the master. With log_warnings = 2 on the master, you should see frequent reconnects of the IO_THREAD.

On the slave:

node2 [localhost] {msandbox} ((none)) > show slave status \G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 127.0.0.1
                  Master_User: rsandbox
                  Master_Port: 55311
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000001
          Read_Master_Log_Pos: 4205
               Relay_Log_File: mysql_sandbox55312-relay-bin.000002
                Relay_Log_Pos: 2302
        Relay_Master_Log_File: mysql-bin.000001
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 4205
              Relay_Log_Space: 2471
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 101
1 row in set (0.00 sec)

node2 [localhost] {msandbox} ((none)) > set global slave_net_timeout = 10;
Query OK, 0 rows affected, 1 warning (0.00 sec)

node2 [localhost] {msandbox} ((none)) > 

On the master:

node1 [localhost] {msandbox} ((none)) > set global log_warnings = 2;
Query OK, 0 rows affected (0.00 sec)

node1 [localhost] {msandbox} ((none)) > show processlist;
+----+-------------+-----------------+------+-------------+------+-----------------------------------------------------------------------------+------------------+
| Id | User        | Host            | db   | Command     | Time | State                                                                       | Info             |
+----+-------------+-----------------+------+-------------+------+-----------------------------------------------------------------------------+------------------+
|  6 | system user |                 | NULL | Connect     |  325 | Waiting for master to send event                                            | NULL             |
|  7 | system user |                 | NULL | Connect     |  325 | Slave has read all relay log; waiting for the slave I/O thread to update it | NULL             |
|  8 | rsandbox    | localhost:44899 | NULL | Binlog Dump |  325 | Master has sent all binlog to slave; waiting for binlog to be updated       | NULL             |
|  9 | msandbox    | localhost       | NULL | Query       |    0 | NULL                                                                        | show processlist |
+----+-------------+-----------------+------+-------------+------+-----------------------------------------------------------------------------+------------------+
4 rows in set (0.00 sec)

node1 [localhost] {msandbox} ((none)) > kill 8;
Query OK, 0 rows affected (0.00 sec)

node1 [localhost] {msandbox} ((none)) > show processlist;
+----+-------------+-----------------+------+-------------+------+-----------------------------------------------------------------------------+------------------+
| Id | User        | Host            | db   | Command     | Time | State                                                                       | Info             |
+----+-------------+-----------------+------+-------------+------+-----------------------------------------------------------------------------+------------------+
|  6 | system user |                 | NULL | Connect     |  337 | Waiting for master to send event                                            | NULL             |
|  7 | system user |                 | NULL | Connect     |  337 | Slave has read all relay log; waiting for the slave I/O thread to update it | NULL             |
|  9 | msandbox    | localhost       | NULL | Query       |    0 | NULL                                                                        | show processlist |
| 10 | rsandbox    | localhost:44906 | NULL | Binlog Dump |    2 | Master has sent all binlog to slave; waiting for binlog to be updated       | NULL             |
+----+-------------+-----------------+------+-------------+------+-----------------------------------------------------------------------------+------------------+
4 rows in set (0.00 sec)

node1 [localhost] {msandbox} ((none)) > \q
Bye
[revin@forge rcsandbox_5_5_31]$ tail -f node1/data/msandbox.err
130521 23:15:33 [Note] Server socket created on IP: '127.0.0.1'.
130521 23:15:33 [Note] Event Scheduler: Loaded 0 events
130521 23:15:33 [Note] /wok/bin/mysql/5.5.31/bin/mysqld: ready for connections.
Version: '5.5.31-log'  socket: '/tmp/mysql_sandbox55311.sock'  port: 55311  MySQL Community Server (GPL)
130521 23:15:36 [Warning] Neither --relay-log nor --relay-log-index were used; so replication may break when this MySQL server acts as a slave and has his hostname changed!! Please use '--relay-log=mysql_sandbox55311-relay-bin' to avoid this problem.
130521 23:15:36 [Note] 'CHANGE MASTER TO executed'. Previous state master_host='', master_port='3306', master_log_file='', master_log_pos='4'. New state master_host='127.0.0.1', master_port='55312', master_log_file='', master_log_pos='4'.
130521 23:15:36 [Note] Slave SQL thread initialized, starting replication in log 'FIRST' at position 0, relay log './mysql_sandbox55311-relay-bin.000001' position: 4
130521 23:15:36 [Note] Slave I/O thread: connected to master 'rsandbox@127.0.0.1:55312',replication started in log 'FIRST' at position 4
130521 23:21:11 [Note] Start binlog_dump to slave_server(102), pos(mysql-bin.000001, 4205)
130521 23:21:21 [Note] Start binlog_dump to slave_server(102), pos(mysql-bin.000001, 4205)
130521 23:21:31 [Note] Start binlog_dump to slave_server(102), pos(mysql-bin.000001, 4205)
130521 23:21:41 [Note] Start binlog_dump to slave_server(102), pos(mysql-bin.000001, 4205)
^C
[revin@forge rcsandbox_5_5_31]$ 

Suggested fix:
Restarting the slave in this case fixes the problem.
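A more targeted workaround appears to be restarting just the I/O thread rather than the whole slave; a sketch, assuming a standard 5.5 slave:

```sql
-- Workaround sketch: restart only the replication I/O thread.
STOP SLAVE IO_THREAD;
START SLAVE IO_THREAD;
-- Then confirm the connection state:
SHOW SLAVE STATUS\G
```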
[8 Jul 2013 14:05] Umesh Shastry
Hello Jervin,

Thank you for the bug report. 
Verified as described on the recent 5.5.32 version.

Thanks,
Umesh
[10 Sep 2013 9:32] Bill Qu
This is not a bug. A new dump thread is created on the master for every request from the slave I/O thread, and replication works fine. The slave I/O thread tries to reconnect to the master once per slave_net_timeout, so after SET GLOBAL slave_net_timeout = 10, you see a reconnection from the slave I/O thread every 10 seconds.

For slave_net_timeout, see http://dev.mysql.com/doc/refman/5.5/en/replication-options-slave.html
--slave-net-timeout=seconds
The number of seconds to wait for more data from the master before the slave considers the connection broken, aborts the read, and tries to reconnect. The first retry occurs immediately after the timeout. The interval between retries is controlled by the MASTER_CONNECT_RETRY option for the CHANGE MASTER TO statement, and the number of reconnection attempts is limited by the --master-retry-count option. The default is 3600 seconds (one hour).
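As a concrete illustration of the knobs described above (the values are examples, not recommendations):

```sql
-- On the slave: raise the idle timeout back toward the default (3600 s).
SET GLOBAL slave_net_timeout = 3600;
-- The interval between retries (after the first, immediate one) is set
-- per CHANGE MASTER TO:
STOP SLAVE;
CHANGE MASTER TO MASTER_CONNECT_RETRY = 60;
START SLAVE;
```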
[11 Sep 2013 4:11] Jervin R
Dao,

I think this is a bug: when the connection has been re-established, isn't the heartbeat supposed to keep the replication connection alive and prevent the IO_THREAD from going idle?
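For reference, the heartbeat I mean here is the one configured via MASTER_HEARTBEAT_PERIOD (which defaults to slave_net_timeout / 2); a sketch of setting it explicitly:

```sql
-- Sketch: configure an explicit heartbeat period shorter than
-- slave_net_timeout, so the master pings the slave during idle periods
-- instead of the slave reconnecting on timeout.
STOP SLAVE;
CHANGE MASTER TO MASTER_HEARTBEAT_PERIOD = 5;
START SLAVE;
```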