Description:
Connection failures that cause the I/O thread to stop lead to a situation where replication does not continue once the connection is restored, even though SHOW SLAVE STATUS and the slave.err file show replication as resumed and no error messages are provided.
In my case, a few hours after the restart of replication the slave showed:
mysql> show slave status\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: i5os-01.mysql.com
Master_User: root
Master_Port: 9306
Connect_Retry: 1
Master_Log_File: master-bin.000005
Read_Master_Log_Pos: 641127147
Relay_Log_File: slave-relay-bin.000015
Relay_Log_Pos: 641127174
Relay_Master_Log_File: master-bin.000005
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 641127036
Relay_Log_Space: 641127285
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 0
1 row in set (0.00 sec)
mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| mysql              |
| systest1           |
| test               |
+--------------------+
4 rows in set (0.00 sec)
While the master showed:
mysql> show master status;
+-------------------+-----------+--------------+------------------+
| File              | Position  | Binlog_Do_DB | Binlog_Ignore_DB |
+-------------------+-----------+--------------+------------------+
| master-bin.000009 | 206637528 |              |                  |
+-------------------+-----------+--------------+------------------+
1 row in set (0.02 sec)
mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| mysql              |
| omer1              |
| systest1           |
| test               |
+--------------------+
5 rows in set (0.00 sec)
Note the differences in the master binlog file number and position, and the database (omer1) that is missing on the slave.
This problem was observed while running replication between different geographical locations (Sweden and Cupertino) and emulating packet loss on the line.
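As a side note, a quick way to spot this kind of silent stall is to compare the binlog coordinates on both sides. The following is only a rough sketch of the check I ran by hand; the host names and credentials are made up and need to be replaced:
#!/bin/sh
# Hypothetical hosts and credentials -- substitute the real master and slave.
MASTER_HOST=master.example.com
SLAVE_HOST=slave.example.com
MYSQL_USER=root
MYSQL_PWD=secret
export MYSQL_PWD

# Coordinates the master is currently writing to.
mysql -h "$MASTER_HOST" -u "$MYSQL_USER" -e 'SHOW MASTER STATUS\G' \
  | grep -E 'File|Position'

# Coordinates the slave has read/executed. If Master_Log_File and
# Read_Master_Log_Pos stop advancing while the master moves on, the slave
# is stalled even though both threads report "Yes".
mysql -h "$SLAVE_HOST" -u "$MYSQL_USER" -e 'SHOW SLAVE STATUS\G' \
  | grep -E 'Master_Log_File|Read_Master_Log_Pos|Slave_IO_Running|Slave_SQL_Running'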
How to repeat:
1. Set up replication between a master and a slave in different geographical
locations (in this case a master in Cupertino replicating to a slave in
Sweden).
2. Emulate packet loss on the line using iptables (the following rule was set
on the 'slave' box):
iptables -A drop_test -p tcp -m limit --limit 1/s --limit-burst 1 \
    --source <master_ip> --source-port <master_port> -j DROP
and point the INPUT/OUTPUT/FORWARD default rules to it (a complete sketch of
this setup is included at the end of this section).
3. After some time (can be hours), replication will stop with a communication
error.
4. Stop the packet drop by removing the above rule:
iptables -D drop_test -p tcp -m limit --limit 1/s --limit-burst 1 \
    --source <master_ip> --source-port <master_port> -j DROP
5. Restart replication by connecting to the slave with the mysql client and
issuing 'STOP SLAVE' followed by 'START SLAVE'.
6. You will note that replication has started and shows as running in
'SHOW SLAVE STATUS'. However, nothing is replicated; the slave does not catch
up on changes made on the master.
For more details, check the master and slave log directories (location posted below).
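For completeness, the packet-loss setup referenced in step 2 looked roughly like the following. This is a sketch rather than the exact commands used; it assumes a user-defined chain called drop_test and <master_ip>/<master_port> filled in with real values:
# Create the user-defined chain and route traffic through it from the
# built-in chains (the test pointed INPUT/OUTPUT/FORWARD at it).
iptables -N drop_test
iptables -A INPUT   -j drop_test
iptables -A OUTPUT  -j drop_test
iptables -A FORWARD -j drop_test

# Drop at most one matching packet per second coming from the master
# connection, which emulates light packet loss on the replication link.
iptables -A drop_test -p tcp -m limit --limit 1/s --limit-burst 1 \
    --source <master_ip> --source-port <master_port> -j DROP

# To stop the emulation later (step 4), delete the same rule.
iptables -D drop_test -p tcp -m limit --limit 1/s --limit-burst 1 \
    --source <master_ip> --source-port <master_port> -j DROP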
Suggested fix:
When the connection is restored and the slave is restarted, replication should resume and the slave should catch up with the master.