Description:
Connection failures that cause the I/O thread to stop lead to a situation where replication does not continue once the connection is restored, even though SHOW SLAVE STATUS and the slave.err file show replication as resumed and no error messages are provided.
In my case, a few hours after the restart of replication the slave showed:
mysql> show slave status\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: i5os-01.mysql.com
Master_User: root
Master_Port: 9306
Connect_Retry: 1
Master_Log_File: master-bin.000005
Read_Master_Log_Pos: 641127147
Relay_Log_File: slave-relay-bin.000015
Relay_Log_Pos: 641127174
Relay_Master_Log_File: master-bin.000005
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 641127036
Relay_Log_Space: 641127285
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 0
1 row in set (0.00 sec)
mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| mysql              |
| systest1           |
| test               |
+--------------------+
4 rows in set (0.00 sec)
While the master showed:
mysql> show master status;
+-------------------+-----------+--------------+------------------+
| File              | Position  | Binlog_Do_DB | Binlog_Ignore_DB |
+-------------------+-----------+--------------+------------------+
| master-bin.000009 | 206637528 |              |                  |
+-------------------+-----------+--------------+------------------+
1 row in set (0.02 sec)
mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| mysql              |
| omer1              |
| systest1           |
| test               |
+--------------------+
5 rows in set (0.00 sec)
Note the differences in the master binlog file number and position, and the database (omer1) that is missing on the slave.
This problem was observed while running replication between different geographical locations (Sweden and Cupertino) and emulating packet loss on the line.
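As a side note, a quick way to spot this kind of silent stall is to compare the binlog coordinates on both sides. The following is only a rough sketch of the check I ran by hand; the host names and credentials are made up and need to be replaced:
#!/bin/sh
# Hypothetical hosts and credentials -- substitute the real master and slave.
MASTER_HOST=master.example.com
SLAVE_HOST=slave.example.com
MYSQL_USER=root
MYSQL_PWD=secret
export MYSQL_PWD

# Coordinates the master is currently writing to.
mysql -h "$MASTER_HOST" -u "$MYSQL_USER" -e 'SHOW MASTER STATUS\G' \
  | grep -E 'File|Position'

# Coordinates the slave has read/executed. If Master_Log_File and
# Read_Master_Log_Pos stop advancing while the master moves on, the slave
# is stalled even though both threads report "Yes".
mysql -h "$SLAVE_HOST" -u "$MYSQL_USER" -e 'SHOW SLAVE STATUS\G' \
  | grep -E 'Master_Log_File|Read_Master_Log_Pos|Slave_IO_Running|Slave_SQL_Running'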
How to repeat:
1. Set up replication between a master and a slave in different geographical
locations (in this case a master in Cupertino replicating to a slave in
Sweden).
2. Emulate packet loss on the line using iptables (the following rule was set
on the 'slave' box):
iptables -A drop_test -p tcp -m limit --limit 1/s --limit-burst 1 \
    --source <master_ip> --source-port <master_port> -j DROP
and point the INPUT/OUTPUT/FORWARD default rules to it (a complete sketch of
this setup is included at the end of this section).
3. After some time (can be hours), replication will stop with a communication
error.
4. Stop the packet drop by removing the above rule:
iptables -D drop_test -p tcp -m limit --limit 1/s --limit-burst 1 \
    --source <master_ip> --source-port <master_port> -j DROP
5. Restart replication by connecting to the slave with the mysql client and
issuing 'STOP SLAVE' followed by 'START SLAVE'.
6. You will note that replication has started and shows as running in
'SHOW SLAVE STATUS'. However, nothing is replicated; the slave does not catch
up on changes made on the master.
For more details, check the master and slave log directories (location posted below).
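For completeness, the packet-loss setup referenced in step 2 looked roughly like the following. This is a sketch rather than the exact commands used; it assumes a user-defined chain called drop_test and <master_ip>/<master_port> filled in with real values:
# Create the user-defined chain and route traffic through it from the
# built-in chains (the test pointed INPUT/OUTPUT/FORWARD at it).
iptables -N drop_test
iptables -A INPUT   -j drop_test
iptables -A OUTPUT  -j drop_test
iptables -A FORWARD -j drop_test

# Drop at most one matching packet per second coming from the master
# connection, which emulates light packet loss on the replication link.
iptables -A drop_test -p tcp -m limit --limit 1/s --limit-burst 1 \
    --source <master_ip> --source-port <master_port> -j DROP

# To stop the emulation later (step 4), delete the same rule.
iptables -D drop_test -p tcp -m limit --limit 1/s --limit-burst 1 \
    --source <master_ip> --source-port <master_port> -j DROP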
Suggested fix:
When the connection is restored and the slave is restarted, replication should resume and the slave should catch up with the master.