Bug #72450 Replication down after disabling / enabling network card adapter
Submitted: 25 Apr 2014 8:46
Modified: 16 Jul 2014 14:07
Reporter: Simon Jonhson
Status: Not a Bug
Category: MySQL Server: Replication
Severity: S3 (Non-critical)
Version: 5.6.16
OS: Windows (Windows Server 2008 R2 Standard)
Assigned to:
CPU Architecture: Any

[25 Apr 2014 8:46] Simon Jonhson
Description:
Hi,

I first created a topic on forums.mysql and they told me to come here (see http://forums.mysql.com/read.php?26,612111,612111#msg-612111).

Here is my issue : 

When I disable the network adapter of the primary server in the vSphere client and re-enable it after a few minutes (to simulate a network outage), replication stops working.

How to repeat:
Environment :

- 2 virtual machines on vSphere
- MySQL Server 5.6 installed on both of them

Prerequisites :

- MySQL replication has been set up correctly and is working both ways (primary <-> secondary)

Primary server : 

The changes I made in "my.ini" :

replicate-do-db = dbipt_rc74_25102
binlog-do-db = dbipt_rc74_25102
log-bin = mysql-bin
server-id = 1
binlog-format = ROW

MySQL queries to set up the replication :

FLUSH TABLES WITH READ LOCK;
SHOW MASTER STATUS;
UNLOCK TABLES;
STOP SLAVE;

CHANGE MASTER TO MASTER_HOST='10.14.4.21', 
	MASTER_USER='replication', 
	MASTER_PASSWORD = 'password', 
	MASTER_LOG_FILE = 'mysql-bin.000002', 
	MASTER_LOG_POS = 44357;

START SLAVE;

Secondary server : 

The changes I made in "my.ini" :

replicate-do-db = dbipt_rc74_25102
binlog-do-db = dbipt_rc74_25102
log-bin = mysql-bin
server-id = 2
binlog-format = ROW

MySQL queries to set up the replication :

FLUSH TABLES WITH READ LOCK;
SHOW MASTER STATUS;
UNLOCK TABLES;
STOP SLAVE;

CHANGE MASTER TO MASTER_HOST='10.14.4.20', 
	MASTER_USER='replication', 
	MASTER_PASSWORD = 'password', 
	MASTER_LOG_FILE = 'mysql-bin.000002', 
	MASTER_LOG_POS = 908024;

START SLAVE;

Scenario 1

1) Disable the network card adapter on vSphere for the primary server
2) Wait 2 minutes
3) Add 1000 records inside the secondary database
4) Enable the network card adapter on vSphere for the primary server

Result : The data have been replicated on the primary server and the replication is still up.

Scenario 2

1) Disable the network card adapter on vSphere for the primary server
2) Wait 1 minute
3) Add 1 record inside the secondary database
4) Wait 1 minute
5) Enable the network card adapter on vSphere for the primary server

Result :

- The data have not been replicated on the primary server and the replication is down

- "SHOW SLAVE STATUS;" command on the primary server :

"Slave_IO_State" "Master_Host" "Master_User" "Master_Port" "Connect_Retry" "Master_Log_File" "Read_Master_Log_Pos" "Relay_Log_File" "Relay_Log_Pos" "Relay_Master_Log_File" "Slave_IO_Running" "Slave_SQL_Running" "Replicate_Do_DB" "Replicate_Ignore_DB" "Replicate_Do_Table" "Replicate_Ignore_Table" "Replicate_Wild_Do_Table" "Replicate_Wild_Ignore_Table" "Last_Errno" "Last_Error" "Skip_Counter" "Exec_Master_Log_Pos" "Relay_Log_Space" "Until_Condition" "Until_Log_File" "Until_Log_Pos" "Master_SSL_Allowed" "Master_SSL_CA_File" "Master_SSL_CA_Path" "Master_SSL_Cert" "Master_SSL_Cipher" "Master_SSL_Key" "Seconds_Behind_Master" "Master_SSL_Verify_Server_Cert" "Last_IO_Errno" "Last_IO_Error" "Last_SQL_Errno" "Last_SQL_Error" "Replicate_Ignore_Server_Ids" "Master_Server_Id"
"Waiting for master to send event" "10.14.4.21" "repl" "3306" "60" "mysql-bin.000003" "3346131" "STU1-TSS-1-relay-bin.000003" "1788525" "mysql-bin.000003" "Yes" "Yes" "dbipt_rc74_25102" "" "" "" "" "" "0" "" "0" "3346131" "3189943" "None" "" "0" "No" "" "" "" "" "" "0" "No" "0" "" "0" "" "" "2"

- "SHOW SLAVE STATUS;" command on the secondary server :

"Slave_IO_State" "Master_Host" "Master_User" "Master_Port" "Connect_Retry" "Master_Log_File" "Read_Master_Log_Pos" "Relay_Log_File" "Relay_Log_Pos" "Relay_Master_Log_File" "Slave_IO_Running" "Slave_SQL_Running" "Replicate_Do_DB" "Replicate_Ignore_DB" "Replicate_Do_Table" "Replicate_Ignore_Table" "Replicate_Wild_Do_Table" "Replicate_Wild_Ignore_Table" "Last_Errno" "Last_Error" "Skip_Counter" "Exec_Master_Log_Pos" "Relay_Log_Space" "Until_Condition" "Until_Log_File" "Until_Log_Pos" "Master_SSL_Allowed" "Master_SSL_CA_File" "Master_SSL_CA_Path" "Master_SSL_Cert" "Master_SSL_Cipher" "Master_SSL_Key" "Seconds_Behind_Master" "Master_SSL_Verify_Server_Cert" "Last_IO_Errno" "Last_IO_Error" "Last_SQL_Errno" "Last_SQL_Error" "Replicate_Ignore_Server_Ids" "Master_Server_Id"
"Waiting for master to send event" "10.14.4.20" "repl" "3306" "60" "mysql-bin.000005" "989645" "STU1-TSS-2-relay-bin.000002" "932254" "mysql-bin.000005" "Yes" "Yes" "dbipt_rc74_25102" "" "" "" "" "" "0" "" "0" "989645" "932415" "None" "" "0" "No" "" "" "" "" "" "0" "No" "0" "" "0" "" "" "1"

I know it's difficult to read, but the key values on both servers are :

- Slave_IO_State : Waiting for master to send event
- Slave_IO_Running : Yes
- Slave_SQL_Running : Yes
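
For easier reading, the same check can be run with vertical output. This is a minimal sketch, assuming the mysql command-line client (the \G terminator is a client feature, not server SQL). Comparing positions across the two servers is a suggested diagnostic, not something taken from the output above:

```sql
-- On the server whose replication looks stalled: one field per line.
SHOW SLAVE STATUS\G

-- On the other server (its master for this direction):
SHOW MASTER STATUS;

-- If Position from SHOW MASTER STATUS stays ahead of Read_Master_Log_Pos
-- from SHOW SLAVE STATUS, the I/O thread is probably waiting on a
-- connection that no longer exists, even though both threads report "Yes".
```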

So the issue seems to occur when entries are added in the middle of the network outage, not when they are added just before it is resolved.

I also tried setting "slave-net-timeout = 30", but to no avail: the replication is still down after disabling / enabling the network card adapter.
[25 Apr 2014 8:50] Simon Jonhson
Primary "my.ini" file

Attachment: Primary_my.ini (application/octet-stream, text), 14.04 KiB.

[25 Apr 2014 8:50] Simon Jonhson
Secondary "my.ini" file

Attachment: Secondary_my.ini (application/octet-stream, text), 14.04 KiB.

[25 Apr 2014 9:12] Simon Jonhson
I forgot to mention that, to solve this issue, I have to run "STOP SLAVE;" and "START SLAVE;" on both servers.

After that, everything is back to normal -> replication is working both ways.
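
As statements, the workaround looks roughly like this on each server. It is a sketch only; the final status check is an added suggestion, and \G assumes the mysql command-line client:

```sql
STOP SLAVE;   -- stops both the I/O and SQL threads
START SLAVE;  -- the I/O thread opens a fresh connection to the master
SHOW SLAVE STATUS\G
```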
[25 Apr 2014 20:24] Simon Jonhson
Hi, 

Using the parameter "slave-net-timeout = 30" inside the "my.ini" files seems to work now.

So the scenario is the following : 

1) Disable the network card adapter on vSphere for the primary server
2) Wait 5 minutes
3) Add 1 record inside the secondary database
4) Wait 5 minutes
5) Enable the network card adapter on vSphere for the primary server

Result : After step 5, the replication is fully working again within at most 30 seconds.

Note : 

My issue is solved, but I wonder why the parameter is set to 3600 (1 hour) by default? If I hadn't changed it, it would take 1 hour for the replication to work again, and in a production environment this is not acceptable.
[28 Apr 2014 15:01] Simon Jonhson
Severity has been changed according to the workaround.
[16 Jul 2014 14:07] MySQL Verification Team
Thank you for taking the time to write to us, but this is not a bug.
Replication is asynchronous - slaves need not be connected permanently to receive updates from the master. This means that updates can occur over long-distance connections and even over temporary or intermittent connections such as a dial-up service. Depending on the configuration, you can replicate all databases, selected databases, or even selected tables within a database.

This is by design, so some of the parameters need to be adjusted to suit the environment and requirements.

It is explained here - http://dev.mysql.com/doc/refman/5.6/en/replication-options-slave.html#option_mysqld_slave-... : the first reconnection attempt happens only once slave-net-timeout has elapsed, and only after that does the slave try to reconnect every master-connect-retry seconds. The default value for slave-net-timeout is 3600 seconds.
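
A minimal sketch of adjusting both settings at runtime on MySQL 5.6, assuming an account with the SUPER privilege. The values 30 and 10 are illustrative, not recommendations. The retry interval is the Connect_Retry column of SHOW SLAVE STATUS and is set here with the MASTER_CONNECT_RETRY option of CHANGE MASTER TO:

```sql
-- Current timeout in seconds (default 3600).
SHOW GLOBAL VARIABLES LIKE 'slave_net_timeout';

-- Treat a silent master connection as lost after 30 seconds.
SET GLOBAL slave_net_timeout = 30;

-- CHANGE MASTER TO requires the slave threads to be stopped.
STOP SLAVE;
-- After a detected loss, retry the connection every 10 seconds.
CHANGE MASTER TO MASTER_CONNECT_RETRY = 10;
START SLAVE;  -- the restarted I/O thread should pick up the new timeout
```

SET GLOBAL does not survive a server restart; the "slave-net-timeout = 30" line in my.ini used by the reporter makes the setting permanent.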

Also, see similar issues reported - Bug #11256, Bug #47721, Bug #21491