Description:
Summary:
When simulating a loss of network connectivity between replicating servers, MySQL 5.6.21 does not detect the loss if the replication channel uses SSL, although 5.6.2-m5 does. This happens regardless of the value of the "slave_net_timeout" variable.
I tested this using binary .tar.gz builds downloaded from Oracle, for MySQL versions 5.6.2-m5, 5.6.3-m6, and 5.6.21. Between 5.6.2 and 5.6.3, this regression was introduced -- 5.6.2 works as expected. This regression is still present in the current GA release (5.6.21).
This bug was also reported to the MariaDB project as a regression from the 5.5 codebase. See: https://mariadb.atlassian.net/browse/MDEV-7111 for details, including an example my.cnf.
Impact:
This makes replication markedly less reliable when using SSL, since the server cannot recover on its own from certain network-outage events, and it makes network-related outage windows artificially longer than they need to be. It also means that, contrary to the documentation, the server ignores the slave_net_timeout variable when SSL replication is used, even though "show variables like 'slave_net_timeout'" reports the configured value correctly (i.e. this is not a configuration-parsing problem).
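For reference, the variable can be inspected and changed at runtime on the slave; the 60-second value below is only an example used to shorten the test cycle, not a recommendation:

    mysql> SHOW VARIABLES LIKE 'slave_net_timeout';
    +-------------------+-------+
    | Variable_name     | Value |
    +-------------------+-------+
    | slave_net_timeout | 60    |
    +-------------------+-------+

    -- slave_net_timeout is a dynamic global variable in 5.6
    mysql> SET GLOBAL slave_net_timeout = 60;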
A possible workaround is to use third-party tools (pt-heartbeat and Nagios, for example) to monitor replication and issue commands to restart the slave threads once slave_net_timeout seconds have passed without new traffic, since the MySQL server itself is broken in this regard for versions 5.6.3 and later. Another alternative is to disable SSL for replication; this is not an option for many use cases, because sensitive business/customer data must be secured while in transit over the network. A third option is to stay on MySQL 5.5 if secure replication is important.
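As a very rough sketch of that monitoring workaround (the host name, credentials, and 300-second sampling interval are placeholders; a real deployment would use pt-heartbeat/Nagios rather than this script):

    #!/bin/sh
    # Sample Read_Master_Log_Pos twice; if it has not advanced, assume the
    # I/O thread is stuck and force a reconnect by restarting the slave threads.
    POS1=$(mysql -h slave.example.com -e "SHOW SLAVE STATUS\G" | awk '/Read_Master_Log_Pos/ {print $2}')
    sleep 300
    POS2=$(mysql -h slave.example.com -e "SHOW SLAVE STATUS\G" | awk '/Read_Master_Log_Pos/ {print $2}')
    if [ "$POS1" = "$POS2" ]; then
        mysql -h slave.example.com -e "STOP SLAVE; START SLAVE;"
    fi

This only makes sense when the master is known to be generating traffic continuously (e.g. via pt-heartbeat), since an idle master also leaves the position unchanged.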
How to repeat:
Steps to reproduce (functional case):
- Set up a MySQL slave running 5.6.2-m5. Ensure binlog replication is working and SSL-encrypted (an example SSL setup is sketched after this list).
- Start generating traffic on the master. Watch the slave status to see the traffic is being replicated successfully.
- Simulate a network failure, e.g. "iptables -I INPUT -s <master_ip> -j DROP" on the slave. This drops all network packets from the master host.
- Wait for slave_net_timeout seconds to pass. The slave detects the timeout and reconnects as documented, and the slave status now reports that it is attempting to reconnect to the master.
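For reference, the SSL-encrypted slave can be set up with something along these lines (host, account, and certificate paths are placeholders; binlog coordinates are omitted for brevity):

    mysql> CHANGE MASTER TO
        ->   MASTER_HOST='master.example.com',
        ->   MASTER_USER='repl',
        ->   MASTER_PASSWORD='replpass',
        ->   MASTER_SSL=1,
        ->   MASTER_SSL_CA='/etc/mysql/ssl/ca-cert.pem',
        ->   MASTER_SSL_CERT='/etc/mysql/ssl/client-cert.pem',
        ->   MASTER_SSL_KEY='/etc/mysql/ssl/client-key.pem';
    mysql> START SLAVE;

    -- Master_SSL_Allowed: Yes and Slave_IO_Running/Slave_SQL_Running: Yes
    -- in the output confirm that SSL replication is up
    mysql> SHOW SLAVE STATUS\G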
Steps to reproduce (broken case):
- Set up a MySQL slave running 5.6.21. Ensure binlog replication is working and SSL-encrypted.
- Start generating traffic on the master. Watch the slave status to see the traffic is being replicated successfully.
- Simulate a network failure, e.g. "iptables -I INPUT -s <master_ip> -j DROP" on the slave. This drops all network packets from the master host.
- Wait for slave_net_timeout seconds to pass. The slave status continues to report "Waiting for master to send event" even though the log position counters are not advancing. The slave remains in this state until it is stopped and restarted manually (see the example below) -- it will not reconnect on its own, contrary to documentation. This is a change in behavior from MySQL 5.6.2 and 5.5.x and appears to be incorrect.
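While the outage is simulated, the relevant SHOW SLAVE STATUS fields stay frozen, and restarting the slave threads is what finally forces a reconnect:

    -- Slave_IO_State stays "Waiting for master to send event" and
    -- Read_Master_Log_Pos / Exec_Master_Log_Pos stop advancing:
    mysql> SHOW SLAVE STATUS\G

    -- manual recovery:
    mysql> STOP SLAVE;
    mysql> START SLAVE;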
Suggested fix:
Ensure the slave_net_timeout variable is honored regardless of whether SSL is used for replication traffic. Specifically, if no binlog traffic has been received from the master in <slave_net_timeout> seconds, tear down the network connection and attempt to restart the replication slave threads.
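For reference, the detection mechanism that should be triggering here (and does in 5.6.2 and 5.5) is documented as follows: the slave treats the connection as broken after slave_net_timeout seconds without data and reconnects, while the master sends heartbeat events on an otherwise idle connection every MASTER_HEARTBEAT_PERIOD seconds (default slave_net_timeout/2). The period below is only an illustrative value:

    mysql> STOP SLAVE;
    mysql> CHANGE MASTER TO MASTER_HEARTBEAT_PERIOD = 30;
    mysql> START SLAVE;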