Bug #74908 Unable to detect network timeout in 5.6 when using SSL (regression from 5.5)
Submitted: 17 Nov 2014 22:58
Modified: 7 Jul 2015 8:39
Reporter: Paul Kreiner
Status: Verified
Category: MySQL Server: Replication
Severity: S2 (Serious)
Version: 5.6.21, 5.6.27
OS: Linux (Ubuntu 14.04 LTS x86_64)
Assigned to:
CPU Architecture: Any
Tags: replication, SSL

[17 Nov 2014 22:58] Paul Kreiner
Description:
Summary:
When network connectivity is lost between replicating servers and the replication channel uses SSL, MySQL 5.6.21 does not detect the loss, although 5.6.2-m5 does. This happens regardless of the value of the "slave_net_timeout" variable.

I tested this using binary .tar.gz builds downloaded from Oracle, for MySQL versions 5.6.2-m5, 5.6.3-m6, and 5.6.21. The regression was introduced between 5.6.2 and 5.6.3; 5.6.2 works as expected. The regression is still present in the current GA release (5.6.21).

This bug was also reported to the MariaDB project as a regression from the 5.5 codebase.  See: https://mariadb.atlassian.net/browse/MDEV-7111 for details, including an example my.cnf.

Impact:
This makes replication markedly less reliable when using SSL, since the server cannot recover on its own from certain network-outage events, so network-related outage windows become artificially longer than they need to be. It also means that, contrary to documentation, the MySQL server ignores the slave_net_timeout variable when SSL replication is used, even though "show variables like 'slave_net_timeout'" reports the configured value correctly (i.e. it's not a configuration-parsing problem).

A possible workaround is to use third-party tools (pt-heartbeat and Nagios, for example) to monitor replication variables and issue commands to restart the slave when slave_net_timeout seconds have passed without seeing new traffic, since the MySQL server itself is broken in this regard for versions 5.6.3 and later.  Another alternative is to disable SSL for replication.  This is not an option for many use cases, due to the need to ensure sensitive business/customer data is secured while in transit over the network.  A third option is to stick with MySQL 5.5 if secure replication is important.
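The first workaround (an external watchdog) could look roughly like the sketch below. It is only an illustration of the stall-detection logic, not a tested tool: the StallDetector class and its field names are made up for this example, and it assumes the caller polls the slave status periodically (for instance Master_Log_File and Read_Master_Log_Pos from SHOW SLAVE STATUS) and restarts the I/O thread itself when the detector fires. It also assumes the master produces steady traffic (e.g. via pt-heartbeat); on an idle master an unchanged position is normal.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical helper for this example; not part of MySQL or any toolkit.
Position = Tuple[str, int]  # (binlog file name, read position)


@dataclass
class StallDetector:
    """Flags a stall when the read position stops advancing for too long."""
    timeout_s: float                        # set this to slave_net_timeout
    last_pos: Optional[Position] = None     # most recent position seen
    last_change: Optional[float] = None     # time that position first appeared

    def observe(self, pos: Position, now: float) -> bool:
        """Record one sample; return True if the slave looks stalled."""
        if self.last_pos is None or pos != self.last_pos:
            # First sample, or the position moved: traffic is flowing.
            self.last_pos = pos
            self.last_change = now
            return False
        # Position unchanged: stalled once timeout_s has elapsed.
        return (now - self.last_change) >= self.timeout_s

    def reset(self) -> None:
        """Forget history, e.g. after the caller restarted the I/O thread."""
        self.last_pos = None
        self.last_change = None


if __name__ == "__main__":
    detector = StallDetector(timeout_s=60)
    # (seconds since start, read position) samples taken every 30 seconds.
    for now, pos in [(0, 100), (30, 250), (60, 250), (90, 250)]:
        stalled = detector.observe(("mysql-bin.000001", pos), now)
        print(now, pos, "STALLED" if stalled else "ok")
```

When observe() returns True, the caller would stop and start the slave I/O thread and then call reset() so the next sample starts a fresh window.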

How to repeat:
Steps to reproduce (functional case):
- Set up a MySQL slave running 5.6.2-m5.  Ensure binlog replication is working and SSL-encrypted.
- Start generating traffic on the master. Watch the slave status to see the traffic is being replicated successfully.
- Simulate a network failure, e.g. "iptables -I INPUT -s <master_ip> -j DROP" on the slave. This drops all network packets from the master host.
- Wait for slave_net_timeout seconds to pass. The slave will restart as documented, and the slave status will now state that it is attempting to reconnect to the master.

Steps to reproduce (broken case):
- Set up a MySQL slave running 5.6.21.  Ensure binlog replication is working and SSL-encrypted.
- Start generating traffic on the master. Watch the slave status to see the traffic is being replicated successfully.
- Simulate a network failure, e.g. "iptables -I INPUT -s <master_ip> -j DROP" on the slave. This drops all network packets from the master host.
- Wait for slave_net_timeout seconds to pass. The slave status will continue to state "waiting for master to send event", even though the log position counters are not advancing. The slave will remain in this state until the slave is stopped and restarted – it will not restart on its own, contrary to documentation. This is a change in behavior from MySQL 5.6.2 and 5.5.x, and appears to be incorrect behavior.

Suggested fix:
Ensure the slave_net_timeout variable is honored regardless of whether SSL is used for replication traffic. Specifically, if no binlog traffic has been received from the master in <slave_net_timeout> seconds, tear down the network connection and attempt to restart the replication slave thread.
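
To illustrate the requested behavior (and only that; this is not the server's code, which is C/C++), here is a minimal Python sketch of a read that gives up after a timeout. It uses a plain local socket pair with no TLS, and read_event and demo are names invented for this example. The point is that a read on a silent connection should return control to the caller after the timeout, so the caller can tear the connection down and reconnect.

```python
import socket


def read_event(sock, timeout_s, nbytes=4096):
    """Return received bytes, or None if nothing arrived within timeout_s.

    An empty bytes object means the peer closed the connection.
    """
    sock.settimeout(timeout_s)
    try:
        return sock.recv(nbytes)
    except socket.timeout:
        # No data within timeout_s: the caller should drop the
        # connection and reconnect instead of waiting forever.
        return None


def demo():
    """Show one successful read followed by one timed-out read."""
    receiver, sender = socket.socketpair()
    try:
        sender.sendall(b"event")
        first = read_event(receiver, 0.2)    # data is available
        second = read_event(receiver, 0.2)   # peer is silent, read times out
        return first, second
    finally:
        receiver.close()
        sender.close()


if __name__ == "__main__":
    print(demo())
```

In the broken case described above, the slave behaves as if the second read never returned.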
[17 Nov 2014 23:03] Paul Kreiner
Based on feedback from the MariaDB team, this particular check-in possibly introduced the bug into the MySQL codebase:

revno: 3134
revision-id: davi.arnaut@oracle.com-20110531135209-8kxz4np8c4gav6s2
parent: jimmy.yang@oracle.com-20110531093059-3x1f93rnspltp3h6
committer: Davi Arnaut <davi.arnaut@oracle.com>
branch nick: 11762221-trunk
timestamp: Tue 2011-05-31 10:52:09 -0300
message:
  Bug#11762221 - 54790: Use of non-blocking mode for sockets limits performance
  Bug#11758972 - 51244: wait_timeout fails on OpenSolaris
<snip>
[8 May 2015 21:33] Kristian McColm
Confirmed on 

# mysqld --version
mysqld  Ver 5.6.24-enterprise-commercial-advanced for Linux on x86_64 (MySQL Enterprise Server - Advanced Edition (Commercial))
[7 Jul 2015 8:39] Umesh Shastry
Hello Paul Kreiner,

Thank you for the report.
Observed this with latest 5.6 builds.

Thanks,
Umesh