Bug #73257 Slave doesn't detect blocking IPTABLES to master server
Submitted: 10 Jul 2014 11:08    Modified: 27 Sep 2016 13:15
Reporter: Shahriyar Rzayev (OCA)
Status: Not a Bug
Category: MySQL Server: Replication    Severity: S1 (Critical)
Version: 5.6.19    OS: Linux (CentOS 6.5)
Assigned to:    CPU Architecture: Any

[10 Jul 2014 11:08] Shahriyar Rzayev
Description:
Dear experts,
I have encountered an interesting situation with master/slave replication on a production server, and decided to reproduce it in local virtual machines.
Here are all the steps I took to reproduce it:

1.
Two CentOS 6.5 machines with the official Oracle YUM repositories, each with MySQL 5.6.19 GA installed.

192.168.1.99 -> master
192.168.1.88 -> slave

2. Here is master server my.cnf portion related to replication:

# BINARY LOGGING #
server_id                      = 1
log_bin                        = /var/lib/mysql_binlogs/mysql-bin
log_bin_index                  = /var/lib/mysql_binlogs/mysql-bin
expire_logs_days               = 14
sync_binlog                    = 1
binlog_format                  = row
gtid-mode                      = on                             
enforce-gtid-consistency       = true           
master-info-repository         = TABLE            
relay-log-info-repository      = TABLE         
slave-parallel-workers         = 2               
binlog-checksum                = CRC32                  
master-verify-checksum         = 1                
slave-sql-verify-checksum      = 1             
binlog-rows-query-log_events   = 1
log_slave_updates              = 1

Slave's my.cnf portion:

# BINARY LOGGING #

server_id                      = 2
log_bin                        = /var/lib/mysql_binlogs/mysql-bin
log_bin_index                  = /var/lib/mysql_binlogs/mysql-bin
expire_logs_days               = 14
sync_binlog                    = 1
binlog_format                  = row
relay_log                      = /var/lib/mysql_binlogs/mysql-relay-bin
log_slave_updates              = 1
read_only                      = 1
gtid-mode                      = on
enforce-gtid-consistency       = true
master-info-repository         = TABLE
relay-log-info-repository      = TABLE
#slave-parallel-workers         = 2
binlog-checksum                = CRC32
master-verify-checksum         = 1
slave-sql-verify-checksum      = 1
binlog-rows-query-log_events   = 1

3. Turn off iptables on both servers:
fabricsrv -> master -> 192.168.1.99

[root@fabricsrv ~]# service iptables status
iptables: Firewall is not running.

linuxsrv2 -> slave -> 192.168.1.88
[root@linuxsrv2 ~]# service iptables status
iptables: Firewall is not running.

4. On master:

mysql> CREATE USER 'repl'@'%' IDENTIFIED BY '$Slavepass45#';
Query OK, 0 rows affected (0,12 sec)

mysql> GRANT REPLICATION SLAVE,REPLICATION CLIENT ON *.* TO 'repl'@'%';
Query OK, 0 rows affected (0,01 sec)

On slave:

CHANGE MASTER TO MASTER_HOST="192.168.1.99", MASTER_USER="repl",MASTER_PASSWORD="$Slavepass45#", MASTER_AUTO_POSITION = 1;

SHOW SLAVE STATUS shows that everything is OK:

Slave_IO_Running: Yes
Slave_SQL_Running: Yes

And a simple check:

On master:
mysql> create database test;
Query OK, 1 row affected (0,01 sec)

On both sides:

mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| mysql              |
| performance_schema |
| test               |
+--------------------+
4 rows in set (0,00 sec)

5. After setting up replication, turn iptables back on on both servers:

[root@fabricsrv ~]# service iptables start
iptables: Applying firewall rules:                         [  OK  ]

[root@linuxsrv2 ~]# service iptables start
iptables: Applying firewall rules:                         [  OK  ]
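As an aside, the silent hang depends on how the firewall blocks the traffic. A DROP rule discards packets without notifying either end, so an already-established replication connection simply goes quiet; a REJECT rule sends an error back, and the slave's I/O thread would fail immediately. A sketch of the two rule styles (3306 is MySQL's default port; these exact rules are an illustration, not taken from the servers above):

```shell
# Silently discard incoming MySQL traffic: established connections just hang
iptables -A INPUT -p tcp --dport 3306 -j DROP

# Actively refuse it instead: the peer gets a TCP reset and errors out at once
# iptables -A INPUT -p tcp --dport 3306 -j REJECT --reject-with tcp-reset
```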

6. Slave status still shows no error:

Slave_IO_Running: Yes
Slave_SQL_Running: Yes

7. Create several more databases on master:

mysql> create database test1;
Query OK, 1 row affected (0,02 sec)

mysql> create database test2;
Query OK, 1 row affected (0,01 sec)

mysql> create database test3;
Query OK, 1 row affected (0,00 sec)

mysql> create database test4;
Query OK, 1 row affected (0,04 sec)

mysql> create database test5;
Query OK, 1 row affected (0,01 sec)

On Slave:

mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| mysql              |
| performance_schema |
| test               |
+--------------------+
4 rows in set (0,00 sec)

So the slave did not apply the changes from the master; however, SHOW SLAVE STATUS still shows:

Slave_IO_Running: Yes
Slave_SQL_Running: Yes

Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it.

The interesting thing is that even after I stopped iptables on both servers, the slave still did not apply the changes:

[root@fabricsrv ~]# service iptables stop
iptables: Setting chains to policy ACCEPT: filter          [  OK  ]
iptables: Flushing firewall rules:                         [  OK  ]
iptables: Unloading modules:                               [  OK  ]

[root@linuxsrv2 ~]# service iptables stop
iptables: Setting chains to policy ACCEPT: filter          [  OK  ]
iptables: Flushing firewall rules:                         [  OK  ]
iptables: Unloading modules:                               [  OK  ]

Again on the slave:

mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| mysql              |
| performance_schema |
| test               |
+--------------------+
4 rows in set (0,00 sec)

The solution was restarting the slave:

mysql> stop slave;
Query OK, 0 rows affected (0,03 sec)

mysql> start slave;
Query OK, 0 rows affected (0,01 sec)

mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| mysql              |
| performance_schema |
| test               |
| test1              |
| test2              |
| test3              |
| test4              |
| test5              |
+--------------------+
9 rows in set (0,00 sec)

Why did I mark this problem as critical? I spent more than 30 minutes searching for some issue or error on the slave server, because the whole company insisted that they had applied changes from the admin panel, yet those changes did not show up in the results. In the end, the sysadmin said he had just started iptables to test something :)

So is there any way for the slave to detect a firewall (or similar) problem when this occurs? Maybe a meaningful error message, or something else?
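Lacking a built-in check, one external workaround would be to compare positions on both ends: if the master's binlog position keeps advancing while the slave's Read_Master_Log_Pos stands still (with no error reported), the I/O thread is silently stalled. A rough monitoring sketch, reusing this report's hosts and replication credentials (adjust as needed):

```shell
# Hypothetical stall check: compare the master's current binlog position with
# the position the slave claims to have read; a persistent gap with no I/O
# error suggests a silently dead connection. Hosts/credentials are this report's.
MASTER_POS=$(mysql -h 192.168.1.99 -u repl -p'$Slavepass45#' \
  -e 'SHOW MASTER STATUS\G' | awk '/Position/ {print $2}')
SLAVE_POS=$(mysql -h 192.168.1.88 -e 'SHOW SLAVE STATUS\G' \
  | awk '/Read_Master_Log_Pos/ {print $2}')
if [ "$MASTER_POS" != "$SLAVE_POS" ]; then
    echo "WARNING: slave may be stalled (master=$MASTER_POS, slave read=$SLAVE_POS)"
fi
```

A momentary mismatch is normal under load; in practice one would alert only when the gap persists across several consecutive checks.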

How to repeat:
All steps are provided in the description.

Suggested fix:
The workaround was restarting the slave, but detecting the problem was not simple.
A meaningful slave error message for connections silently broken by a firewall (or similar) would be great.
[10 Jul 2014 12:07] Shane Bester
Can you set the slave_net_timeout lower and re-test?  1 hour is the default.

http://dev.mysql.com/doc/refman/5.6/en/replication-options-slave.html#option_mysqld_slave-...
[10 Jul 2014 12:16] Shahriyar Rzayev
mysql> set @@global.slave_net_timeout = 3;
Query OK, 0 rows affected, 1 warning (0,00 sec)

mysql> show warnings;
+---------+------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Level   | Code | Message                                                                                                                                                         |
+---------+------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Warning | 1704 | The requested value for the heartbeat period exceeds the value of `slave_net_timeout' seconds. A sensible value for the period should be less than the timeout. |
+---------+------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0,00 sec)

After 3 seconds nothing happened:

mysql> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 192.168.1.99
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000002
          Read_Master_Log_Pos: 4441
               Relay_Log_File: mysql-relay-bin.000004
                Relay_Log_Pos: 448
        Relay_Master_Log_File: mysql-bin.000002
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 0
                   Last_Error: 
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 4441
              Relay_Log_Space: 3963
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 0
               Last_SQL_Error: 
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 1
                  Master_UUID: 5112ae7f-081a-11e4-bd07-080027efe012
             Master_Info_File: mysql.slave_master_info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
           Master_Retry_Count: 86400
                  Master_Bind: 
      Last_IO_Error_Timestamp: 
     Last_SQL_Error_Timestamp: 
               Master_SSL_Crl: 
           Master_SSL_Crlpath: 
           Retrieved_Gtid_Set: 5112ae7f-081a-11e4-bd07-080027efe012:1-28
            Executed_Gtid_Set: 5112ae7f-081a-11e4-bd07-080027efe012:1-28
                Auto_Position: 1
1 row in set (0,00 sec)
[4 Aug 2014 15:34] Hartmut Holzgraefe
The base problem is: as long as a client is only reading from a "dead" socket after the connection was already established (the other side lost power, a cable was cut, a routing problem occurred, or a firewall rule was added, as is the case here), it will just wait for incoming traffic (which obviously won't happen anymore). It will only fail when it runs into the TCP keepalive timeout (which defaults to 2 hours).
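For reference, the 2-hour figure is the Linux kernel's default TCP keepalive idle time; the relevant tunables can be inspected (and lowered system-wide) with sysctl. The values in the comments are common Linux defaults, not readings from the reporter's machines, and probes are only sent on sockets that enable SO_KEEPALIVE:

```shell
# Linux TCP keepalive tunables (typical defaults shown in the comments)
sysctl net.ipv4.tcp_keepalive_time    # 7200 s of idle time before the first probe
sysctl net.ipv4.tcp_keepalive_intvl   # 75 s between unanswered probes
sysctl net.ipv4.tcp_keepalive_probes  # 9 probes before the socket is declared dead
```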

The "requested value for the heartbeat period exceeds the value of `slave_net_timeout' seconds." warning probably refers to the MASTER_HEARTBEAT_PERIOD setting in CHANGE MASTER. It defaults to half the slave_net_timeout value at the time of the first CHANGE MASTER operation, and needs to be set to a lower value explicitly when lowering slave_net_timeout.

So you may want to retry with e.g.:

  CHANGE MASTER TO MASTER_HEARTBEAT_PERIOD=3;
  SET GLOBAL slave_net_timeout=6;
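Note that CHANGE MASTER TO can only run while the slave threads are stopped, so the full sequence would look something like this (a sketch via the mysql client, using the suggested values above; slave_net_timeout is a GLOBAL-only variable):

```shell
mysql -e "STOP SLAVE;
          CHANGE MASTER TO MASTER_HEARTBEAT_PERIOD = 3;
          SET GLOBAL slave_net_timeout = 6;
          START SLAVE;"
```

With a 3-second heartbeat and a 6-second timeout, the I/O thread should notice a dead connection and attempt to reconnect within seconds instead of an hour.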
[27 Sep 2016 13:15] Shahriyar Rzayev
Marking as not a bug.