Bug #97345 IO Thread not detecting failed master with relay_log_space_limit.
Submitted: 23 Oct 2019 16:27  Modified: 25 Oct 2019 10:56
Reporter: Jean-François Gagné
Status: Verified
Category: MySQL Server: Replication  Severity: S2 (Serious)
Version: 5.7.28  OS: Any
Assigned to:
CPU Architecture: Any

[23 Oct 2019 16:27] Jean-François Gagné
Description:
Hi,

when using relay_log_space_limit, Slave_IO_State (as shown by SHOW SLAVE STATUS) can be "Waiting for the slave SQL thread to free enough relay log space".  If the master is stopped at that time, Slave_IO_Running (also shown by SHOW SLAVE STATUS) stays at "Yes" even though the master is gone.  I would expect the slave to detect that its master is gone and to report it in Slave_IO_Running.  See "How to repeat" below for more details.

Note: the slave accurately reporting that the master is gone is important for automation tools that query slaves to know whether their master is working.  An example of such a tool is Orchestrator.
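To illustrate why this matters for such tools, here is a minimal, hypothetical sketch (my own, not Orchestrator's actual code) of the kind of check a monitoring tool might run against "SHOW SLAVE STATUS\G" output; with this bug, such a check keeps reporting a healthy IO thread even though the master is gone:

```shell
# Hypothetical sketch; io_thread_healthy is an assumed helper name,
# not code from any real monitoring tool.
io_thread_healthy() {
  # $1: text in "SHOW SLAVE STATUS\G" format.
  # Succeeds (exit 0) iff Slave_IO_Running is "Yes".
  printf '%s\n' "$1" | awk -F': *' '
    /Slave_IO_Running:/ { running = $2 }
    END { exit (running == "Yes" ? 0 : 1) }'
}

sample='Slave_IO_State: Waiting for the slave SQL thread to free enough relay log space
Slave_IO_Running: Yes'
io_thread_healthy "$sample" && echo "master looks fine (but may in fact be gone)"
```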

Many thanks for looking into that, JFG

How to repeat:
Note: I am using SBR combined with the sleep function to generate lagging slaves.

First, let's show the expected behaviour when not using relay_log_space_limit.

# Create a sandbox:
dbdeployer deploy replication mysql_5.7.28 \
  -c 'sync_binlog = 0' -c 'innodb_flush_log_at_trx_commit = 0' \
  -c 'max_connections = 2000' -c 'binlog_format = STATEMENT' \
  -c 'slave_net_timeout = 4'

# Initialise the test environment:
./m <<< "
  CREATE DATABASE test_jfg;
  CREATE TABLE test_jfg.t(id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY, v INTEGER);"

# Load data (needs to be done in parallel because of the sleep):
for i in $(seq 1000); do yes "INSERT INTO test_jfg.t(v) value(sleep(0.1));" | head -n 100 | ./m& done; wait
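Some back-of-the-envelope arithmetic (my own, not part of the repro) on why the load must run in parallel: 1000 clients each insert 100 rows, every insert sleeps 0.1 s, so each parallel client needs at least 10 s, while a single sequential client would need around 10,000 s:

```shell
clients=1000
rows_per_client=100
sleep_ms_per_row=100   # the sleep(0.1) inside each INSERT

total_rows=$((clients * rows_per_client))
min_s_per_client=$((rows_per_client * sleep_ms_per_row / 1000))
sequential_s=$((total_rows * sleep_ms_per_row / 1000))

echo "total rows: $total_rows"                            # 100000
echo "min time per parallel client: ${min_s_per_client}s" # 10s
echo "same load done sequentially: ${sequential_s}s"      # 10000s
```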

# Check the binlog and relay logs:
ls -lh ./master/data/mysql-bin.0* ./node{1,2}/data/mysql-relay.0*

# From above, I am getting this:
-rw-r----- 1 xxx xxx 31M Oct 23 16:05 ./master/data/mysql-bin.000001
-rw-r----- 1 xxx xxx 203 Oct 23 16:04 ./node1/data/mysql-relay.000001
-rw-r----- 1 xxx xxx 31M Oct 23 16:05 ./node1/data/mysql-relay.000002
-rw-r----- 1 xxx xxx 203 Oct 23 16:04 ./node2/data/mysql-relay.000001
-rw-r----- 1 xxx xxx 31M Oct 23 16:05 ./node2/data/mysql-relay.000002

# Check status of IO Threads on slaves (in 2 different shells):
while date; do ./s1 <<< "SHOW SLAVE STATUS\G" | grep -e Slave_IO_State -e Slave_IO_Running -e Last_IO_Er; sleep 1; done
while date; do ./s2 <<< "SHOW SLAVE STATUS\G" | grep -e Slave_IO_State -e Slave_IO_Running -e Last_IO_Er; sleep 1; done

# Stop the master:
date; ./master/stop

# From the stop command above and the IO Thread status, I am getting the following, so the slaves quickly see that the master is gone.
Wed Oct 23 16:07:30 UTC 2019
stop /home/xxx/sandboxes/rsandbox_mysql_5_7_28/master

[...]
Wed Oct 23 16:07:37 UTC 2019
               Slave_IO_State: Waiting for master to send event
             Slave_IO_Running: Yes
                Last_IO_Errno: 0
                Last_IO_Error:
      Last_IO_Error_Timestamp:
Wed Oct 23 16:07:38 UTC 2019
               Slave_IO_State: Reconnecting after a failed master event read
             Slave_IO_Running: Connecting
                Last_IO_Errno: 2003
                Last_IO_Error: error reconnecting to master 'rsandbox@127.0.0.1:19529' - retry-time: 60  retries: 1
      Last_IO_Error_Timestamp: 191023 16:07:38
[...]

# Let's retry all of above with this different sandbox (adding max_relay_log_size and relay_log_space_limit):
dbdeployer deploy replication mysql_5.7.28 \
  -c 'sync_binlog = 0' -c 'innodb_flush_log_at_trx_commit = 0' \
  -c 'max_connections = 2000' -c 'binlog_format = STATEMENT' \
  -c 'slave_net_timeout = 4' \
  -c "max_relay_log_size = $((1*1024*1024))" -c "relay_log_space_limit = $((4*1024*1024))"

# Same Initialise, Load and Check commands as above; for the check I am getting this, and we can see that the binlogs are not fully downloaded because of relay_log_space_limit:
-rw-r----- 1 xxx xxx  31M Oct 23 16:14 ./master/data/mysql-bin.000001
-rw-r----- 1 xxx xxx  203 Oct 23 16:10 ./node1/data/mysql-relay.000001
-rw-r----- 1 xxx xxx 1.1M Oct 23 16:12 ./node1/data/mysql-relay.000002
-rw-r----- 1 xxx xxx 1.1M Oct 23 16:12 ./node1/data/mysql-relay.000003
-rw-r----- 1 xxx xxx 1.1M Oct 23 16:12 ./node1/data/mysql-relay.000004
-rw-r----- 1 xxx xxx 1.0M Oct 23 16:12 ./node1/data/mysql-relay.000005
-rw-r----- 1 xxx xxx  203 Oct 23 16:10 ./node2/data/mysql-relay.000001
-rw-r----- 1 xxx xxx 1.1M Oct 23 16:12 ./node2/data/mysql-relay.000002
-rw-r----- 1 xxx xxx 1.1M Oct 23 16:12 ./node2/data/mysql-relay.000003
-rw-r----- 1 xxx xxx 1.1M Oct 23 16:12 ./node2/data/mysql-relay.000004
-rw-r----- 1 xxx xxx 1.0M Oct 23 16:12 ./node2/data/mysql-relay.000005

# Checking both slaves, I am getting something like this:
[...]
Wed Oct 23 16:13:35 UTC 2019
               Slave_IO_State: Waiting for the slave SQL thread to free enough relay log space
             Slave_IO_Running: Yes
                Last_IO_Errno: 0
                Last_IO_Error: 
      Last_IO_Error_Timestamp: 
[...]

# After stopping the master, the slaves do not detect that their master is gone (even after waiting more than 30 seconds):
Wed Oct 23 16:14:26 UTC 2019
stop /home/xxx/sandboxes/rsandbox_mysql_5_7_28/master

[...]
Wed Oct 23 16:15:12 UTC 2019
               Slave_IO_State: Waiting for the slave SQL thread to free enough relay log space
             Slave_IO_Running: Yes
                Last_IO_Errno: 0
                Last_IO_Error: 
      Last_IO_Error_Timestamp: 
[...]

# If I restart the IO Thread on one of the slaves, it detects that the master is gone:
./s1 <<< "STOP SLAVE IO_THREAD; START SLAVE IO_THREAD"

[...]
Wed Oct 23 16:15:27 UTC 2019
               Slave_IO_State: Waiting for the slave SQL thread to free enough relay log space
             Slave_IO_Running: Yes
                Last_IO_Errno: 0
                Last_IO_Error:
      Last_IO_Error_Timestamp:
Wed Oct 23 16:15:28 UTC 2019
               Slave_IO_State: Connecting to master
             Slave_IO_Running: Connecting
                Last_IO_Errno: 2003
                Last_IO_Error: error connecting to master 'rsandbox@127.0.0.1:19529' - retry-time: 60  retries: 1
      Last_IO_Error_Timestamp: 191023 16:15:27
[...]

# Meanwhile, the other slave still thinks the master is fine, even after more than 2 minutes:
[...]
Wed Oct 23 16:16:35 UTC 2019
               Slave_IO_State: Waiting for the slave SQL thread to free enough relay log space
             Slave_IO_Running: Yes
                Last_IO_Errno: 0
                Last_IO_Error: 
      Last_IO_Error_Timestamp: 
[...]

Suggested fix:
I would expect slaves, even in the "Waiting for the slave SQL thread to free enough relay log space" state, to quickly detect that their master is gone.

As a side note, in the above-mentioned state it looks like the slaves are not processing heartbeats from the master; if they were, they would notice that the master is gone.
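Until this is fixed, a monitoring-side mitigation could be to distrust Slave_IO_Running while the IO thread is in this state and to probe the master directly (for example with mysqladmin ping). A hypothetical sketch, with the probe command injectable so the decision logic can be exercised without a live server; the function name is my own, not an existing tool or API:

```shell
# Hypothetical mitigation sketch; master_alive_according_to_slave is an
# assumed helper name, not part of MySQL or any monitoring tool.
master_alive_according_to_slave() {
  io_state=$1; io_running=$2; probe_cmd=$3
  [ "$io_running" = "Yes" ] || return 1
  case $io_state in
    "Waiting for the slave SQL thread to free enough relay log space")
      # In this state the IO thread may not notice a dead master (this bug),
      # so fall back to an independent probe such as `mysqladmin ping`.
      $probe_cmd ;;
    *)
      return 0 ;;
  esac
}

# Stub probes (`true`/`false`) stand in for a real `mysqladmin ping`:
master_alive_according_to_slave \
  "Waiting for the slave SQL thread to free enough relay log space" "Yes" false \
  || echo "master unreachable despite Slave_IO_Running=Yes"
```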
[25 Oct 2019 10:56] Bogdan Kecman
Hi,

Thanks for the report. I fully agree with your expectations. Let's see what the replication team has to say :)

all best