Description:
Hi,
On a stopped delayed slave, setting MASTER_DELAY to 0 wipes the relay-log. I would expect only the delay to be changed and, once restarted, the slave to process its pre-downloaded relay logs. See How to repeat below for details.
This is particularly problematic on a delayed slave of a crashed master. If this is the only slave of the master and we want to failover to it, I would expect setting MASTER_DELAY to 0 would quickly process all the relay logs. With this bug, I would end-up wiping my relay logs and have no good ways to recover the data.
There is a related ticket open: Bug#87249. But it is classified as a feature request. I consider this to be a serious bug because of explanation in previous paragraph.
Note that if I leave the IO Thread running (by using STOP SLAVE SQL_THREAD instead of STOP SLAVE in How to repeat below), things work as expected. However, this is very specific knowledge and this behaviour can be considered a trap for new MySQL users and DBA. Also, it is not always possible to have, or keep having, the IO Thread running (the IO Thread can stop running after connections retries, and I could describe, and think of, other situations where the IO Thread would be stopped). So keeping the IO Thread running is not a satisfactory way of changing the delay of a delayed slave.
Many thanks for looking into that,
JFG
How to repeat:
$ dbdeployer deploy replication mysql_5.7.24
$ cd $sandboxdir
$ ./s1 <<< "STOP SLAVE; CHANGE MASTER TO MASTER_DELAY = 30; START SLAVE"; \
./m <<< "CREATE DATABASE test_jfg; CREATE TABLE test_jfg.t1(id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY, t TIMESTAMP)"
# Note that if I use STOP SLAVE SQL_THREAD instead of STOP SLAVE below, things work as expected.
$ for i in $(seq 40); do sleep 1.1; ./m <<< "INSERT INTO test_jfg.t1(t) VALUE (NOW())"; done; \
./m <<< "SELECT MAX(id) FROM test_jfg.t1; SET GLOBAL offline_mode = ON"; \
ls -1 node1/data/mysql-relay.0*; \
./s1 <<< "SHOW SLAVE STATUS\G STOP SLAVE; CHANGE MASTER TO MASTER_DELAY = 0; START SLAVE"
MAX(id)
40
node1/data/mysql-relay.000001
node1/data/mysql-relay.000002
*************************** 1. row ***************************
Slave_IO_State: Reconnecting after a failed master event read
Master_Host: 127.0.0.1
Master_User: rsandbox
Master_Port: 19125
Connect_Retry: 60
Master_Log_File: mysql-bin.000001
Read_Master_Log_Pos: 15482
Relay_Log_File: mysql-relay.000002
Relay_Log_Pos: 4288
Relay_Master_Log_File: mysql-bin.000001
Slave_IO_Running: Connecting
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 8057
Relay_Log_Space: 11916
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 29
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 3032
Last_IO_Error: error reconnecting to master 'rsandbox@127.0.0.1:19125' - retry-time: 60 retries: 1
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 100
Master_UUID: 00019125-1111-1111-1111-111111111111
Master_Info_File: /home/jgagne/sandboxes/rsandbox_mysql_5_7_24/node1/data/master.info
SQL_Delay: 30
SQL_Remaining_Delay: 1
Slave_SQL_Running_State: Waiting until MASTER_DELAY seconds after master executed event
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp: 190107 16:48:06
Last_SQL_Error_Timestamp:
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set:
Executed_Gtid_Set:
Auto_Position: 0
Replicate_Rewrite_DB:
Channel_Name:
Master_TLS_Version:
# In above, we can see that we have 2 relays logs and that the slave is pointing in the 2nd.
# We can also see that we haver 40 rows in t1 on the master.
# Let's check the data on the slave, we expect 40 rows, be we have less.
$ ./s1 <<< "SELECT MAX(id) FROM test_jfg.t1"
MAX(id)
13
# We also do not have the 2 relay logs anymore.
$ ls -1 node1/data/mysql-relay.0*
node1/data/mysql-relay.000001
Suggested fix:
Do not wipe relay logs when changing MASTER_DELAY.
I would also expect the same from many other parameters from CHANGE MASTER TO, like (not tested):
-MASTER_USER and MASTER_PASSWORD,
-MASTER_CONNECT_RETRY, MASTER_RETRY_COUNT and MASTER_HEARTBEAT_PERIOD
-MASTER_AUTO_POSITION = {0|1}
-MASTER_SSL* and MASTER_TLS_VERSION = 'protocol_list'
-IGNORE_SERVER_IDS
-maybe more