MySQL Bugs: #93843: Setting MASTER_DELAY to 0 on a stopped delayed slave wipes the relay-log.

Bug #93843	Setting MASTER_DELAY to 0 on a stopped delayed slave wipes the relay-log.
Submitted:	7 Jan 2019 17:21	Modified:	10 Jan 2019 16:00
Reporter:	Jean-François Gagné	Email Updates:
Status:	Duplicate	Impact on me:	None
Category:	MySQL Server: Replication	Severity:	S2 (Serious)
Version:	5.7.24	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
Hi,

On a stopped delayed slave, setting MASTER_DELAY to 0 wipes the relay-log.  I would expect only the delay to be changed and, once restarted, the slave to process its pre-downloaded relay logs.  See How to repeat below for details.

This is particularly problematic on a delayed slave of a crashed master.  If this is the only slave of the master and we want to failover to it, I would expect setting MASTER_DELAY to 0 would quickly process all the relay logs.  With this bug, I would end-up wiping my relay logs and have no good ways to recover the data.

There is a related ticket open: Bug#87249.  But it is classified as a feature request.  I consider this to be a serious bug because of explanation in previous paragraph.

Note that if I leave the IO Thread running (by using STOP SLAVE SQL_THREAD instead of STOP SLAVE in How to repeat below), things work as expected.  However, this is very specific knowledge and this behaviour can be considered a trap for new MySQL users and DBA.  Also, it is not always possible to have, or keep having, the IO Thread running (the IO Thread can stop running after connections retries, and I could describe, and think of, other situations where the IO Thread would be stopped).  So keeping the IO Thread running is not a satisfactory way of changing the delay of a delayed slave.

Many thanks for looking into that,

JFG

How to repeat:
$ dbdeployer deploy replication mysql_5.7.24

$ cd $sandboxdir

$ ./s1 <<< "STOP SLAVE; CHANGE MASTER TO MASTER_DELAY = 30; START SLAVE"; \
  ./m <<< "CREATE DATABASE test_jfg; CREATE TABLE test_jfg.t1(id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY, t TIMESTAMP)"

# Note that if I use STOP SLAVE SQL_THREAD instead of STOP SLAVE below, things work as expected.
$ for i in $(seq 40); do sleep 1.1; ./m <<< "INSERT INTO test_jfg.t1(t) VALUE (NOW())"; done; \
  ./m <<< "SELECT MAX(id) FROM test_jfg.t1; SET GLOBAL offline_mode = ON"; \
  ls -1 node1/data/mysql-relay.0*; \
  ./s1 <<< "SHOW SLAVE STATUS\G STOP SLAVE; CHANGE MASTER TO MASTER_DELAY = 0; START SLAVE"
MAX(id)
40
node1/data/mysql-relay.000001
node1/data/mysql-relay.000002
*************************** 1. row ***************************
               Slave_IO_State: Reconnecting after a failed master event read
                  Master_Host: 127.0.0.1
                  Master_User: rsandbox
                  Master_Port: 19125
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000001
          Read_Master_Log_Pos: 15482
               Relay_Log_File: mysql-relay.000002
                Relay_Log_Pos: 4288
        Relay_Master_Log_File: mysql-bin.000001
             Slave_IO_Running: Connecting
            Slave_SQL_Running: Yes
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 0
                   Last_Error: 
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 8057
              Relay_Log_Space: 11916
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: 29
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 3032
                Last_IO_Error: error reconnecting to master 'rsandbox@127.0.0.1:19125' - retry-time: 60  retries: 1
               Last_SQL_Errno: 0
               Last_SQL_Error: 
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 100
                  Master_UUID: 00019125-1111-1111-1111-111111111111
             Master_Info_File: /home/jgagne/sandboxes/rsandbox_mysql_5_7_24/node1/data/master.info
                    SQL_Delay: 30
          SQL_Remaining_Delay: 1
      Slave_SQL_Running_State: Waiting until MASTER_DELAY seconds after master executed event
           Master_Retry_Count: 86400
                  Master_Bind: 
      Last_IO_Error_Timestamp: 190107 16:48:06
     Last_SQL_Error_Timestamp: 
               Master_SSL_Crl: 
           Master_SSL_Crlpath: 
           Retrieved_Gtid_Set: 
            Executed_Gtid_Set: 
                Auto_Position: 0
         Replicate_Rewrite_DB: 
                 Channel_Name: 
           Master_TLS_Version: 

# In above, we can see that we have 2 relays logs and that the slave is pointing in the 2nd.
# We can also see that we haver 40 rows in t1 on the master.

# Let's check the data on the slave, we expect 40 rows, be we have less.
$ ./s1 <<< "SELECT MAX(id) FROM test_jfg.t1"
MAX(id)
13

# We also do not have the 2 relay logs anymore.
$ ls -1 node1/data/mysql-relay.0*
node1/data/mysql-relay.000001

Suggested fix:
Do not wipe relay logs when changing MASTER_DELAY.

I would also expect the same from many other parameters from CHANGE MASTER TO, like (not tested):
-MASTER_USER and MASTER_PASSWORD,
-MASTER_CONNECT_RETRY, MASTER_RETRY_COUNT and MASTER_HEARTBEAT_PERIOD
-MASTER_AUTO_POSITION = {0|1}
-MASTER_SSL* and MASTER_TLS_VERSION = 'protocol_list'
-IGNORE_SERVER_IDS
-maybe more

Duplicate of bug #87249

Hello Jean-François,

Thank you for the report and feedback.
I agree with Sveta, this is indeed duplicate of Bug#87249 which is handled as FR. I'll mark this as a duplicate of Bug#87249 but fully agree with your conclusion i.e as you explained in the description i.e "serious bug because one would just end-up wiping relay logs and have no good ways to recover the data.". If you are okay then I'll discuss with Bogdan on this and push for converting Bug#87249 to bug. Thank you!

regards,
Umesh

5.7.24 - test results

Attachment: 93843_5.7.24.results (application/octet-stream, text), 4.55 KiB.

Thanks for your analysis Umesh.
Ok for what you write, I am already subscribed to Bug#87249 and will follow progress there.
Cheer, JFG

Thank you for confirming, our Bogdan has re-classified base bug now.

regards,
Umesh