MySQL Bugs: #84375: Changing MASTER_DELAY to 0 purges the relay log, causing corruption.

Bug #84375	Changing MASTER_DELAY to 0 purges the relay log, causing corruption.
Submitted:	30 Dec 2016 10:27	Modified:	2 Aug 2017 14:26
Reporter:	Saverio Miroddi	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Documentation	Severity:	S1 (Critical)
Version:	5.7.17	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
When changing the replication delay from a non-zero value to 0, MySQL will cause corruption on the slave, apparently because it purges the relay log.

I've checked the documentation (mainly the two pages about CHANGE MASTER, and Delayed replication), and there is no mention of resets in association with MASTER DELAY.

How to repeat:
Workflow:

- setup a slave with a MASTER_DELAY=900
- execute STOP SLAVE
- execute CHANGE MASTER TO MASTER_DELAY=0
- execute START SLAVE

Slave status before:

          Read_Master_Log_Pos: 43779970
               Relay_Log_File: mysql-relay-bin.000913
                Relay_Log_Pos: 12797792
            Slave_SQL_Running: Yes
                   Last_Errno: 0
                   Last_Error: 
          Exec_Master_Log_Pos: 12797579
              Relay_Log_Space: 43780437
        Seconds_Behind_Master: 899
               Last_SQL_Errno: 0
               Last_SQL_Error: 
                    SQL_Delay: 900
          SQL_Remaining_Delay: 1
      Slave_SQL_Running_State: Waiting until MASTER_DELAY seconds after master executed event
     Last_SQL_Error_Timestamp: 

Slave status after:

          Read_Master_Log_Pos: 44869872
               Relay_Log_File: mysql-relay-bin.000002
                Relay_Log_Pos: 869
            Slave_SQL_Running: No
                   Last_Errno: 1032
                   Last_Error: Could not execute Update_rows event on table mydb.mytable; Can't find record in 'mytable', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mysql-bin.003470, end_log_pos 44068724
          Exec_Master_Log_Pos: 44065228
              Relay_Log_Space: 805720
        Seconds_Behind_Master: NULL
               Last_SQL_Errno: 1032
               Last_SQL_Error: Could not execute Update_rows event on table mydb.mytable; Can't find record in 'mytable', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mysql-bin.003470, end_log_pos 44068724
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: 
     Last_SQL_Error_Timestamp: 161230 09:59:33

Relay log files after:

    207 mysql-relay-bin.000001
3587081 mysql-relay-bin.000002
     78 mysql-relay-bin.index

Suggested fix:
There is a possibility that this is a documentation issue, although I don't think this is the case, because this behavior prevents (without trickery) changing a slave from delayed to non-delayed.

Thank you for the bug report. Please first check with the latest MySQL released binary 5.7.17 and comment here if you get the same issue. Thanks.

For reference, changed the version to 5.7.17 (Community Edition).

Proposing the severity to Critical, and in certain setups, this bug will lead to catastrophic results.

Reproduced on MySQL Community edition 5.7.17.

It's very easily reproducible:

1. set a large enough delay
2. create a table on the master
3. make sure the update is replicated to the relay log
4. reset the delay
5. insert a record in the table on the master
6. slave will fail the replication.

based on my test, the slave will reset the relay log on 4., which contains the table creation statement. 5. will therefore cause the slave to fail the replication, since it (the slave) won't replay the creation statement, but only the insertion.

>>based on my test, the slave will reset the relay log on 4., which contains the table creation statement. 5. will therefore cause the slave to fail the replication, since it (the slave) won't replay the creation statement, but only the insertion.

Prior to MySQL 5.7.4, CHANGE MASTER TO deletes all relay log files and starts a new one, unless you specify RELAY_LOG_FILE or RELAY_LOG_POS. In that case, relay log files are kept; the relay_log_purge global variable is set silently to 0. In MySQL 5.7.4 and later, relay logs are preserved when neither the slave SQL thread nor the slave I/O thread is stopped; if both threads are stopped, all relay log files are deleted unless you at least one of RELAY_LOG_FILE or RELAY_LOG_POS is specified. 

In your case, after step 4 - new relay log is generated(i.e slave server creates a new relay log file each time the I/O thread starts). So instead of

STOP SLAVE;
CHANGE MASTER TO MASTER_DELAY=<value>;
START SLAVE;

it is enough to do the following with this feature.

STOP SLAVE SQL_THREAD;
CHANGE MASTER TO MASTER_DELAY=<value>;
START SLAVE SQL_THREAD;

Please see more details on this here http://mysqlserverteam.com/mysql-5-7-4-change-master-without-stopping-slave-altogether/

For example:

-- relay log is rotated WITH STOP SLAVE

root@localhost [(none)]> \! ls -l slave/hod03-relay-bin*
-rw-r----- 1 umshastr common 177 Jan  3 11:48 slave/hod03-relay-bin.000001
-rw-r----- 1 umshastr common  25 Jan  3 11:48 slave/hod03-relay-bin.index
root@localhost [(none)]>
root@localhost [(none)]> STOP SLAVE;
Query OK, 0 rows affected, 1 warning (0.00 sec)

root@localhost [(none)]> CHANGE MASTER TO MASTER_DELAY=900;
Query OK, 0 rows affected (0.00 sec)

root@localhost [(none)]> START SLAVE;
Query OK, 0 rows affected (0.01 sec)

root@localhost [(none)]> SHOW SLAVE STATUS\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: localhost
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: master-bin.000001
          Read_Master_Log_Pos: 5551
               Relay_Log_File: hod03-relay-bin.000002
                Relay_Log_Pos: 5766
        Relay_Master_Log_File: master-bin.000001
             Slave_IO_Running: Yes
.

root@localhost [(none)]>  \! ls -l slave/hod03-relay-bin*
-rw-r----- 1 umshastr common  207 Jan  3 11:49 slave/hod03-relay-bin.000001
-rw-r----- 1 umshastr common 5766 Jan  3 11:49 slave/hod03-relay-bin.000002
-rw-r----- 1 umshastr common   50 Jan  3 11:49 slave/hod03-relay-bin.index

-- relay log is not rotated WITH STOP SLAVE

root@localhost [(none)]>  \! ls -l slave/hod03-relay-bin*
-rw-r----- 1 umshastr common  207 Jan  3 11:49 slave/hod03-relay-bin.000001
-rw-r----- 1 umshastr common 5766 Jan  3 11:49 slave/hod03-relay-bin.000002
-rw-r----- 1 umshastr common   50 Jan  3 11:49 slave/hod03-relay-bin.index
root@localhost [(none)]> STOP SLAVE SQL_THREAD;
Query OK, 0 rows affected (0.00 sec)

root@localhost [(none)]> CHANGE MASTER TO MASTER_DELAY=0;
Query OK, 0 rows affected (0.00 sec)

root@localhost [(none)]> START SLAVE SQL_THREAD;
Query OK, 0 rows affected (0.00 sec)

root@localhost [(none)]>  \! ls -l slave/hod03-relay-bin*
-rw-r----- 1 umshastr common  207 Jan  3 11:49 slave/hod03-relay-bin.000001
-rw-r----- 1 umshastr common 5766 Jan  3 11:49 slave/hod03-relay-bin.000002
-rw-r----- 1 umshastr common   50 Jan  3 11:49 slave/hod03-relay-bin.index
root@localhost [(none)]>

Could you please check this at your end?

Thanks,
Umesh

OK, this works, thanks - not an issue, then.

I suggest, for clarity, to add a brief explanation to the "Delayed Replication" chapter (https://dev.mysql.com/doc/refman/5.7/en/replication-delayed.html).

Thank you for the feedback!

Looks like this is a duplicate of:
Bug #84249 	CHANGE MASTER TO MASTER_DELAY can lead a slave to skip transactions

Affected by same bug.

Posted by developer:
 
Confirmed with development that the problem behavior is now removed by the fix for
Changing master_delay after stop slave results in loss of events
https://bugs.mysql.com/bug.php?id=81232
which was implemented in the 5.7.19 and 8.0.2 releases. The fix means that issuing CHANGE MASTER TO does not purge the relay log even if both threads are stopped.
Added a "See also" reference to the changelog entry for that bug.