Bug #84249 CHANGE MASTER TO MASTER_DELAY can lead a slave to skip transactions
Submitted: 18 Dec 2016 18:08 Modified: 24 May 2017 12:34
Reporter: Sergio Roysen Email Updates:
Status: Duplicate Impact on me:
None 
Category:MySQL Server: Replication Severity:S2 (Serious)
Version:5.7 OS:Any
Assigned to: CPU Architecture:Any
Tags: replication

[18 Dec 2016 18:08] Sergio Roysen
Description:
With traditional replication (no GTID) and using 5.7.16

Using CHANGE MASTER TO MASTER_DELAY in a slave could lead to the SQL thread to skip transactions if both the IO_THREAD and the SQL_THREAD had been stopped and the SQL_THREAD has pending transactions to apply from the relay logs.

According to the documentation for 5.7:
Prior to MySQL 5.7.4, CHANGE MASTER TO deletes all relay log files and starts a new one, unless you specify RELAY_LOG_FILE or RELAY_LOG_POS. In that case, relay log files are kept; the relay_log_purge global variable is set silently to 0. In MySQL 5.7.4 and later, relay logs are preserved when neither the slave SQL thread nor the slave I/O thread is stopped; if both threads are stopped, all relay log files are deleted unless you at least one of RELAY_LOG_FILE or RELAY_LOG_POS is specified.

That is actually the case, and the local relay logs are discarded when a `CHANGE MASTER TO MASTER_DELAY` is applied.
But unlike other CHANGE MASTER commands where under the same conditions, `MASTER_LOG_FILE` and `MASTER_LOG_POS` are changed to reflect the values of Relay_Master_Log_File and Exec_Master_Log_Pos, in this case, those values are left untouched.
This will result in the SQL thread to skip all the transactions that it still had to apply from the relay log files that were discarded.

How to repeat:
How to reproduce:

Using a simple replication pair. 
The test is going to stop the SQL thread and the IO thread in two separate steps, to make sure that the SQL thread has pending transactions to run in the relay logs. That's almost guaranteed in a production environment under heavy load when just running `STOP SLAVE`.

# On the master:

CREATE DATABASE db_test;
USE db_test;
CREATE TABLE sequential (id int(11) NOT NULL AUTO_INCREMENT, created_at datetime, PRIMARY KEY (id)) ENGINE=InnoDB;
INSERT INTO sequential (created_at) VALUES (NOW());

# On the slave:
STOP SLAVE SQL_THREAD;

# On the master:
INSERT INTO sequential (created_at) VALUES (NOW());

# On the slave, after it retrieved the latest transaction from the master
STOP SLAVE IO_THREAD;

# Runing SHOW SLAVE STATUS here will show that the IO thread is aiming to binlog position _Y_ from the master (Master_log_pos)
# and the SQL thread is aiming to position _X_ (Exec_Master_Log_Pos)

# Apply a delay, any delay, even zero.
CHANGE MASTER TO MASTER_DELAY = 0;

# Running SHOW SLAVE STATUS here will show that relay logs had been reseted (Relay_Log_File:Relay_Log_Pos), and it will also show that
# the IO thread is still aiming to the same binlog position _Y_ from the master (Master_log_pos) and now the SQL thread is now aiming
# to that position as well. This will result on the SQL thread to skip all the transactions between X and Y

START SLAVE;

Under the above conditions, the slave is going to miss the second record added in the master.

Suggested fix:
Suggested fix:
As with analog `CHANGE MASTER ... ` commands, when MySQL discards the local relay logs it should apply to MASTER_LOG_FILE and MASTER_LOG_POS the binlog positions that are relevant to the SQL thread: Relay_Master_Log_File and Exec_Master_Log_Pos
[20 Dec 2016 11:22] MySQL Verification Team
Hi,

Thanks for your report, verified as described.
[2 Mar 2017 15:24] Saverio Miroddi
Duplicate of https://bugs.mysql.com/bug.php?id=84375 (Changing MASTER_DELAY to 0 purges the relay log, causing corruption).

It's not an issue, but something that's it's not very obviously pointed out in the documentation - the I/O thread must not be stopped, in this context, when changing the master.

This is the reply from support:

> In your case, after step 4 - new relay log is generated(i.e slave server creates a new relay log file each time the I/O thread starts). So instead of
> 
> STOP SLAVE;
> CHANGE MASTER TO MASTER_DELAY=<value>;
> START SLAVE;
> 
> it is enough to do the following with this feature.
> 
> STOP SLAVE SQL_THREAD;
> CHANGE MASTER TO MASTER_DELAY=<value>;
> START SLAVE SQL_THREAD;
> 
> Please see more details on this here http://mysqlserverteam.com/mysql-5-7-4-change-master-without-stopping-slave-altogether/
[4 Mar 2017 14:58] Sergio Roysen
I consider this a dangerous bug.

According to the official documentation, relay logs are going to be deleted every time that a `CHANGE MASTER ....` command is issued while both slave threads are stopped.

There are other `CHANGE MASTER...` commands where that deletion is done in a safe way, and the values of `MASTER_LOG_FILE` and `MASTER_LOG_FILE` are changed accordingly to guaranteed that any transactions that had not yet been applied by the sql_thread are going to be retrieved from the master again.

You can verify that quite easily:

Follow the step described above to reproduce the bug, but this time, instead of running `CHANGE MASTER TO MASTER_DELAY = 0;` run `CHANGE MASTER TO MASTER_RETRY_COUNT=20;`.

You will see that the relay logs are purged, but you will also see that the master bin log positions have been modified so that there are not going to be any missing transactions when you start the slave threads again.

Regarding the documentation that you were refereed to:

I don't consider http://mysqlserverteam.com/mysql-5-7-4-change-master-without-stopping-slave-altogether/ as official documentation of the behavior of MySQL. It is a very informative and interesting blog article about a new feature (been able to stop only one of the two slave threads for some operations), but not documentation.

When that article refers to stopping the slave thread when changing MASTER_DELAY, it says:

> it is enough to do the following with this feature.

That is clearly misleading and should say instead: 

`You MUST do the following with this feature`.

The same article says at the bottom:

> SIDE-EFFECTS?
>
> While implementing this, we have taken special care to make sure we dont
> break anything for a user switching masters like:
>
> STOP SLAVE;
> CHANGE MASTER to <master_def>;
> START SLAVE.
> There are absolutely NO side-effects to worry you. 

As it was shown here and in the other bug report, that is clearly not the case.
Things can and will break.
[24 May 2017 12:34] Erlend Dahl
Duplicate of

Bug#81232 Changing master_delay after stop slave results in loss of events.
[24 Jul 2017 19:18] Shlomi Noach
Affected by same bug.