Bug #87102 Changing MASTER_DELAY to 0 purges all binlogs and makes recovery impossible
Submitted: 18 Jul 2017 4:23 Modified: 20 Sep 2017 6:22
Reporter: Mikhail Golenkov Email Updates:
Status: Not a Bug Impact on me:
None 
Category:MySQL Server: Replication Severity:S2 (Serious)
Version:5.6.36 OS:CentOS (CentOS 6,7. x86_64)
Assigned to: Venkatesh Duggirala CPU Architecture:Any
Tags: replication delay recovery

[18 Jul 2017 4:23] Mikhail Golenkov
Description:
Executing "CHANGE MASTER TO MASTER_DELAY = 0" on a slave purges all binary logs that have been downloaded from master. 
When the master crashes, all transactions should be executed on a slave as fast as possible in order to prepare it to be a new master. However after executing "CHANGE MASTER TO MASTER_DELAY = 0" the slave deletes all binary logs with all data changes and makes restoring impossible.

How to repeat:
For example, we have standard master-slave replication with a delay = N seconds (6 hours in this case, but in general it can be any delay)

Slave is working (IO_THREAD and SQL_THREAD are running), SQL_DELAY = 21600.

Creating some content on master:

create database mygolenk;
create table mygolenk.user like mysql.user;
insert into mygolenk.user select * from mysql.user;

Then shutting down the master (simulating master crash and loosing all binary logs on master).

Thus, our master is down and our aim is to recover database using slave (6 hour lagged) + downloaded binlogs on slave with all data that have been changed for the last 6 hours.

List of binary logs on a slave:

[root@testforlab-lin2 bin-logs]# ls -lh
total 817M
-rw-rw---- 1 mysql mysql  120 Jul 18 10:26 mysql-bin.000022
-rw-rw---- 1 mysql mysql   33 Jul 18 10:27 mysql-bin.index
-rw-rw---- 1 mysql mysql 211M Jul 18 10:35 mysql-relay-bin.000003
-rw-rw---- 1 mysql mysql 606M Jul 18 11:30 mysql-relay-bin.000004
-rw-rw---- 1 mysql mysql   78 Jul 18 10:44 mysql-relay-bin.index

With all changes for last 6 hours, including our create database mygolenk:

[root@testforlab-lin2 bin-logs]# mysqlbinlog mysql-relay-bin.000004 | grep -B5 -i "create database"
# at 631506131
#170718 11:28:26 server id 1  end_log_pos 631506074 CRC32 0x64d1c757    Query   thread_id=197   exec_time=0     error_code=0
SET TIMESTAMP=1500341306/*!*/;
/*!\C utf8 *//*!*/;
SET @@session.character_set_client=33,@@session.collation_connection=33,@@session.collation_server=8/*!*/;
create database mygolenk

Relay-log.info file content:

[root@testforlab-lin2 mysql]# cat relay-log.info
/mysql/bin-logs/mysql-relay-bin.000004
423141238
mysql-bin.000005
423141075
21600

Master.info file content:

[root@testforlab-lin2 mysql]# cat master.info
mysql-bin.000005
634712142

Master crashed at mysql-bin.000005 pos 634712142 (last event in the binlog)

On slave we have:

mysql> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Reconnecting after a failed master event read

              Master_Log_File: mysql-bin.000005
          Read_Master_Log_Pos: 634712142
               Relay_Log_File: mysql-relay-bin.000004
                Relay_Log_Pos: 429552454
        Relay_Master_Log_File: mysql-bin.000005
             Slave_IO_Running: Connecting
            Slave_SQL_Running: Yes

          Exec_Master_Log_Pos: 429552291
              Relay_Log_Space: 855898791

        Seconds_Behind_Master: 21376

                Last_IO_Errno: 2003
                Last_IO_Error: error reconnecting to master 'replication@10.147.39.20:3306' - retry-time: 60  retries: 21
               Last_SQL_Errno: 0
               Last_SQL_Error:

                    SQL_Delay: 21600
          SQL_Remaining_Delay: 224
      Slave_SQL_Running_State: Waiting until MASTER_DELAY seconds after master executed event

Trying to rollforward all transactions until the position when the master crashed:

START SLAVE UNTIL MASTER_LOG_FILE='mysql-bin.000005', MASTER_LOG_POS=634712142;

1 warning in a result:
mysql> show warnings;
+-------+------+--------------------------+
| Level | Code | Message                  |
+-------+------+--------------------------+
| Note  | 1254 | Slave is already running |
+-------+------+--------------------------+

Stopping SQL_THREAD (SQL_TREAD only to protect downloaded binary logs):

mysql> stop slave SQL_THREAD;
Query OK, 0 rows affected (0.00 sec)

mysql> START SLAVE UNTIL MASTER_LOG_FILE='mysql-bin.000005', MASTER_LOG_POS=634712142;
Query OK, 0 rows affected (0.00 sec)

SQL_THREAD is running, but it still executing transactions according SQL_DELAY

mysql> show slave status\G
*************************** 1. row ***************************
      Slave_SQL_Running_State: Waiting until MASTER_DELAY seconds after master executed event

We need to remove SQL_Delay (set it to 0) and roll forward all transactions from the binary logs.
Trying change master delay without stopping IO_THREAD

mysql> stop slave SQL_THREAD;
Query OK, 0 rows affected (0.05 sec)
mysql> change master to master_delay=0;
ERROR 1198 (HY000): This operation cannot be performed with a running slave; run STOP SLAVE first

It is necessary to stop both threads to set sql_delay.
Stopping slave and changing SQL_Delay:

mysql> stop slave;
Query OK, 0 rows affected (0.00 sec)

mysql> change master to master_delay=0;
Query OK, 0 rows affected (0.37 sec)

[root@testforlab-lin2 bin-logs]# ls -lh
total 16K
-rw-rw---- 1 mysql mysql 120 Jul 18 10:26 mysql-bin.000022
-rw-rw---- 1 mysql mysql  33 Jul 18 10:27 mysql-bin.index
-rw-rw---- 1 mysql mysql 120 Jul 18 12:14 mysql-relay-bin.000001
-rw-rw---- 1 mysql mysql  39 Jul 18 12:14 mysql-relay-bin.index

Finally, we set SQL_Delay = 0, but we lost all downloaded binary logs.

mysql> start slave;
Query OK, 0 rows affected (0.01 sec)

mysql> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Connecting to master

              Master_Log_File: mysql-bin.000005
          Read_Master_Log_Pos: 442374704
               Relay_Log_File: mysql-relay-bin.000001
                Relay_Log_Pos: 4
        Relay_Master_Log_File: mysql-bin.000005
             Slave_IO_Running: Connecting
            Slave_SQL_Running: Yes

          Exec_Master_Log_Pos: 442374704
              Relay_Log_Space: 120

        Seconds_Behind_Master: NULL

                Last_IO_Errno: 2003
                Last_IO_Error: error connecting to master 'replication@10.147.39.20:3306' - retry-time: 60  retries: 1
               Last_SQL_Errno: 0
               Last_SQL_Error:

                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it

mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| mysql              |
| performance_schema |
+--------------------+
4 rows in set (0.00 sec)

No mygolenk database. All data for last 6 hours have been lost

Suggested fix:

In case when the master is down and master's binlogs are not available anymore - changing master_delay leads to purging all downloaded binary logs and loosing all data changes for the period of SQL_Delay.

Alternatively, it is possible to wait for 6 hours while SQL_THREAD executes all transactions. If the delay is 1 day - it will take 1 day to recover the database. Obviously it's not appropriate solution.

Delayed replication is a good tool that can help DBAs to recover database after user's destructive commands like drop, delete or update (start slave until). However it can not help in case of master's physical crash.

Workaround: after master's crash stop SQL_THREAD on a slave, check the file name and position of the last executed transaction (SP). Convert binary logs to a text file using mysqlbinlog tool with a start position (SP). Execute all transactions using mysql client (mysql -u -p < mysql-relay-bin.000005.text).

Suggested fix: Change master to master_delay = 0 should not purge all relay logs on a slave.
[23 Jul 2017 7:33] MySQL Verification Team
Bug 26410977 - CHANGING MASTER_DELAY DISCARDS RELAY LOGS IF IO_THREAD IS STOPPED
[24 Jul 2017 11:41] MySQL Verification Team
Hi,

Thanks for your report. This is a duplicate of Bug #26410977 - CHANGING MASTER_DELAY DISCARDS RELAY LOGS IF IO_THREAD IS STOPPED

A workaround is to use only STOP SLAVE SQL_THREAD; rather than STOP SLAVE;
before executing CHANGE MASTER TO MASTER_DELAY=N;

all best
Bogdan
[24 Jul 2017 11:47] Дмитрий Афонасов
Dear support team,

mysql> stop slave SQL_THREAD;
Query OK, 0 rows affected (0.05 sec)
mysql> change master to master_delay=0;
ERROR 1198 (HY000): This operation cannot be performed with a running slave; run STOP SLAVE first

Thus, STOP SLAVE must be executed before executing CHANGE MASTER TO MASTER_DELAY=0.

Best regards.
[24 Jul 2017 12:39] MySQL Verification Team
Hi,

That workaround works on 5.7, on 5.6 there's no workaround I'm afraid

The bug is being processed, not much we can do now but wait for it to be fixed.

all best
Bogdan

slave1 [localhost] {msandbox} ((none)) > CHANGE MASTER TO MASTER_DELAY = 100000;
ERROR 3085 (HY000): This operation cannot be performed with a running slave sql thread; run STOP SLAVE SQL_THREAD FOR CHANNEL '' first.
slave1 [localhost] {msandbox} ((none)) > STOP SLAVE SQL_THREAD;
Query OK, 0 rows affected (0.00 sec)

slave1 [localhost] {msandbox} ((none)) > CHANGE MASTER TO MASTER_DELAY = 100000;
Query OK, 0 rows affected (0.00 sec)

slave1 [localhost] {msandbox} ((none)) > start slave;
Query OK, 0 rows affected (0.00 sec)

slave1 [localhost] {msandbox} ((none)) > CHANGE MASTER TO MASTER_DELAY = 0;
ERROR 3085 (HY000): This operation cannot be performed with a running slave sql thread; run STOP SLAVE SQL_THREAD FOR CHANNEL '' first.
slave1 [localhost] {msandbox} ((none)) > STOP SLAVE SQL_THREAD;
Query OK, 0 rows affected (0.00 sec)

slave1 [localhost] {msandbox} ((none)) > CHANGE MASTER TO MASTER_DELAY = 0;
Query OK, 0 rows affected (0.00 sec)

slave1 [localhost] {msandbox} ((none)) > select @@version;
+------------+
| @@version  |
+------------+
| 5.7.19-log |
+------------+
1 row in set (0.00 sec)
[25 Jul 2017 2:39] Mikhail Golenkov
Dear support team,

Yes, It works on 5.7. Thank you.

Unfortunately, one of our critical application works with 5.6 only. 
Hope, that the same workaround will be available on 5.6 soon.

Best regards.
[25 Jul 2017 4:51] MySQL Verification Team
:D thanks for confirmation.
I rather think the bug will be fixed and the workaround will not be needed.

all best
Bogdan
[20 Sep 2017 6:22] Venkatesh Duggirala
Post by Developer:
==================

Hello Mikhail, 

Thank you for the bug report. There are two things that I would like to mention
in this bug while I mark its status as "Not a bug"

1) In the description(and the title) you were mentioning "binlogs". Actually the term for the logs that were downloaded on Slave by receiver thread are called "Relay logs". I assume that you were actually talking about "Relay logs" are getting purged and not "binary logs". Please correct me if that is not the case.

2) Assuming that point-1 holds good, purging the relay logs during "Change master" command is known, documented legacy behaviour if you do not specify
Relay_log_file/Relay_log_pos in the command.

Please have a look at 
 
https://dev.mysql.com/doc/refman/5.6/en/change-master-to.html  , this says 

"CHANGE MASTER TO" deletes all relay log files and starts a new one, unless
you specify RELAY_LOG_FILE or RELAY_LOG_POS"

Please let me know if you see any issues with the relay log files even when you specify one of these two values.

Regards,
Venkatesh.