MySQL Bugs: #90107: RSync through snapshot using Multi-Thread replication fails because of duplicate

Bug #90107	RSync through snapshot using Multi-Thread replication fails because of duplicate
Submitted:	16 Mar 2018 11:18	Modified:	20 Apr 2018 7:50
Reporter:	Andrea Dante Bozzola	Email Updates:
Status:	Not a Bug	Impact on me:	None
Category:	MySQL Server: Replication	Severity:	S3 (Non-critical)
Version:	5.7.21-enterprise-commercial-advanced-lo	OS:	CentOS (6.9 (Final))
Assigned to:	MySQL Verification Team	CPU Architecture:	Any (x86_64)

Description:
I used one of the other slaves of the former master of my server as a donor and run an RSync to sync the lv containing datadir, innodb_data_files and innodb_log_files target.

All MySQL boxes run the same OS version and the same MySQL version.
All use Multi-Threaded replication with ROW-Based binary logs format with GTID and MASTER_AUTO_POSITION = 1

+------------------------+-------+
| Variable_name          | Value |
+------------------------+-------+
| master_info_repository | TABLE |
| slave_parallel_workers | 8     |
| slave_parallel_type | LOGICAL_CLOCK |
+---------------------+---------------+
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Variable_name                    | Value                                                                                                                                                                  |
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| binlog_gtid_simple_recovery      | ON                                                                                                                                                                     |
| enforce_gtid_consistency         | ON                                                                                                                                                                     |
| gtid_executed_compression_period | 1000                                                                                                                                                                   |
| gtid_mode                        | ON                                                                                                                                                                     |
| gtid_next                        | AUTOMATIC                                                                                                                                                              |
| gtid_owned                       |                                                                                                                                                                        |
| session_track_gtids              | OFF                                                                                                                                                                    |
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
To keep the same server UUID for the target, its auto.cnf filed has been kept
The replication fails because of a duplicate entry in the binary log which was open on the master instance when the replication was stopped on the donor

NOTE: the donor does not have any replication error

How to repeat:
+ PREREQUISITES
- three MySQL instances (one master and two slaves) with (at least) ~4k queries/s using ROW-based binary logs format
- configure Multi-Threaded replication and GTID as above
- MASTER_AUTO_POSITION = 1
- stop replication between the donor and the master
++ DONOR (one of the two slaves)
-- STOP SLAVE;START SLAVE UNTIL SQL_AFTER_MTS_GAPS;
-- SHOW MASTER STATUS\G until sql_thread is stopped
-- FLUSH TABLES WITH READ LOCK;
-- SHOW MASTER STATUS; SHOW SLAVE STATUS\G (this is just to have track of the positions and be sure nothing is writing before tajing the snapshot)
-- take the snapshot
-- UNLOCK TABLES; START SLAVE;
++ TARGET
-- save auto.cnf
-- rsync from donor
-- restore saved auto.cnf
-- start mysqld service
-- start replication
-- once replication reads the binary log open at the time the slave has been stopped an error should happen

Hi,

I don't believe we shown the "flush tables with read lock; then take snapshot" in our documentation as a safe thing to do with innodb. Please correct me if I'm wrong.

Thanks
Bogdan

Hi Bogdan

the documentation for FLUSH TABLES does not specify anything about INNODB (https://dev.mysql.com/doc/refman/5.7/en/flush.html), nor anything is specified in the "How MySQL Handles FLUSH TABLES" (https://dev.mysql.com/doc/internals/en/flush-tables.html) page.

If this is not the best way to handle it we are open to suggestions, we are not aware of any other way to lock the tables in a consistent state and we have used the same procedure so far with no errors on single-threaded replication.

Hi,
Well nowhere is stated that it's safe to copy raw binary files (except for myisam) in any situation with mysql so.. it is not a safe thing normally.

You can do it in some cases but there are issues as you seen.

mysql backup could be solution
https://www.mysql.com/products/enterprise/backup.html

I tried to backup with mysqlbackup, instead flush/rsync, and was not able to reproduce the problem.

Now, to be sure I did not miss something you can try to reproduce your issue on a staging platform but not by doing flush/rsync but shutdown/rsync (yes I know it is not the same) and if it works I'm right and the problem is unsafe rsync of live files (as flush tables does not leave tables in consistent state as innodb continues to write to those files after flush) but if you reproduce the problem with shutdown then there's something else wrong and we should revisit the problem

kind regards
Bogdan

Thank you, we will try the solution you propose asap

Shutting down the donor instance works.

One question, in case we cannot afford to shutdown the instance, can tablespace copy be a viable solution?

Hi,

I don't see a reason why not attm but can't confirm 100% as I never tried that myself.

It's not a documented procedure for now but should work

all best
Bogdan