Bug #90107 RSync through snapshot using Multi-Thread replication fails because of duplicate
Submitted: 16 Mar 2018 11:18 Modified: 20 Apr 2018 7:50
Reporter: Andrea Dante Bozzola Email Updates:
Status: Not a Bug Impact on me:
None 
Category:MySQL Server: Replication Severity:S3 (Non-critical)
Version:5.7.21-enterprise-commercial-advanced-lo OS:CentOS (6.9 (Final))
Assigned to: Bogdan Kecman CPU Architecture:Any (x86_64)

[16 Mar 2018 11:18] Andrea Dante Bozzola
Description:
I used one of the other slaves of the former master of my server as a donor and run an RSync to sync the lv containing datadir, innodb_data_files and innodb_log_files target.

All MySQL boxes run the same OS version and the same MySQL version.
All use Multi-Threaded replication with ROW-Based binary logs format with GTID and MASTER_AUTO_POSITION = 1

+------------------------+-------+
| Variable_name          | Value |
+------------------------+-------+
| master_info_repository | TABLE |
| slave_parallel_workers | 8     |
| slave_parallel_type | LOGICAL_CLOCK |
+---------------------+---------------+
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Variable_name                    | Value                                                                                                                                                                  |
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| binlog_gtid_simple_recovery      | ON                                                                                                                                                                     |
| enforce_gtid_consistency         | ON                                                                                                                                                                     |
| gtid_executed_compression_period | 1000                                                                                                                                                                   |
| gtid_mode                        | ON                                                                                                                                                                     |
| gtid_next                        | AUTOMATIC                                                                                                                                                              |
| gtid_owned                       |                                                                                                                                                                        |
| session_track_gtids              | OFF                                                                                                                                                                    |
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
To keep the same server UUID for the target, its auto.cnf filed has been kept
The replication fails because of a duplicate entry in the binary log which was open on the master instance when the replication was stopped on the donor

NOTE: the donor does not have any replication error

How to repeat:
+ PREREQUISITES
- three MySQL instances (one master and two slaves) with (at least) ~4k queries/s using ROW-based binary logs format
- configure Multi-Threaded replication and GTID as above
- MASTER_AUTO_POSITION = 1
- stop replication between the donor and the master
++ DONOR (one of the two slaves)
-- STOP SLAVE;START SLAVE UNTIL SQL_AFTER_MTS_GAPS;
-- SHOW MASTER STATUS\G until sql_thread is stopped
-- FLUSH TABLES WITH READ LOCK;
-- SHOW MASTER STATUS; SHOW SLAVE STATUS\G (this is just to have track of the positions and be sure nothing is writing before tajing the snapshot)
-- take the snapshot
-- UNLOCK TABLES; START SLAVE;
++ TARGET
-- save auto.cnf
-- rsync from donor
-- restore saved auto.cnf
-- start mysqld service
-- start replication
-- once replication reads the binary log open at the time the slave has been stopped an error should happen
[20 Mar 2018 21:50] Bogdan Kecman
Hi,

I don't believe we shown the "flush tables with read lock; then take snapshot" in our documentation as a safe thing to do with innodb. Please correct me if I'm wrong.

Thanks
Bogdan
[28 Mar 2018 9:18] Andrea Dante Bozzola
Hi Bogdan

the documentation for FLUSH TABLES does not specify anything about INNODB (https://dev.mysql.com/doc/refman/5.7/en/flush.html), nor anything is specified in the "How MySQL Handles FLUSH TABLES" (https://dev.mysql.com/doc/internals/en/flush-tables.html) page.

If this is not the best way to handle it we are open to suggestions, we are not aware of any other way to lock the tables in a consistent state and we have used the same procedure so far with no errors on single-threaded replication.
[2 Apr 2018 14:05] Bogdan Kecman
Hi,
Well nowhere is stated that it's safe to copy raw binary files (except for myisam) in any situation with mysql so.. it is not a safe thing normally.

You can do it in some cases but there are issues as you seen.

mysql backup could be solution
https://www.mysql.com/products/enterprise/backup.html

I tried to backup with mysqlbackup, instead flush/rsync, and was not able to reproduce the problem.

Now, to be sure I did not miss something you can try to reproduce your issue on a staging platform but not by doing flush/rsync but shutdown/rsync (yes I know it is not the same) and if it works I'm right and the problem is unsafe rsync of live files (as flush tables does not leave tables in consistent state as innodb continues to write to those files after flush) but if you reproduce the problem with shutdown then there's something else wrong and we should revisit the problem

kind regards
Bogdan
[6 Apr 2018 14:22] Andrea Dante Bozzola
Thank you, we will try the solution you propose asap
[20 Apr 2018 7:50] Andrea Dante Bozzola
Shutting down the donor instance works.

One question, in case we cannot afford to shutdown the instance, can tablespace copy be a viable solution?
[20 Apr 2018 11:46] Bogdan Kecman
Hi,

I don't see a reason why not attm but can't confirm 100% as I never tried that myself.

It's not a documented procedure for now but should work

all best
Bogdan