Bug #77496 | Replication position lost after crash on MTS configured slave | ||
---|---|---|---|
Submitted: | 26 Jun 2015 6:11 | Modified: | 20 May 2016 11:56 |
Reporter: | Simon Mudd (OCA) | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Server: Replication | Severity: | S2 (Serious) |
Version: | 5.6.25, 5.6.27 | OS: | Any |
Assigned to: | CPU Architecture: | Any | |
Tags: | Failed to initialize the master info structure, MTS, recovery |
[26 Jun 2015 6:11]
Simon Mudd
[26 Jun 2015 6:44]
Simon Mudd
/etc/my.cnf for server which had the problem.
Attachment: etc_my.cnf.slave-5-6-25 (application/octet-stream, text), 3.22 KiB.
[30 Jun 2015 9:15]
Trent Lloyd
I wonder if the following errors from having relay_log_recovery enabled are causing the server to actually remove the multi-threaded slave execution state. Will test this. 2015-06-24 13:49:03 3895 [ERROR] --relay-log-recovery cannot be executed when the slave was stopped with an error or killed in MTS mode; consider using RESET SLAVE or restart the server with --relay-log-recovery = 0 followed by START SLAVE UNTIL SQL_AFTER_MTS_GAPS 2015-06-24 13:49:03 3895 [ERROR] Failed to initialize the master info structure 2015-06-24 14:01:37 3895 [ERROR] Slave SQL: Slave failed to initialize relay log info structure from the repository, Error_code: 1872
[30 Jun 2015 10:42]
Trent Lloyd
The replication position does not appear lost at all, the slave is failing to initialize due to relay_log_recovery=on which is not compatible with multi-threaded slave. (It is compatible in 5.6.26+ when using GTID). slave_worker_info is empty because it is not used unless relay-log-info-repository=TABLE Lastly the slave state will not be crash safe/consistent unless relay-log-info-repository=TABLE is used. This is documented here: http://dev.mysql.com/doc/refman/5.6/en/slave-logs.html#replication-implementation-crash-sa... There is one main outstanding issue here which is that without GTID, the relay logs are likely to become corrupt and relay_log_recovery=on cannot be used. Starting with relay_log_recovery=off will let you use SQL_AFTER_MTS_GAPS to catch up and then reset the relay logs but that would be a manual process at this time.
[1 Jul 2015 22:25]
Simon Mudd
Trent. "the slave is failing to initialize due to relay_log_recovery=on which is not compatible with multi-threaded slave". http://dev.mysql.com/doc/refman/5.6/en/replication-options-slave.html does not state that this is incompatible. Or not imply that it should not be used. (I'm not using GTID.) If this is the case should MySQLD either startup in MTS mode but with relay_log_recovery disabled, or start with relay_log_recovery = 1 and MTS disabled? Either way this does not seem intuitive or clear. Safer recovery options would seem to be helpful. I was using relay-log-info-repository=TABLE so mysql.slave_worker_info and this is supposed to ensure crash safety. Relay logs can not be corrupt as the binlog checksum should allow us to catch that. it won't solve the problem but it will ensure that if the file is corrupt this will be noticed. Truncating at the last known good point would seem like a good idea, scanning from the beginning of the last relay log file to ensure there's been no corruption might be a good idea. So I'm not fully sure I understand the comments here. The configuration I was using was in theory supposed to be somewhat safe and the replication configuration was stored in InnoDB tables. Therefore InnoDB recovery would have been expected to restore some state even if that state were not strictly speaking 100% correct.
[27 Jul 2015 11:33]
MySQL Verification Team
Hello Simon, Thank you for the feedback. I'm not sure whether this is intentional behavior but observed this during my tests. Thanks, Umesh
[27 Jul 2015 11:33]
MySQL Verification Team
test results
Attachment: 77496.5_6_27.results (application/octet-stream, text), 14.89 KiB.
[21 Jan 2016 20:47]
Jean-François Gagné
Hi, I am observing the same behavior for: -MySQL 5.6.28, -MySQL 5.7.10 with slave-parallel-type=DATABASE, -MySQL 5.7.10 with slave-parallel-type=LOGICAL_CLOCK with slave_preserve_commit_order=0, -and MySQL 5.7.10 with slave-parallel-type=LOGICAL_CLOCK with slave_preserve_commit_order=1. Concerning slave-parallel-type=LOGICAL_CLOCK with slave_preserve_commit_order=1, it is a special case of this bug where no gap exists. I logged Bug #80103 for it as if there is more detail in fixing the general case, I would like to see this special case fixed with less delay. Related reports: Bug #80102. Thanks, JFG
[22 Jan 2016 4:52]
Jesper wisborg Krogh
Posted by developer: See also Bug 18881475/Bug 72824
[20 May 2016 11:56]
David Moss
Thanks for your feedback, this has been fixed in upcoming versions and the following was added to the 5.6.31 and 5.7.13 changelogs: If a multi-threaded replication slave running with relay_log_recovery=1 stopped unexpectedly, during restart the relay log recovery process could fail. This was due to transaction inconsistencies not being filled, see Handling an Unexpected Halt of a Replication Slave. Prior to this fix, to recover from this situation required manually setting relay_log_recovery=0, starting the slave with START SLAVE UNTIL SQL_AFTER_MTS_GAPS to fix any transaction inconsistencies and then restarting the slave with relay_log_recovery=1. This process has now been automated, enabling relay log recovery of a multi-threaded slave upon restart automatically.
[5 Aug 2016 13:26]
David Moss
Posted by developer: After discussing with Sujatha, this new request is now being covered in Bug#21507981. Jesper has relayed this message via the SR mentioned above. Reclosing.
[5 Aug 2016 13:46]
J-F Not Use Gagné
See Bug#81840 for new request.
[19 Sep 2016 8:23]
Erlend Dahl
Bug#72824 Can't restart replication after restart with slave_parallel_workers and error was marked as a duplicate
[9 Feb 2017 23:04]
Vojtech Kurka
We ran into this today after a crash on MT slave with 5.7.16 and 5.7.15 master. This workaround did work: 1) keep record of the info in Relay log info;(show slave status) 2) stop slave; 3) reset slave; 4) start slave; 5) stop slave; 6) SET GLOBAL gtid_purged='gtid position recorded in step 1'; change master to...; 7) start slave; Did we run into https://bugs.mysql.com/bug.php?id=81840 ? or is there any other unfixed bug?