Bug #81840 Automatic Replication Recovery Does Not Handle Lost Relay Log Events
Submitted: 14 Jun 2016 4:53
Reporter: Jesper wisborg Krogh
Status: Verified
Category: MySQL Server: Replication
Severity: S2 (Serious)
Version: 5.6
OS: Any
Assigned to:
CPU Architecture: Any

[14 Jun 2016 4:53] Jesper wisborg Krogh
Description:
This is a follow-up to Bug 21507981 / http://bugs.mysql.com/bug.php?id=77496.

The above bug fix handles some of the replication crash recovery cases, but there is one missing:

- we have transactions A, B, C, D, E in the relay logs, each in a different, independent schema (so they can be applied in parallel)

- everything up to and including transaction A is committed --> Relay_Master_Log_File and Exec_Master_Log_Pos point to transaction B

- transactions C and E are committed, but B and D are not --> we have gaps

- transactions up to and including C are synced in the relay logs, but D and E are not (sync_relay_log = 10000 by default, so many events can be in the relay logs without being synced to disk)

- the OS crashes --> D and E disappear from the relay logs

- after restarting the OS and MySQL, recovery with relay_log_recovery=1 fails, so we restart MySQL with relay_log_recovery=0 and skip-slave-start. START SLAVE UNTIL SQL_AFTER_MTS_GAPS then fails as well: it cannot apply transaction D, because D disappeared from the relay logs in the OS crash
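
The failed recovery attempt in the last step can be sketched as the statements below. This is only an illustration of the sequence described above, not a tested reproduction. It assumes the server was already restarted with relay_log_recovery=0 and skip-slave-start in its configuration. The two SHOW statements are added here for inspection and are not part of the original report.

```sql
-- Assumption: mysqld was restarted with relay_log_recovery=0 and skip-slave-start.

-- Inspect where the applier stopped (Relay_Master_Log_File, Exec_Master_Log_Pos).
SHOW SLAVE STATUS;

-- Check how often the relay log is synced to disk.
SHOW GLOBAL VARIABLES LIKE 'sync_relay_log';

-- Ask the multi-threaded applier to fill the gaps (B and D in the example).
-- In the scenario above this fails, because D is no longer in the relay logs.
START SLAVE UNTIL SQL_AFTER_MTS_GAPS;
```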

How to repeat:
See above

Suggested fix:
Support recovering the missing relay log events so recovery can proceed.
[9 Feb 2017 23:06] Vojtech Kurka
We probably ran into this today after a crash on a multi-threaded slave running 5.7.16 with a 5.7.15 master.
After the system restart, the slave reported "Slave failed to initialize relay log info structure from the repository".

This workaround worked:

1) record the relay log info (show slave status);
2) stop slave;
3) reset slave;
4) start slave; 
5) stop slave;
6) SET GLOBAL gtid_purged='gtid position recorded in step 1'; change master to...;
7) start slave;
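
Written out as statements, the steps above might look like the sketch below. The GTID value is a placeholder for whatever was recorded in step 1, and the CHANGE MASTER TO options are left elided exactly as in the comment. Note that in MySQL 5.7 gtid_purged can only be set while gtid_executed is empty; the comment does not say how that was arranged, so the sketch does not either.

```sql
-- 1) Record the replication coordinates and GTID sets before changing anything.
SHOW SLAVE STATUS;

-- 2) to 5) Reset the replication state, then start and stop once more.
STOP SLAVE;
RESET SLAVE;
START SLAVE;
STOP SLAVE;

-- 6) Restore the position recorded in step 1 (placeholder value, not a real GTID set).
SET GLOBAL gtid_purged = '<GTID set recorded in step 1>';
-- CHANGE MASTER TO ...;  (options elided in the original comment)

-- 7) Resume replication.
START SLAVE;
```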
[4 Jan 2018 8:43] Prasad N
We seem to be hitting this issue with 5.7.18.
The bug is marked as verified. Is a fix planned for it?

Thanks
Prasad
[20 Aug 2018 18:16] Jean-François Gagné
Two workarounds for this bug:

1) sync_relay_log = 1
2) slave_preserve_commit_order = ON (when slave_parallel_type = LOGICAL_CLOCK)
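
A minimal sketch of applying either workaround at runtime on the slave, assuming MySQL 5.7, where the parallel applier mode is controlled by slave_parallel_type. SET GLOBAL changes do not survive a restart, so the same settings would also need to go into the server configuration file. In 5.7, slave_preserve_commit_order additionally requires log_bin and log_slave_updates to be enabled on the slave.

```sql
-- Workaround 1: sync the relay log to disk after every event.
SET GLOBAL sync_relay_log = 1;

-- Workaround 2: commit in relay log order so the applier leaves no gaps.
-- The SQL thread must be stopped while these variables are changed.
STOP SLAVE SQL_THREAD;
SET GLOBAL slave_parallel_type = 'LOGICAL_CLOCK';
SET GLOBAL slave_preserve_commit_order = ON;
START SLAVE SQL_THREAD;
```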