MySQL Bugs: #81840: Automatic Replication Recovery Does Not Handle Lost Relay Log Events

Bug #81840	Automatic Replication Recovery Does Not Handle Lost Relay Log Events
Submitted:	14 Jun 2016 4:53
Reporter:	Jesper Wisborg Krogh
Status:	Verified
Category:	MySQL Server: Replication	Severity:	S2 (Serious)
Version:	5.6	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
This is a follow-up to Bug 21507981 / http://bugs.mysql.com/bug.php?id=77496.

The fix for the above bug handles some of the replication crash recovery cases, but one case is still missing:

- we have transactions A, B, C, D, E in the relay logs, each in a different, independent schema (so they can be applied in parallel)

- everything up to and including transaction A is committed --> Relay_Master_Log_File and Exec_Master_Log_Pos point to transaction B

- transactions C and E are committed, but B and D are not --> we have gaps

- transactions up to and including C are synced in the relay logs, but D and E are not (sync_relay_log = 10000 by default, so many events can be in the relay logs without being synced to disk)

- the OS crashes --> D and E disappear from the relay logs

- after restarting the OS and MySQL, recovery with relay_log_recovery=1 fails, so we restart MySQL with relay_log_recovery=0 and skip-slave-start. START SLAVE UNTIL SQL_AFTER_MTS_GAPS then fails as well: it cannot run transaction D, because D disappeared from the relay logs in the OS crash
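
The last step above can be sketched as the statements below. This is only an illustration of the sequence described in this report, assuming the server was started with relay_log_recovery=0 and skip-slave-start. The comments restate what the report says happens; they are not output captured from a server.

```sql
-- Check how often the relay log is synced to disk. The report cites the
-- default of 10000 events, which leaves a large unsynced window.
SHOW GLOBAL VARIABLES LIKE 'sync_relay_log';

-- Inspect the applier position before attempting recovery.
-- In the scenario above, Relay_Master_Log_File and Exec_Master_Log_Pos
-- point to transaction B.
SHOW SLAVE STATUS;

-- Ask the multi-threaded applier to fill the gaps (B and D) and then stop.
-- In the scenario above this cannot succeed, because transaction D is no
-- longer present in the relay logs after the OS crash.
START SLAVE UNTIL SQL_AFTER_MTS_GAPS;
```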

How to repeat:
See above

Suggested fix:
Support recovering the missing relay log events so recovery can proceed.

We probably ran into this today after a crash on an MT slave running 5.7.16 with a 5.7.15 master.
After the system restart, the slave reported "Slave failed to initialize relay log info structure from the repository".

This workaround worked:

1) Record the relay log info (SHOW SLAVE STATUS);
2) STOP SLAVE;
3) RESET SLAVE;
4) START SLAVE;
5) STOP SLAVE;
6) SET GLOBAL gtid_purged='gtid position recorded in step 1'; CHANGE MASTER TO ...;
7) START SLAVE;

We seem to be hitting this issue with 5.7.18.
The bug is marked as verified. Is there a fix planned for it?

Thanks
Prasad

Two work-arounds for this bug:

1) sync_relay_log = 1
2) slave_preserve_commit_order = ON (when slave_parallel_type = LOGICAL_CLOCK)
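
A minimal sketch of applying the two work-arounds at runtime, assuming a 5.7 slave (slave_preserve_commit_order does not exist in 5.6). In MySQL the parallel applier mode is controlled by the slave_parallel_type variable. Values set with SET GLOBAL are lost on restart, so the same settings would also need to go into the server's option file.

```sql
-- Work-around 1: sync the relay log to disk after every event.
-- This narrows the window in which a crash can lose relay log events,
-- at the cost of more disk syncs.
SET GLOBAL sync_relay_log = 1;

-- Work-around 2: make the parallel applier commit in relay log order,
-- so that no gaps are left behind.
-- Assumption: binary logging and log_slave_updates are already enabled
-- on the slave, which 5.7 requires for slave_preserve_commit_order.
STOP SLAVE;
SET GLOBAL slave_parallel_type = 'LOGICAL_CLOCK';
SET GLOBAL slave_preserve_commit_order = ON;
START SLAVE;
```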

See Bug#93081 for a potential solution.

Is this fixed (in the GTID auto-position case) by https://bugs.mysql.com/bug.php?id=92882 (fixed in 5.7.28, 8.0.18)?