MySQL Bugs: #80102: Message in log after MTS crash misleading.

Bug #80102	Message in log after MTS crash misleading.
Submitted:	21 Jan 2016 20:08	Modified:	6 Jun 2016 7:29
Reporter:	Jean-François Gagné
Status:	Closed
Category:	MySQL Server: Replication	Severity:	S3 (Non-critical)
Version:	5.6.28, 5.7.10, 5.7.12	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
Hi,

after a crash on an MTS slave (5.6.28, or 5.7.10), relay log recovery (relay-log-recovery=1) will fail with the following message:

2016-01-18 22:41:22 41180 [ERROR] --relay-log-recovery cannot be executed when the slave was stopped with an error or killed in MTS mode; consider using RESET SLAVE or restart the server with --relay-log-recovery = 0 followed by START SLAVE UNTIL SQL_AFTER_MTS_GAPS
2016-01-18 22:41:22 41180 [ERROR] Failed to initialize the master info structure

This error message is misleading: restarting MySQL with relay-log-recovery=0 alone is dangerous, because the relay logs on disk and the slave IO thread position could be out of sync. Restarting the slave in that state could lead to duplicate-key or key-not-found errors in the best case, or to silent data corruption in the worst case. The right way to restart MySQL is with relay-log-recovery=0 AND skip-slave-start.
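
As a rough illustration, the following Python sketch recognizes the error quoted above in an error log line, so that a restart script could refuse to continue without skip-slave-start. The function name and the substring-matching rule are assumptions made for this example; only the message text comes from the log excerpt.

```python
# Fragment of the error text quoted above. Matching on a substring is an
# assumption of this sketch, not a format the server guarantees.
MTS_RECOVERY_ERROR = (
    "--relay-log-recovery cannot be executed when the slave was stopped "
    "with an error or killed in MTS mode"
)


def is_mts_recovery_error(log_line):
    """Return True if the log line carries the MTS relay log recovery error."""
    return "[ERROR]" in log_line and MTS_RECOVERY_ERROR in log_line
```

The second line of the log excerpt ("Failed to initialize the master info structure") does not match, so only the first line triggers the check.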

Related to Bug #77496.

Thanks,

JFG

How to repeat:
Crash a slave running in MTS mode, and restart MySQL with relay-log-recovery=1.

Suggested fix:
Add to the error message that restarting MySQL with relay-log-recovery=0 should also be done with skip-slave-start.

If the MTS session left any gaps, the requirement to fill them as
specified in the error message can indeed be infeasible, as the bug describes.

Yet --relay-log-recovery=1, as binlog-position-based recovery, could be extended to work with MTS recovery. It just needs a recovery submode that
makes the slave resume reading from the master (via the IO thread) from
the so-called low-water-mark execution position, which is recorded
by the slave applier (Coordinator thread plus Workers).
MTS recovery would need only minor changes
to ignore the pre-crash relay-log coordinates altogether when computing gaps.
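
The low-water-mark idea can be illustrated with a small Python sketch. It assumes that transactions are numbered consecutively in relay log order and that the set of transactions committed by the workers is known. Both are simplifications for illustration, not a description of how the server stores its state, and the function names are hypothetical.

```python
def low_water_mark(committed, first=1):
    """Return the highest sequence number N such that every transaction
    from `first` through N has been committed (first - 1 if none has)."""
    done = set(committed)
    lwm = first - 1
    while lwm + 1 in done:
        lwm += 1
    return lwm


def gaps(committed, first=1):
    """Return the uncommitted sequence numbers below the highest committed one."""
    done = set(committed)
    if not done:
        return []
    return [n for n in range(first, max(done)) if n not in done]
```

With committed transactions {1, 2, 3, 5, 6, 8}, the low-water mark is 3 and the gaps are 4 and 7.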

Hello Jean,

Thank you for the report and feedback!

Thanks,
Umesh

The solution for this issue was implemented as part of the fix for https://bugs.mysql.com/bug.php?id=77496.
The fix is available in MySQL versions 5.6.31 and 5.7.13.
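
A minimal Python sketch of a version check based on the two release numbers above. The function name is hypothetical, and treating every series after 5.7 as fixed is an assumption of this example rather than something stated in this report.

```python
def has_automated_mts_recovery(version):
    """Return True if `version` (for example "5.7.13" or "5.7.13-log") is at
    or above the release that carries the fix in its series."""
    numbers = version.split("-")[0].split(".")[:3]
    major, minor, patch = (int(part) for part in numbers)
    if (major, minor) == (5, 6):
        return patch >= 31
    if (major, minor) == (5, 7):
        return patch >= 13
    # Assumption: series after 5.7 include the fix; earlier series do not.
    return (major, minor) > (5, 7)
```

For the versions listed in this report, 5.6.28, 5.7.10 and 5.7.12 all return False.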

If a multi-threaded replication slave running with relay_log_recovery=1 stopped unexpectedly, during restart the relay log recovery process could fail. This was due to transaction inconsistencies not being filled; see Handling an Unexpected Halt of a Replication Slave. Prior to this fix, recovering from this situation required manually setting relay_log_recovery=0, starting the slave with START SLAVE UNTIL SQL_AFTER_MTS_GAPS to fix any transaction inconsistencies, and then restarting the slave with relay_log_recovery=1. This process has now been automated, enabling relay log recovery of a multi-threaded slave upon restart automatically. The above-mentioned error message has been removed.
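
For servers without the fix, the manual procedure described above can be written down as an ordered checklist. The Python sketch below does this and adds skip-slave-start to the first restart, as the reporter recommends. The function name and the (action, detail) layout are assumptions made for this example.

```python
def manual_mts_recovery_steps():
    """Ordered (action, detail) pairs for the pre-fix manual recovery."""
    return [
        # skip-slave-start is the reporter's addition, so that replication
        # does not start automatically on this restart.
        ("restart server with options", "--relay-log-recovery=0 --skip-slave-start"),
        ("run SQL statement", "START SLAVE UNTIL SQL_AFTER_MTS_GAPS"),
        ("restart server with options", "--relay-log-recovery=1"),
    ]


if __name__ == "__main__":
    steps = manual_mts_recovery_steps()
    for number, (action, detail) in enumerate(steps, start=1):
        print(f"{number}. {action}: {detail}")
```

Keeping the steps as data rather than executing them is deliberate: each step should be confirmed by an operator before moving to the next.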

Hence closing this bug as fixed.