MySQL Bugs: #93081: Please implement a better relay log recovery.

Bug #93081	Please implement a better relay log recovery.
Submitted:	5 Nov 2018 11:08	Modified:	8 Nov 2018 8:13
Reporter:	J-F Legacy Gagné
Status:	Verified
Category:	MySQL Server: Replication	Severity:	S2 (Serious)
Version:	5.6, 5.7, 8.0	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
Hi,

I am opening this bug to suggest a solution to other bugs. Some could say that this is a feature request, but I am classifying it as S2 because it is a solution to an S2 bug (Bug#81840). The corresponding bugs are the following:

Bug#74321: Execute relay-log-recovery only when needed.
Bug#74323: Avoid overloading the master NIC on relay-log-recovery of a lagging slave.
Bug#74324: Make keeping relay logs (relay_log_purge=0) crash safe.
Bug#81840: Automatic Replication Recovery Does Not Handle Lost Relay Log Events.

All those bugs have the same root cause: relay log recovery is too simplistic. By implementing a better relay log recovery, all of them could be solved, the most important being IMHO Bug#81840, which makes MTS replication non-crash-safe without GTIDs.

So please consider implementing a better relay log recovery.

Many thanks for looking into that,

JFG

How to repeat:
See the corresponding bugs listed in the description above.

Suggested fix:
1) To solve Bug#74323, relay log recovery could scan the relay logs and get rid of only the part that is corrupted.

2) A way to solve Bug#74321 would be to put an additional flag in the master-info table to indicate that the IO Thread has been stopped in a clean way. When the IO Thread is started, this flag would be set to FALSE. When the IO Thread is stopped, it would be set to TRUE.

2b) If #2 above is too impactful, #1 above can also limit the impacts of doing relay log recovery on every restart, hence providing an alternative solution to Bug#74321 (maybe a little IO intensive, but better than re-downloading binlogs from the master).

3) To solve Bug#74324, a combination would do: solution #1 above for the case where the SQL Thread is in valid relay logs, and otherwise #2 plus repositioning the SQL Thread at the right place in the newly downloaded binlogs.

4) To solve Bug#81840, we need to first download the binlogs and then fix the relay log position in the mysql.slave_worker_info table. This is tedious, but not overly complicated.
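The clean-stop flag from suggestion #2 can be sketched as a small state machine. This is only an illustration of the idea, not MySQL code: the `MasterInfo` class and its `io_thread_clean_stop` field are hypothetical stand-ins for the proposed extra column in the master-info table.

```python
from dataclasses import dataclass


@dataclass
class MasterInfo:
    """Hypothetical stand-in for the row in the master-info table."""
    # Assumed new column (does not exist in MySQL): TRUE means the
    # IO Thread was last stopped in a clean way.
    io_thread_clean_stop: bool = True


def start_io_thread(info: MasterInfo) -> None:
    # Set the flag to FALSE before receiving any event, so that a crash
    # from this point on leaves FALSE behind.
    info.io_thread_clean_stop = False


def stop_io_thread(info: MasterInfo) -> None:
    # Only reached on an orderly stop of the IO Thread.
    info.io_thread_clean_stop = True


def needs_relay_log_recovery(info: MasterInfo) -> bool:
    # Checked on server startup: recovery is needed only if the
    # IO Thread was not stopped cleanly.
    return not info.io_thread_clean_stop
```

With such a flag, a restart after a clean `STOP SLAVE` or shutdown would skip recovery, and only a crash while the IO Thread was running would trigger it. The flag would have to be persisted durably for this to hold.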

If checksums are enabled, this should work fine.
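As a rough illustration of suggestion #1 with checksums enabled, the scan could walk the events of a relay log and stop at the first one whose checksum does not verify. The sketch below is a simplified model, not server code: it assumes the 4-byte file magic, the 19-byte little-endian event header, and a CRC32 trailer on every event, and it ignores special cases such as the format description event and transaction boundaries. The names `last_valid_offset` and `make_event` are made up for this example.

```python
import struct
import zlib

MAGIC = b"\xfebin"  # magic bytes at the start of a binlog / relay log file
# Assumed header layout: timestamp, type, server_id, event_length,
# next_position, flags (19 bytes, little-endian).
HEADER = struct.Struct("<IBIIIH")
CRC_LEN = 4  # CRC32 trailer, assumed present on every event


def last_valid_offset(data: bytes) -> int:
    """Return the offset just past the last event whose CRC32 verifies.

    Everything after that offset is a candidate for truncation.
    """
    if not data.startswith(MAGIC):
        return 0
    pos = len(MAGIC)
    while pos + HEADER.size <= len(data):
        _, _, _, event_len, _, _ = HEADER.unpack_from(data, pos)
        end = pos + event_len
        if event_len < HEADER.size + CRC_LEN or end > len(data):
            break  # nonsensical length or event cut short by a crash
        body = data[pos:end - CRC_LEN]
        (stored_crc,) = struct.unpack("<I", data[end - CRC_LEN:end])
        if zlib.crc32(body) != stored_crc:
            break  # corrupted event
        pos = end
    return pos


def make_event(payload: bytes, type_code: int = 2) -> bytes:
    """Build a synthetic event for experimenting with the scan (test helper)."""
    event_len = HEADER.size + len(payload) + CRC_LEN
    body = HEADER.pack(0, type_code, 1, event_len, 0, 0) + payload
    return body + struct.pack("<I", zlib.crc32(body))
```

For example, for a file made of two intact synthetic events followed by a partially written third one, `last_valid_offset` returns the end of the second event, so only the torn tail would need to be discarded and fetched again from the master.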

Hi Jean-François,

Thank you for the report and suggestions.
Verifying this bug so as not to lose the valuable suggestions in this bug report. (This sounds like a wl# with many related issues; the 3/3 Feature Requests referenced in the bug report are already verified, so eventually this might well be closed as a duplicate of one of the listed bugs.)

regards,
Umesh