Bug #93081 Please implement a better relay log recovery.
Submitted: 5 Nov 2018 11:08
Modified: 8 Nov 2018 8:13
Reporter: Jean-François Gagné
Status: Verified
Category: MySQL Server: Replication
Severity: S2 (Serious)
Version: 5.6, 5.7, 8.0
OS: Any
Assigned to:
CPU Architecture: Any

[5 Nov 2018 11:08] Jean-François Gagné
Description:
Hi,

I am opening this bug to suggest a solution to other bugs. Some could say that this is a feature request, but I am classifying it as S2 because it is a solution to an S2 bug (Bug#81840). The corresponding bugs are the following:

Bug#74321: Execute relay-log-recovery only when needed.
Bug#74323: Avoid overloading the master NIC on relay-log-recovery of a lagging slave.
Bug#74324: Make keeping relay logs (relay_log_purge=0) crash safe.
Bug#81840: Automatic Replication Recovery Does Not Handle Lost Relay Log Events.

All those bugs have the same root cause: relay log recovery is too simplistic. Implementing a better relay log recovery could solve all of them, the most important being IMHO Bug#81840, which makes MTS replication not crash safe without GTIDs.

So please consider implementing a better relay log recovery.

Many thanks for looking into that,

JFG

How to repeat:
See the corresponding bugs:

Bug#74321: Execute relay-log-recovery only when needed.
Bug#74323: Avoid overloading the master NIC on relay-log-recovery of a lagging slave.
Bug#74324: Make keeping relay logs (relay_log_purge=0) crash safe.
Bug#81840: Automatic Replication Recovery Does Not Handle Lost Relay Log Events.

Suggested fix:
1) To solve Bug#74323, relay log recovery could scan the relay logs and discard only the part that is corrupted, instead of discarding all of them.

2) A way to solve Bug#74321 would be to put an additional flag in the master-info table to indicate that the IO Thread has been stopped in a clean way. When the IO Thread is started, the flag would be set to FALSE. When the IO Thread is stopped, it would be set to TRUE.

2b) If #2 above is too impactful, #1 above can also limit the impacts of doing relay log recovery on every restart, hence providing an alternative solution to Bug#74321 (maybe a little IO intensive, but better than re-downloading binlogs from the master).

3) To solve Bug#74324, a combination would do: solution #1 above for the case where the SQL Thread is in valid relay logs, and otherwise #2 plus repositioning the SQL Thread at the right place in the newly downloaded binlogs.

4) To solve Bug#81840, we need to first download the binlogs and then fix the relay log position in the mysql.slave_worker_info table. This is tedious, but not overly complicated.
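
To illustrate how #1 and #2 could fit together, here is a minimal Python sketch. It is only an illustration under stated assumptions, not MySQL code: the record framing (a length and a CRC32 in front of each event), the function names, and the io_thread_stopped_cleanly flag are all hypothetical, and the real relay log format is different. The idea is to skip the scan when the IO Thread stopped cleanly, and otherwise to keep everything up to the last intact event.

```python
import struct
import zlib

# Hypothetical record header: payload length and CRC32 of the payload.
# This is a toy framing, NOT the real MySQL relay log format.
HEADER = struct.Struct("<II")


def encode_event(payload: bytes) -> bytes:
    """Frame one event in the toy format."""
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload


def valid_prefix_length(log: bytes) -> int:
    """Return the offset just past the last intact event in `log`."""
    offset = 0
    while offset + HEADER.size <= len(log):
        length, crc = HEADER.unpack_from(log, offset)
        start = offset + HEADER.size
        end = start + length
        if end > len(log):
            break  # event was only partially written
        if zlib.crc32(log[start:end]) != crc:
            break  # checksum mismatch: event is corrupted
        offset = end
    return offset


def recover_relay_log(log: bytes, io_thread_stopped_cleanly: bool) -> bytes:
    """Suggestion #2: skip recovery after a clean stop.
    Suggestion #1: otherwise drop only the corrupted tail."""
    if io_thread_stopped_cleanly:
        return log
    return log[:valid_prefix_length(log)]
```

With this approach, only the events after the returned offset would need to be fetched again from the master, instead of everything the SQL Thread has not yet applied.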
[5 Nov 2018 16:21] Daniël van Eeden
If checksums are enabled this should work fine.
[8 Nov 2018 8:13] Umesh Shastry
Hi Jean-François,

Thank you for the report and suggestions.
Verifying this bug so as not to lose the valuable suggestions in this bug report. (Sounds like a wl# with many related issues. The 3/3 Feature Requests referenced in the bug report are already verified, so eventually this might well be closed as a duplicate of one of the listed Bug(s)#.)

regards,
Umesh