Description:
Hi,
I am opening this bug to suggest a solution to other bugs. Some could say that this is a feature request, but I am classifying this as S2 as this is a solution to an S2 bug (Bug#81840). The corresponding bugs are the following:
Bug#74321: Execute relay-log-recovery only when needed.
Bug#74323: Avoid overloading the master NIC on relay-log-recovery of a lagging slave.
Bug#74324: Make keeping relay logs (relay_log_purge=0) crash safe.
Bug#81840: Automatic Replication Recovery Does Not Handle Lost Relay Log Events.
All those bugs have the same root cause: relay log recovery is too simplistic. By implementing a better relay log recovery, all of them could be solved, the most important being IMHO Bug#81840, which makes MTS replication not crash safe without GTIDs.
So please consider implementing a better relay log recovery.
Many thanks for looking into that,
JFG
How to repeat:
See the corresponding bugs:
Bug#74321: Execute relay-log-recovery only when needed.
Bug#74323: Avoid overloading the master NIC on relay-log-recovery of a lagging slave.
Bug#74324: Make keeping relay logs (relay_log_purge=0) crash safe.
Bug#81840: Automatic Replication Recovery Does Not Handle Lost Relay Log Events.
Suggested fix:
1) To solve Bug#74323, relay log recovery could scan the relay logs and only get rid of the part of the relay logs that is corrupted.
2) A way to solve Bug#74321 would be to put an additional flag in the master-info table to indicate that the IO Thread has been stopped in a clean way. When the IO Thread is started, this flag would be set to FALSE. When the IO Thread is stopped, it would be set to TRUE.
2b) If #2 above is too impactful, #1 above can also limit the impacts of doing relay log recovery on every restart, hence providing an alternative solution to Bug#74321 (maybe a little IO intensive, but better than re-downloading binlogs from the master).
3) To solve Bug#74324, a combination of solution #1 above for the case where the SQL Thread is in valid relay logs, and #2 + re-positioning the SQL Thread at the right place in the newly downloaded binlogs would do.
4) To solve Bug#81840, we need to first download the binlogs and then fix the relay log position in the mysql.slave_worker_info table. This is tedious, but not overly complicated.
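
To illustrate #1 and #2 above, here is a rough Python sketch (not server code, just the idea). It assumes the v4 event layout with a 19-byte common header whose event_length field gives the full event size, and it only checks that each event is structurally complete; a real implementation would presumably also verify event checksums. The names last_valid_offset and recovery_action are hypothetical, as is passing the "clean stop" flag as a boolean.

```python
import struct

BINLOG_MAGIC = b"\xfebin"  # 4-byte magic at the start of a binary/relay log file
HEADER_LEN = 19            # common event header length (v4 format)


def last_valid_offset(data: bytes) -> int:
    """Return the offset just past the last structurally complete event.

    Returns 0 if the file does not start with the expected magic.
    Assumption: only structure is checked here, not checksums.
    """
    if data[:4] != BINLOG_MAGIC:
        return 0
    pos = 4
    while pos + HEADER_LEN <= len(data):
        # Header: timestamp(4) type(1) server_id(4) event_length(4) log_pos(4) flags(2)
        _ts, _type, _sid, length, _log_pos, _flags = struct.unpack_from(
            "<IBIIIH", data, pos
        )
        # Stop at an impossible length or an event cut short by a crash.
        if length < HEADER_LEN or pos + length > len(data):
            break
        pos += length
    return pos


def recovery_action(clean_stop: bool, data: bytes) -> tuple:
    """Decide what to do with one relay log file on startup (hypothetical policy).

    clean_stop models the flag from #2: TRUE when the IO Thread was stopped
    cleanly, in which case the relay log is kept without scanning.
    """
    if clean_stop:
        return ("keep", len(data))
    valid = last_valid_offset(data)
    if valid == len(data):
        return ("keep", valid)
    # Only the corrupted tail is dropped; the IO Thread would then resume
    # fetching from the master position matching the last valid event.
    return ("truncate", valid)
```

With something like this, a crash would cost a scan of the relay logs and a re-download of the truncated tail only, instead of a re-download of everything the SQL Thread has not yet executed.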