Description:
When trying to configure a MySQL server to be crash safe, one of the recommended options is to enable relay_log_recovery = 1.
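For reference, a minimal my.cnf fragment for this kind of crash-safe slave setup might look like the following (the companion repository options shown exist from 5.6; exact settings depend on the version in use):

```ini
[mysqld]
# Discard relay logs at startup and refetch from the SQL thread's position
relay_log_recovery        = 1
# 5.6+: store replication positions transactionally in InnoDB tables
relay_log_info_repository = TABLE
master_info_repository    = TABLE
```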
This sounds good, and in many cases it resolves the problem of a crash on a slave where the recorded relay log position is not correctly synced with the last writes that the I/O thread actually made.
Without such recovery, the last relay log file may be longer than the I/O thread "thinks" it is, and on restart the thread will append information which is already written there. This leads to the typical duplicate key errors that we see.
So relay_log_recovery = 1 resolves that issue: when the slave starts it simply drops the relay logs, reconnects to the master at the position the SQL thread knows it has executed up to, and fetches events again from that position.
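For clarity, the coordinates used for this reconnect are the SQL thread's executed position, which is visible in SHOW SLAVE STATUS:

```sql
-- With relay_log_recovery = 1, recovery reinitialises the I/O thread
-- from the SQL thread's executed coordinates, reported here as:
SHOW SLAVE STATUS\G
--   Relay_Master_Log_File: master binlog file executed up to
--   Exec_Master_Log_Pos:   position within that file
```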
How to repeat:
There may be occasions when a server has a large number of relay logs.
Two examples come to mind:
(1) Load on the server is high enough that the I/O thread pulls down events faster than the SQL thread(s) can apply them, thus building up a replication delay and a backlog of relay logs.
(2) A deliberately delayed slave (however that may be set up) deliberately does not execute the latest relay log information but "throttles itself" to stay a certain period of time behind. That may be minutes, hours or days. I typically have several slaves configured with a deliberate 1 day replication lag.
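As an aside, in 5.6 and later such a delay can be configured natively (earlier versions typically rely on an external tool such as pt-slave-delay):

```sql
STOP SLAVE;
CHANGE MASTER TO MASTER_DELAY = 86400;  -- stay 1 day behind the master
START SLAVE;
```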
So if you configure relay_log_recovery = 1 on these servers you DROP all this information and have to pull it from the master again, and that assumes the binary logs are still on the master.
In cases like this dropping these relay logs can be very expensive (time to pull the data from the master again), or can be fatal if the data is no longer on the master.
Suggested fix:
Given the above issues, which happen especially on large systems (one I'm currently working on has 300GB of relay logs as it catches up from being 3 days behind, after the database was cloned to another system), it would be nice if this could be handled in a better way.
One idea comes to mind: on startup, the I/O thread determines the position the SQL thread has executed up to, and then _reads_ and _validates_ the relay logs from that point until it reaches the end of them. If the amount of relay log data is small, this process will not take long. If the amount is large it will take longer, but given that the events do not need to be executed, validating them should be fast. There is an event checksum in 5.6 and later, so probably all that is required is to check that each event matches its checksum and is "complete".
(1) if the relay log information goes "past" the position that the I/O thread thinks it had written to and the events are good, then it should just continue and update its reference to the latest position it is aware of in the relay log files or tables.
(2) when it gets to an invalid or incomplete event then it should:
(2a) report the error and STOP. This is conservative behaviour: no data is dropped.
This at least confirms the data is good up to this point, which may allow the SQL thread to keep working up to the same position.
(2b) report the error, drop the relay logs from this point, and start to write relay logs from this next logical position (in a new relay log file). This behaviour allows recovery to happen as much as possible on the server itself, and reduces the amount of data that must be pulled over from the master.
(3) if the I/O thread gets to the end of the existing relay log files then it should (as now?) create a new one, then connect to the master and start to write as normal.
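The validation pass described above could be sketched roughly as follows in Python. This assumes binlog_checksum=CRC32 (available from 5.6), the standard 19-byte v4 event header with the 4-byte event size at offset 9, and a CRC32 footer on each event; validate_events and its constants are illustrative names, not actual server code:

```python
import struct
import zlib

HEADER_LEN = 19    # v4 binlog event common header
CHECKSUM_LEN = 4   # CRC32 footer when binlog_checksum=CRC32


def validate_events(data, offset=0):
    """Scan a buffer of relay log events starting at `offset`.

    Returns the offset of the first invalid or incomplete event;
    equals len(data) when every event validated cleanly.
    """
    while offset + HEADER_LEN <= len(data):
        # The 4-byte event size sits at bytes 9..12 of the common header.
        (event_size,) = struct.unpack_from("<I", data, offset + 9)
        if (event_size < HEADER_LEN + CHECKSUM_LEN
                or offset + event_size > len(data)):
            return offset  # truncated or implausible event
        event = data[offset:offset + event_size]
        body, stored = event[:-CHECKSUM_LEN], event[-CHECKSUM_LEN:]
        if zlib.crc32(body) & 0xFFFFFFFF != struct.unpack("<I", stored)[0]:
            return offset  # checksum mismatch: corrupt event
        offset += event_size
    # Anything left over is a partial header, i.e. an incomplete event.
    return offset
```

A scan like this never executes events, so it is limited by sequential read speed, which is why even a large backlog should validate reasonably quickly.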
These changes would be functionally equivalent to the current behaviour, would reduce the need to pull unnecessary data from the master, and for situations like those listed above would ensure both that recovery is replication-safe and that we do not lose data.
To enable such behaviour I would suggest that extra values are added to relay_log_recovery, that is:
relay_log_recovery = 2
Scan the existing relay logs and, if any error is found, stop and report it. DO NOT delete any relay log files.
relay_log_recovery = 3
Scan the existing relay logs as far as possible. On finding the first error, report it, truncate the relay log file at that point (and remove any later relay log files), then continue from that position as now (writing to a new relay log file).
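To make the distinction between the two proposed values concrete, here is a hypothetical sketch of how each mode would act on the result of such a scan (apply_recovery_policy and its arguments are invented for illustration, not proposed server code):

```python
def apply_recovery_policy(path, mode, valid_up_to, file_size):
    """Hypothetical handling of the proposed relay_log_recovery values.

    valid_up_to: offset returned by a scan such as validate_events().
    mode 2: stop and report, keeping all files intact.
    mode 3: truncate the bad tail and continue from there.
    """
    if valid_up_to == file_size:
        return "ok"  # whole file validated cleanly, keep going
    if mode == 2:
        return "stop: corrupt event at %s:%d" % (path, valid_up_to)
    if mode == 3:
        with open(path, "r+b") as f:
            f.truncate(valid_up_to)  # drop the bad tail, keep good events
        return "truncated %s at %d" % (path, valid_up_to)
    raise ValueError("unsupported mode")
```

In both modes everything up to the first bad event is preserved, so the SQL thread can still apply it; the modes differ only in whether the server then waits for an operator (2) or resumes writing automatically (3).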
For my systems I would configure delayed slaves with the value of 2 and all servers except delayed slaves with the value of 3.