MySQL Bugs: #54250: Feature request for RBR auto-fix slave functionality

Bug #54250	Feature request for RBR auto-fix slave functionality
Submitted:	5 Jun 2010 3:18	Modified:	6 Jun 2010 17:20
Reporter:	Shannon Wade	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: Replication	Severity:	S4 (Feature request)
Version:		OS:	Any
Assigned to:	Assigned Account	CPU Architecture:	Any

Description:
In the case of slave duplicate key and row not found errors when using RBR. There is a before and after image of the row on the master. There should be an option to turn on an auto fix slave or recovery mode of some sort type feature that performs:

* If the row is not found, insert that row
* If there is a duplicate key, update that row

Also if possible include some global counters so admins can monitor these issues:
* Include COUNTERS as to how many rows are recovered this way.

Or if slave_exec_mode=IDEMPOTENT is used, there should be counters as to how many errors are skipped ( this may be a separate feature request ) but those combined would provide counters for
"lost data" or "recovered data" according to the mode used.

Still logging the fact there was a HA_ERR_FOUND_DUPP_KEY or HA_ERR_KEY_NOT_FOUND warning (error if auto-fix is not on) and that things are likely if not certainly out of sync anyways.

This should solely be at the admins description. The reasoning is with very large slaves and an out of sync condition occurs, it is sometimes preferable to fix the error and restart replication until a resync can be done.

Current options are slave_exec_mode=idempotent (which skips the errors), ignoring errors or skipping events, or manually comparing and deciding to update.

Seems dangerous but no more so than slave_exec_mode=idempotent, and sometimes admins will fix dup key and key not found errors (which clearly indicate the slave is out of sync) manually anyways to bring the slave back up (and hope for no more) until a resync can be performed.

How to repeat:
NA

Suggested fix:
See description

Please adjust version to reflect 5.6 and 5.7.

Also worth pointing out that if  binlog_row_image = minimal then it may not be possibly to completely recovery automatically, but it might be desirable to fix things as best as possible.

So I would like to see that the auto-repair mode in the case where a full row is not available to complete the repair would also allow a "best effort repair" (not the default), so in the case of a failed update (the row was missing), an insert of the PK + "those columns provided by the update" are also included.  The other columns are not known so can not be handled, and it may be that they have NOT NULL columns which would then need manual intervention (which is fine) but if they do allow NULLs or have defaults then the incomplete update may be considered better than individual manual fixes.

If "best effort repair mode" is enabled then any events handled this way should be counted separately as normal auto-fix mode should safely repair any broken rows (when it can work).

See: bug#71618 for an example.

Shannon opened this FR for me 5 years ago today.
Today I had to work around an issue which this FR would have helped me with.

So it's still valid when 5.6.25 is the current 5.6 version and 5.7.7 rc is the soon to be GA 5.7 version. Nothing has changed.

To be clear I'd like to see:
* a slave_exec_mode=AUTO_REPAIR which does the following in RBR:
* updates which fail due to a missing row trigger an insert, and a counter is incremented.
* deletes which fail due to a missing row do nothing, and a counter is incremented.
* inserts which fail due to a row already existing trigger an update, and a counter is incremented.
* In the 3 cases above replication continues.
* In the case of minimal RBR try to continue if possible. If you can't then stop (e.g. changing an update to an insert but you don't have all the columns and they don't have defaults might mean you can't continue)

In any case also add similar counters for when you're in slave_exec_mode=IDEMPOTENT:
* idempotent_skip_duplicate_row
* idempotent_skip_missing_row
* idempotent_.... other counters here

This solves a real problem and avoids basically switching to a "suck it and hope for the best" IDEMPOTENT mode which make all errors invisible (that's good , you can't see them).

While many people may be able to stop their system, poke them and make them better for others like myself the most important thing is to keep replication flowing to slaves so you don't rely on a single box which "might" die leaving you with nothing, and later I can look at figuring out why stuff broke, and what to do about it, one task would be to get things back in sync again, but all the time while the "MySQL service" is still running. If you can't take downtime you really can't stop systems as that affects clients and applications which expect the database to be there all the time. So fixing errors requires fixing errors on a broken "but working" system until it's no longer broken etc.

I'd be most delighted if this functionality makes it to MySQL as I've been bitten on a number of occasions by RBR breakage and life is not "good" when this happens. This auto-repair functionality would in many cases allow replication to be self-healing as you'd not only be not making the state of the downstream database servers worse (which effectively the IDEMPOTENT mode does) but you'd actually be making them better.

This sounds like a great idea and way better than current implementation!