MySQL Bugs: #54249: More verbose RBR error logging

Bug #54249	More verbose RBR error logging
Submitted:	5 Jun 2010 2:32	Modified:	6 Jun 2010 17:36
Reporter:	Shannon Wade	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: Row Based Replication ( RBR )	Severity:	S3 (Non-critical)
Version:	5.1	OS:	Any
Assigned to:	Assigned Account	CPU Architecture:	Any

Description:
Currently when replicating to a slave using RBR or mixed (that does a row event) and a row event fails to find a matching key/row on the slave the error is not very descriptive:

100527 15:55:53 [ERROR] Slave SQL: Could not execute Update_rows event on table d1.t1; Can't find record in 't1', Error_code: 1032;
handler error HA_ERR_KEY_NOT_FOUND; the event's master log binlog.004706, end_log_pos 608591709, Error_code: 1032
100527 15:55:53 [Warning] Slave: Can't find record in 't1' Error_code: 1032

This requires the DBA to use mysqlbinlog with  --verbose and/or --base64-output=DECODE-ROWS options to examine he RBR  to deduce which row was not found, it's less obvious with large insert on duplicate key update type events which update hundreds of rows.

Instead of HA_ERR_KEY_NOT_FOUND, there should be a more descriptive error since we know the key value that is missing (presumably in the before image). It would be nice to get errors as you do with duplicate keys indicating the key value or a before image/after image type message:

SBR:
Last_Error: Error 'Duplicate entry '45691' for key 1' on query. Default database: 'db1'. Query: 'INSERT INTO db1t1 (c1,c2,c3) VALUES ('1','2','3')';

RBR:
               Last_SQL_Error: Could not execute Write_rows event on table test.t1; Duplicate entry '2' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log mbpro-bin.000001, end_log_pos 2707

Also with INSERT .. ON DUPLICATE KEY UPDATE with hundreds of rows this will translate into hundreds of inserts, updates in one event, one should be able to tell which one of the hundreds caused the problem. 

Current logging of the error only provides minimal information to the DBA who needs to access slave problems at time quickly (while updates are piling up on the master) and effectively decide his options of skipping, manual fixing or resyncing to bring replication back online.

How to repeat:
use binlog_format=row (or mixed and get a row event). 

create table t1 on master, use multi value insert on duplicate key update, truncate the slave directly, perform another insert..on duplicate key. 

mysql> create table t1 (id int primary key);
Query OK, 0 rows affected (0.05 sec)

mysql> insert into t1 (id) values (1),(2),(3),(4) on duplicate key update id=id+10;
Query OK, 4 rows affected (0.00 sec)
Records: 4  Duplicates: 0  Warnings: 0

* on slave *

mysql> delete from t1 where id=2;
Query OK, 1 row affected (0.00 sec)

* Back to master *

mysql> insert into t1 (id) values (1),(2),(3),(4) on duplicate key update id=id+10;
Query OK, 8 rows affected (0.00 sec)
Records: 4  Duplicates: 4  Warnings: 0

mysql> show slave status

<snip>
                   Last_Errno: 1032
                   Last_Error: Could not execute Update_rows event on table test.t1; Can't find record in 't1', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mbpro-bin.000001, end_log_pos 1261
                 Skip_Counter: 0

Error log:

100604 22:02:01 [Note] Slave I/O thread: connected to master 'root@127.0.0.1:3306',replication started in log 'mbpro-bin.000001' at position 756
100604 22:02:22 [ERROR] Slave SQL: Could not execute Update_rows event on table test.t1; Can't find record in 't1', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mbpro-bin.000001, end_log_pos 1261, Error_code: 1032
100604 22:02:22 [Warning] Slave: Can't find record in 't1' Error_code: 1032
100604 22:02:22 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mbpro-bin.000001' position 1082

Examining the binlog will provide you with:

#100604 21:51:49 server id 1  end_log_pos 611 	Update_rows: table id 2 flags: STMT_END_F
### UPDATE test.t1
### WHERE
###   @1=1 /* INT meta=0 nullable=0 is_null=0 */
### SET
###   @1=11 /* INT meta=0 nullable=0 is_null=0 */
### UPDATE test.t1
### WHERE
###   @1=2 /* INT meta=0 nullable=0 is_null=0 */
### SET
###   @1=12 /* INT meta=0 nullable=0 is_null=0 */
### UPDATE test.t1
### WHERE
###   @1=3 /* INT meta=0 nullable=0 is_null=0 */
### SET
###   @1=13 /* INT meta=0 nullable=0 is_null=0 */
### UPDATE test.t1
### WHERE
###   @1=4 /* INT meta=0 nullable=0 is_null=0 */
### SET
###   @1=14 /* INT meta=0 nullable=0 is_null=0 */
# at 611

There is no indicator or way to tell what key is missing (without parsing and checking each of these) so that you can quickly diagnose the failure and determine if you can simply insert the missing row to get replication back up. Yes the slave could still be out of sync but with terrabyte slaves it is often preferably to continue replication temporarily until a resync can be performed.

There is an slave_exec_mode=IDEMPOTENT but this is a bit drastic if a quick insert would suffice.

Suggested fix:
Provide more verbose logging.

IDEMPOTENT will skip row changes where a row can't be found as mentioned in the docs thus allowing replication to continue, but in addition it should provide some error logging similar to:

http://bugs.mysql.com/bug.php?id=43407

Seems SLAVE_SKIP_ERRORS would be preferable 1032,etc to IDEMPOTENT now that it works with RBR ( i had not noticed that fix):

http://bugs.mysql.com/bug.php?id=39393

Currently nothing is printed to the error log to indicate that any row changes in an event were skipped. Though I understand we do not recommend IDEMPOTENT, however since SQL_SLAVE_SKIP_COUNTER with RBR skips full events and not rows, customers will likely choose to use IDEMPOTENT or just SLAVE_SKIP_ERRORS if they typically have large events which change many rows.

Currently logging is:

SQL_SLAVE_SKIP_COUNTER will log positions according to:
http://bugs.mysql.com/bug.php?id=43407

But skips full events. Allows some investigation as long as the relay logs are available.

SLAVE_EXEC_MODE=IDEMPOTENT
Logs nothing.

The error/warning should probably give enough information to provide a clue as to the pseudo INSERT,UPDATE or DELETE etc which failed and the "affected row number" and "total number of rows" in the list generated by the statement.

SLAVE_SKIP_ERRORS
Logs nothing.

Here the error should provide some information as to what was skipped and optionally log to the error log or a table for further review.

In addition or in some way with the information if available from  'Record original SQL statement in RBR log"

http://bugs.mysql.com/bug.php?id=50935

Basically for large customers who typically fix replication errors until they can afford the time to resync it's important to later be able to figure out what was skipped and perhaps fix or repapply these statements partially or completely. Or simply determine if for the time being it's OK to loose those statements/events/rows. Right now with RBR it's easy to skip problems but figuring out the problem or fixing is tricky and time-consuming.