Description:
Currently when replicating to a slave using RBR or mixed (that does a row event) and a row event fails to find a matching key/row on the slave the error is not very descriptive:
100527 15:55:53 [ERROR] Slave SQL: Could not execute Update_rows event on table d1.t1; Can't find record in 't1', Error_code: 1032;
handler error HA_ERR_KEY_NOT_FOUND; the event's master log binlog.004706, end_log_pos 608591709, Error_code: 1032
100527 15:55:53 [Warning] Slave: Can't find record in 't1' Error_code: 1032
This requires the DBA to use mysqlbinlog with --verbose and/or --base64-output=DECODE-ROWS options to examine he RBR to deduce which row was not found, it's less obvious with large insert on duplicate key update type events which update hundreds of rows.
Instead of HA_ERR_KEY_NOT_FOUND, there should be a more descriptive error since we know the key value that is missing (presumably in the before image). It would be nice to get errors as you do with duplicate keys indicating the key value or a before image/after image type message:
SBR:
Last_Error: Error 'Duplicate entry '45691' for key 1' on query. Default database: 'db1'. Query: 'INSERT INTO db1t1 (c1,c2,c3) VALUES ('1','2','3')';
RBR:
Last_SQL_Error: Could not execute Write_rows event on table test.t1; Duplicate entry '2' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log mbpro-bin.000001, end_log_pos 2707
Also with INSERT .. ON DUPLICATE KEY UPDATE with hundreds of rows this will translate into hundreds of inserts, updates in one event, one should be able to tell which one of the hundreds caused the problem.
Current logging of the error only provides minimal information to the DBA who needs to access slave problems at time quickly (while updates are piling up on the master) and effectively decide his options of skipping, manual fixing or resyncing to bring replication back online.
How to repeat:
use binlog_format=row (or mixed and get a row event).
create table t1 on master, use multi value insert on duplicate key update, truncate the slave directly, perform another insert..on duplicate key.
mysql> create table t1 (id int primary key);
Query OK, 0 rows affected (0.05 sec)
mysql> insert into t1 (id) values (1),(2),(3),(4) on duplicate key update id=id+10;
Query OK, 4 rows affected (0.00 sec)
Records: 4 Duplicates: 0 Warnings: 0
* on slave *
mysql> delete from t1 where id=2;
Query OK, 1 row affected (0.00 sec)
* Back to master *
mysql> insert into t1 (id) values (1),(2),(3),(4) on duplicate key update id=id+10;
Query OK, 8 rows affected (0.00 sec)
Records: 4 Duplicates: 4 Warnings: 0
mysql> show slave status
<snip>
Last_Errno: 1032
Last_Error: Could not execute Update_rows event on table test.t1; Can't find record in 't1', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mbpro-bin.000001, end_log_pos 1261
Skip_Counter: 0
Error log:
100604 22:02:01 [Note] Slave I/O thread: connected to master 'root@127.0.0.1:3306',replication started in log 'mbpro-bin.000001' at position 756
100604 22:02:22 [ERROR] Slave SQL: Could not execute Update_rows event on table test.t1; Can't find record in 't1', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mbpro-bin.000001, end_log_pos 1261, Error_code: 1032
100604 22:02:22 [Warning] Slave: Can't find record in 't1' Error_code: 1032
100604 22:02:22 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mbpro-bin.000001' position 1082
Examining the binlog will provide you with:
#100604 21:51:49 server id 1 end_log_pos 611 Update_rows: table id 2 flags: STMT_END_F
### UPDATE test.t1
### WHERE
### @1=1 /* INT meta=0 nullable=0 is_null=0 */
### SET
### @1=11 /* INT meta=0 nullable=0 is_null=0 */
### UPDATE test.t1
### WHERE
### @1=2 /* INT meta=0 nullable=0 is_null=0 */
### SET
### @1=12 /* INT meta=0 nullable=0 is_null=0 */
### UPDATE test.t1
### WHERE
### @1=3 /* INT meta=0 nullable=0 is_null=0 */
### SET
### @1=13 /* INT meta=0 nullable=0 is_null=0 */
### UPDATE test.t1
### WHERE
### @1=4 /* INT meta=0 nullable=0 is_null=0 */
### SET
### @1=14 /* INT meta=0 nullable=0 is_null=0 */
# at 611
There is no indicator or way to tell what key is missing (without parsing and checking each of these) so that you can quickly diagnose the failure and determine if you can simply insert the missing row to get replication back up. Yes the slave could still be out of sync but with terrabyte slaves it is often preferably to continue replication temporarily until a resync can be performed.
There is an slave_exec_mode=IDEMPOTENT but this is a bit drastic if a quick insert would suffice.
Suggested fix:
Provide more verbose logging.
Description: Currently when replicating to a slave using RBR or mixed (that does a row event) and a row event fails to find a matching key/row on the slave the error is not very descriptive: 100527 15:55:53 [ERROR] Slave SQL: Could not execute Update_rows event on table d1.t1; Can't find record in 't1', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log binlog.004706, end_log_pos 608591709, Error_code: 1032 100527 15:55:53 [Warning] Slave: Can't find record in 't1' Error_code: 1032 This requires the DBA to use mysqlbinlog with --verbose and/or --base64-output=DECODE-ROWS options to examine he RBR to deduce which row was not found, it's less obvious with large insert on duplicate key update type events which update hundreds of rows. Instead of HA_ERR_KEY_NOT_FOUND, there should be a more descriptive error since we know the key value that is missing (presumably in the before image). It would be nice to get errors as you do with duplicate keys indicating the key value or a before image/after image type message: SBR: Last_Error: Error 'Duplicate entry '45691' for key 1' on query. Default database: 'db1'. Query: 'INSERT INTO db1t1 (c1,c2,c3) VALUES ('1','2','3')'; RBR: Last_SQL_Error: Could not execute Write_rows event on table test.t1; Duplicate entry '2' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log mbpro-bin.000001, end_log_pos 2707 Also with INSERT .. ON DUPLICATE KEY UPDATE with hundreds of rows this will translate into hundreds of inserts, updates in one event, one should be able to tell which one of the hundreds caused the problem. Current logging of the error only provides minimal information to the DBA who needs to access slave problems at time quickly (while updates are piling up on the master) and effectively decide his options of skipping, manual fixing or resyncing to bring replication back online. How to repeat: use binlog_format=row (or mixed and get a row event). create table t1 on master, use multi value insert on duplicate key update, truncate the slave directly, perform another insert..on duplicate key. mysql> create table t1 (id int primary key); Query OK, 0 rows affected (0.05 sec) mysql> insert into t1 (id) values (1),(2),(3),(4) on duplicate key update id=id+10; Query OK, 4 rows affected (0.00 sec) Records: 4 Duplicates: 0 Warnings: 0 * on slave * mysql> delete from t1 where id=2; Query OK, 1 row affected (0.00 sec) * Back to master * mysql> insert into t1 (id) values (1),(2),(3),(4) on duplicate key update id=id+10; Query OK, 8 rows affected (0.00 sec) Records: 4 Duplicates: 4 Warnings: 0 mysql> show slave status <snip> Last_Errno: 1032 Last_Error: Could not execute Update_rows event on table test.t1; Can't find record in 't1', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mbpro-bin.000001, end_log_pos 1261 Skip_Counter: 0 Error log: 100604 22:02:01 [Note] Slave I/O thread: connected to master 'root@127.0.0.1:3306',replication started in log 'mbpro-bin.000001' at position 756 100604 22:02:22 [ERROR] Slave SQL: Could not execute Update_rows event on table test.t1; Can't find record in 't1', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mbpro-bin.000001, end_log_pos 1261, Error_code: 1032 100604 22:02:22 [Warning] Slave: Can't find record in 't1' Error_code: 1032 100604 22:02:22 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mbpro-bin.000001' position 1082 Examining the binlog will provide you with: #100604 21:51:49 server id 1 end_log_pos 611 Update_rows: table id 2 flags: STMT_END_F ### UPDATE test.t1 ### WHERE ### @1=1 /* INT meta=0 nullable=0 is_null=0 */ ### SET ### @1=11 /* INT meta=0 nullable=0 is_null=0 */ ### UPDATE test.t1 ### WHERE ### @1=2 /* INT meta=0 nullable=0 is_null=0 */ ### SET ### @1=12 /* INT meta=0 nullable=0 is_null=0 */ ### UPDATE test.t1 ### WHERE ### @1=3 /* INT meta=0 nullable=0 is_null=0 */ ### SET ### @1=13 /* INT meta=0 nullable=0 is_null=0 */ ### UPDATE test.t1 ### WHERE ### @1=4 /* INT meta=0 nullable=0 is_null=0 */ ### SET ### @1=14 /* INT meta=0 nullable=0 is_null=0 */ # at 611 There is no indicator or way to tell what key is missing (without parsing and checking each of these) so that you can quickly diagnose the failure and determine if you can simply insert the missing row to get replication back up. Yes the slave could still be out of sync but with terrabyte slaves it is often preferably to continue replication temporarily until a resync can be performed. There is an slave_exec_mode=IDEMPOTENT but this is a bit drastic if a quick insert would suffice. Suggested fix: Provide more verbose logging.