MySQL Bugs: #43667: Maria replication failure in row-based/mixed-mode replication

Bug #43667	Maria replication failure in row-based/mixed-mode replication
Submitted:	16 Mar 2009 8:34	Modified:	24 Apr 2009 13:13
Reporter:	Philip Stoev	Email Updates:
Status:	Duplicate	Impact on me:	None
Category:	MySQL Server: Row Based Replication ( RBR )	Severity:	S2 (Serious)
Version:	6.0.10	OS:	Any
Assigned to:	Luis Soares	CPU Architecture:	Any

Description:
When executing a relatively simple replication workload (the rpl_sys SystemQA test), replication fails with Maria in row-based and mixed mode with the following error:

090309 10:26:17 [ERROR] Slave SQL: Could not execute Delete_rows event on table test.insdel1_tbl; Can't find record in 'insdel1_tbl', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log master-bin.000001, end_log_pos 18855441, Error_code: 1032
090309 10:26:17 [ERROR] Slave: Can't find record in 'insdel1_tbl' Error_code: 1032
090309 10:26:17 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'master-bin.000001' position 18825680

How to repeat:
A test case will be uploaded shortly.

Test case for bug 43667

Attachment: bug43667.zip (application/x-zip-compressed, text), 4.94 KiB.

To reproduce, unpack the ZIP in mysql-test/suite/bug43667 and then run:

MTR_VERSION=1 perl mysql-test-run.pl --start-and-exit  --mysqld=--innodb rpl_alter

mysql -uroot --socket=var/tmp/slave.sock -e 'change master to master_port=9306 , master_host="127.0.0.1" , master_user="root"'

mysql -uroot --socket=var/tmp/slave.sock -e 'start slave'

engine_type=Maria MTR_VERSION=1 perl mysql-test-run.pl --stress --extern --user=root --socket=var/tmp/master.sock --stress-init-file=bug43667_init.txt --stress-test-file=bug43667_run.txt --stress-suite=bug43667 --stress-test-duration=60000 --stress-threads=50

Please disregard all output from the test and the contents of the var/stress directory. Instead, issue SHOW SLAVE STATUS on the slave to detect the issue.

I have tried to simplify the test case as much as possible. The problem was also observed with --stress-threads=10 . The slave will lag behind the master, so the issue will become visible about 30 seconds into the test.

Please let me know if a better test case is required.

Analysis
========

  Conducted with tool provided in the bug report. Verified behavior as mentioned.

Observation
===========

 * Slave stops at position 12271980 with: 

 Could not execute Delete_rows event on table test.insdel1_tbl; Can't find record in 'insdel1_tbl', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log master-bin.000001, end_log_pos 12271980, Error_code: 1032

 * Event at start position = position stopped:

../client/mysqlbinlog -v --start-position=12271980 var/log/master-bin.000001 | less
(...)
### DELETE FROM test.insdel1_tbl
### WHERE
###   @1=895

  => We can find that it is a DELETE where @1=895 ...

 * Checking for inserts into test.insdel1_tbl before position =
   position stopped:

  => Several

 * Checking for inserts into insdel1_tbl, WHERE @1=895, before
   position = position stopped and 895 "key" appears as well:

../client/mysqlbinlog -v --stop-position=12271980 var/log/master-bin.000001 | grep -A 2 "INSERT INTO test.insdel1_tbl" | grep 895

  => n/a

Conclusion
==========

Binlog event ordering seems different from execution history, thence
slave will replay row events in wrong order (DELETE before INSERT),
causing the slave to fail by not finding the row to be deleted.

Regarding the observation above:

The following INSERT appears in the binlog *after* the DELETE event mentioned above.

### INSERT INTO test.insdel1_tbl
### SET
###   @1=895
###   @2=1239987496
###   @3='root@localhost'
###   @4='f2433a1a-2b70-11de-b0ab-002185344e3a'
###   @5=19
###   @6='Going to test RBR for MySQL'

# Test case showing that with non-transactional storage engines and
# autocommit=off, some events are logged in different order regarding
# execution. This results in different execution histories on master
# and slave.

# In what follows, although an unsafe UPDATE statement (UUID) is
# performed in table t1 (in connection conn2), before the DELETE in
# (connection conn1) on the same table, these are ordered differently
# when binlogging.

# The binlogging of the unsafe UPDATE event (row event) does not obey
# the execution order, but instead it obeys the commit order. This
# makes the DELETE event to appear before the unsafe UPDATE, because
# conn2 commits first than conn1.

# Also, note that apparentely, only row events get the BEGIN and
# COMMIT transaction boundaries.

source include/have_log_bin.inc;
source include/have_maria.inc;

connection default;
CREATE TABLE t1 (a text) engine=Maria;
INSERT INTO t1 VALUES ("1");

connect (conn1, 127.0.0.1,root,,);
connect (conn2, 127.0.0.1,root,,);

echo ***** conn1: begin;
connection conn1;
SET AUTOCOMMIT= 0;
BEGIN;

echo ***** conn2: begin and update t1;
connection conn2;
SET AUTOCOMMIT= 0;
BEGIN;
UPDATE t1 SET a = 'a';
UPDATE t1 SET a = UUID();
SELECT * FROM t1;

echo ***** conn1: delete from t1 and commit;
connection conn1;
SELECT * FROM t1;
DELETE FROM t1;
COMMIT;

echo ***** conn2: commit;
connection conn2;
COMMIT;

disconnect conn1;
disconnect conn2;
connection default;

DROP TABLE t1;
source include/show_binlog_events.inc;

exit;

What follows is the dump of the binlog for the execution above for:

=== RBR mode

master-bin.000001	#	Query	#	#	use `test`; CREATE TABLE t1 (a text) engine=Maria
master-bin.000001	#	Query	#	#	use `test`; BEGIN
master-bin.000001	#	Table_map	#	#	table_id: # (test.t1)
master-bin.000001	#	Write_rows	#	#	table_id: # flags: STMT_END_F
master-bin.000001	#	Query	#	#	use `test`; COMMIT
master-bin.000001	#	Query	#	#	use `test`; BEGIN
master-bin.000001	#	Table_map	#	#	table_id: # (test.t1)
master-bin.000001	#	Delete_rows	#	#	table_id: # flags: STMT_END_F
master-bin.000001	#	Query	#	#	use `test`; COMMIT
master-bin.000001	#	Query	#	#	use `test`; BEGIN
master-bin.000001	#	Table_map	#	#	table_id: # (test.t1)
master-bin.000001	#	Update_rows	#	#	table_id: # flags: STMT_END_F
master-bin.000001	#	Table_map	#	#	table_id: # (test.t1)
master-bin.000001	#	Update_rows	#	#	table_id: # flags: STMT_END_F
master-bin.000001	#	Query	#	#	use `test`; COMMIT
master-bin.000001	#	Query	#	#	use `test`; DROP TABLE t1

== Mixed Mode

master-bin.000001	#	Query	#	#	use `test`; CREATE TABLE t1 (a text) engine=Maria
master-bin.000001	#	Query	#	#	use `test`; INSERT INTO t1 VALUES ("1")
master-bin.000001	#	Query	#	#	use `test`; UPDATE t1 SET a = 'a'
master-bin.000001	#	Query	#	#	use `test`; UPDATE t1 SET a = UUID()
master-bin.000001	#	Query	#	#	use `test`; DELETE FROM t1
master-bin.000001	#	Query	#	#	use `test`; DROP TABLE t1

== Statement Mode

show binlog events from <binlog_start>;
Log_name	Pos	Event_type	Server_id	End_log_pos	Info
master-bin.000001	#	Query	#	#	use `test`; CREATE TABLE t1 (a text) engine=Maria
master-bin.000001	#	Query	#	#	use `test`; INSERT INTO t1 VALUES ("1")
master-bin.000001	#	Query	#	#	use `test`; UPDATE t1 SET a = 'a'
master-bin.000001	#	Query	#	#	use `test`; DELETE FROM t1
master-bin.000001	#	Query	#	#	use `test`; BEGIN
master-bin.000001	#	Table_map	#	#	table_id: # (test.t1)
master-bin.000001	#	Update_rows	#	#	table_id: # flags: STMT_END_F
master-bin.000001	#	Query	#	#	use `test`; COMMIT
master-bin.000001	#	Query	#	#	use `test`; DROP TABLE t1

This seems a duplicate of BUG#40278.

Correction in the test case comments, *conn1* commits before *conn2*:

# This makes the DELETE event to appear before the unsafe UPDATE, because
# *conn1* commits first than *conn2*.

As analysis shows, this is duplicate of BUG#40278.