Bug #40258 Replication failure on Innodb + RBR + UPDATE + ORDER BY
Submitted: 22 Oct 2008 15:39 Modified: 13 Mar 2009 11:05
Reporter: Philip Stoev Email Updates:
Status: Can't repeat Impact on me:
None 
Category:MySQL Server: Row Based Replication ( RBR ) Severity:S2 (Serious)
Version:5.1-rpl, 5.1-bzr OS:Any
Assigned to: Guangbao Ni CPU Architecture:Any

[22 Oct 2008 15:39] Philip Stoev
Description:
When executing a transactional concurrent workload containing UPDATE ... ORDER BY, row-based replication fails with a message:

# 05:01:02 Slave thread has stopped with error: Could not execute Delete_rows event on table test.table10_innodb_int_autoinc; Can't find record in 'table10_innodb_int_autoinc', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log master-bin.000001, end_log_pos 98547

or the slave and the master diverge.

How to repeat:
A test case will be uploaded shortly.
[22 Oct 2008 15:52] Philip Stoev
YY file

Attachment: bug40258.yy (application/octet-stream, text), 405 bytes.

[22 Oct 2008 15:52] Philip Stoev
ZZ file

Attachment: bug40258.zz (text/plain), 106 bytes.

[22 Oct 2008 15:54] Philip Stoev
To reproduce with the random query generator, clone the mysql-test-extra-6.0 tree and execute:

$ cd mysql-test-extra-6.0/mysql-test/gentest/
$ perl runall.pl \
  --basedir=/path/to/5.1-rpl \
  --grammar=conf/rpl_diff.yy \
  --gendata=conf/rpl_diff.zz \
  --rpl_mode=row \
  --engine=Innodb \
  --mysqld=--innodb-lock-wait-timeout=1 \
  --queries=2000

This will start the concurrent workload and will exit with an error once the slave thread stops.

This test uses char(1) columns, however attempts to insert and compare words from the english language, which are slightly longer.
[22 Oct 2008 16:26] Valeriy Kravchuk
Thank you for a bug report. Verified with main 5.1 tree from bzr. Master and slave diverge:

openxs@suse:/home2/openxs/bzr/mysql-test-extra-6.0/mysql-test/gentest> perl runall.pl --basedir=/home2/openxs/dbs/5.1 --rpl_mode=row --gendata=/home2/openxs/bug40258.zz --grammar=/home2/openxs/bug40258.yy --mysqld=--innodb_lock_wait_timeout=1 --engine=InnoDB --queries=2000
# 23:28:23 Please see http://inside.mysql.com/wiki/Category:RandomQueryGenerator for more information on this test framework.
# 23:28:23 Starting
# 23:28:23 # runall.pl \
# 23:28:23 # --basedir=/home2/openxs/dbs/5.1 \
# 23:28:23 # --rpl_mode=row \
# 23:28:23 # --gendata=/home2/openxs/bug40258.zz \
# 23:28:23 # --grammar=/home2/openxs/bug40258.yy \
# 23:28:23 # --mysqld=--innodb_lock_wait_timeout=1 \
# 23:28:23 # --engine=InnoDB \
# 23:28:23 # --queries=2000
# 23:28:23 Running perl mysql-test-run.pl --start-and-exit --mysqld=--innodb --mysqld="--innodb_lock_wait_timeout=1" --mysqld=--default-storage-engine=InnoDB --mysqld=--relay-log=master-relay-bin --mysqld=--loose-innodb --mysqld=--loose-falcon-debug-mask=2 --mysqld=--secure-file-priv= --mysqld=--max-allowed-packet=16Mb --master_port=19306 rpl_alter --slave_port=19308 .
Logging: mysql-test-run.pl --start-and-exit --mysqld=--innodb --mysqld=--innodb_lock_wait_timeout=1 --mysqld=--default-storage-engine=InnoDB --mysqld=--relay-log=master-relay-bin --mysqld=--loose-innodb --mysqld=--loose-falcon-debug-mask=2 --mysqld=--secure-file-priv= --mysqld=--max-allowed-packet=16Mb --master_port=19306 rpl_alter --slave_port=19308
MySQL Version 5.1.30
Using dynamic switching of binlog format
Using default engine 'InnoDB'
Using ndbcluster when necessary, mysqld supports it
Setting mysqld to support SSL connections
Binaries are debug compiled
Using MTR_BUILD_THREAD      = 0
Using MASTER_MYPORT         = 19306
Using MASTER_MYPORT1        = 19307
Using SLAVE_MYPORT          = 19308
Using SLAVE_MYPORT1         = 19309
Using SLAVE_MYPORT2         = 19310
Using NDBCLUSTER_PORT       = 9311
Using NDBCLUSTER_PORT_SLAVE = 9312
Using IM_PORT               = 9313
Using IM_MYSQLD1_PORT       = 9314
Using IM_MYSQLD2_PORT       = 9315
 - adding combinations
Killing Possible Leftover Processes
Removing Stale Files
Creating Directories
Installing Master Database
Installing Slave1 Database
=======================================================

TEST                           RESULT         TIME (ms)
-------------------------------------------------------

Servers started, exiting
# 23:28:33 Establishing replication, mode row ...
# 23:28:34 Starting
# 23:28:34  gentest.pl \
# 23:28:34  --gendata=/home2/openxs/bug40258.zz \
# 23:28:34  --engine=InnoDB \
# 23:28:34  --rpl_mode=row \
# 23:28:34  --threads=10 \
# 23:28:34  --queries=2000 \
# 23:28:34  --duration=3600 \
# 23:28:34  --dsn1=dbi:mysql:host=127.0.0.1:port=19306:user=root:database=test \
# 23:28:34  --grammar=/home2/openxs/bug40258.yy
# 23:28:34 Creating table table10_innodb_int_autoinc .
# 23:28:34 Reporters: ErrorLog, Backtrace
# 23:28:34 Validators: FalconErrors, ErrorMessageCorruption, ReplicationSlaveStatus
# 23:28:34 Starting 10 processes, 2000 queries each, duration 3600 seconds.
# 23:28:45 Started periodic reporting process...
# 23:29:13 Child process completed successfully.
# 23:29:43 Query: UPDATE `table10_innodb_int_autoinc` SET `char` = 'all' WHERE `char_key` < 'no' ORDER BY `char` failed: 1032 Can't find record in 'table10_innodb_int_autoinc'
# 23:29:55 Query: DELETE FROM `table10_innodb_int_autoinc` WHERE `char` < 'something' LIMIT 1 failed: 1205 Lock wait timeout exceeded; try restarting transaction
# 23:29:55 Query: DELETE FROM `table10_innodb_int_autoinc` WHERE `char` < "you're" LIMIT 1 failed: 1205 Lock wait timeout exceeded; try restarting transaction
# 23:29:56 Child process completed successfully.
# 23:30:02 Child process completed successfully.
# 23:30:06 Child process completed successfully.
# 23:30:07 Child process completed successfully.
# 23:30:08 Child process completed successfully.
# 23:30:09 Child process completed successfully.
# 23:30:11 Child process completed successfully.
# 23:30:11 Child process completed successfully.
# 23:30:13 Child process completed successfully.
# 23:30:18 Killing periodic reporting process with pid 4670...
# 23:30:23 Test completed successfully.
# 23:30:23 Waiting for slave to catch up..., file master-bin.000001, pos 882726 .
# 23:30:23 Dumping server on port 19306...
# 23:30:23 Dumping server on port 19308...
# 23:30:23 Comparing SQL dumps...
--- /tmp/server_4614_0.dump     2008-09-25 23:30:23.000000000 +0300
+++ /tmp/server_4614_1.dump     2008-09-25 23:30:23.000000000 +0300
@@ -691,6 +691,7 @@
 INSERT INTO `table10_innodb_int_autoinc` VALUES ('m',996,'o');
 INSERT INTO `table10_innodb_int_autoinc` VALUES ('m',998,'o');
 INSERT INTO `table10_innodb_int_autoinc` VALUES (NULL,10,NULL);
+INSERT INTO `table10_innodb_int_autoinc` VALUES (NULL,1623,'w');
 INSERT INTO `table10_innodb_int_autoinc` VALUES (NULL,2567,'t');
 INSERT INTO `table10_innodb_int_autoinc` VALUES (NULL,2568,'s');
 INSERT INTO `table10_innodb_int_autoinc` VALUES (NULL,2571,'h');
[10 Mar 2009 2:20] Guangbao Ni
I don't know why it alway firstly outputs "failed: 1054 Unknown column '%s' in 'where clause'.Further errors of this kind will be suppressed." error messages.

I use the yy anz zz files attached in the bug.

run it just as followings:

[gni@work-server gentest]$ perl runall.pl --basedir=/home/gni/mysql/bzr/bugteam-5.1/ --grammar=conf/bug40258.yy --gendata=conf/bug40258.zz --rpl_mode=row --mysqld=--innodb-lock-wait-timeout=1 --queries=2000
# 23:12:45 Please see http://inside.mysql.com/wiki/Category:RandomQueryGenerator for more information on this test framework.
# 23:12:45 Copyright (c) 2008 Sun Microsystems, Inc. All rights reserved. Use is subject to license terms.
# 23:12:45 Starting
# 23:12:45 # runall.pl \
# 23:12:45 # --basedir=/home/gni/mysql/bzr/bugteam-5.1/ \
# 23:12:45 # --grammar=conf/bug40258.yy \
# 23:12:45 # --gendata=conf/bug40258.zz \
# 23:12:45 # --rpl_mode=row \
# 23:12:45 # --mysqld=--innodb-lock-wait-timeout=1 \
# 23:12:45 # --queries=2000
# 23:12:46 Running perl mysql-test-run.pl --start-and-exit --mysqld="--innodb-lock-wait-timeout=1" --skip-ndb --mysqld=--relay-log=slave-relay-bin --mysqld=--loose-innodb --mysqld=--loose-falcon-debug-mask=2 --mysqld=--secure-file-priv= --mysqld=--max-allowed-packet=16Mb --mysqld=--loose-innodb-status-file=1 --master_port=19306 rpl_alter --slave_port=19308 2>&1 .
=======================================================
  WARNING: Using mysql-test-run.pl version 1!
=======================================================
Logging: lib/v1/mysql-test-run.pl --start-and-exit --mysqld=--innodb-lock-wait-timeout=1 --skip-ndb --mysqld=--relay-log=slave-relay-bin --mysqld=--loose-innodb --mysqld=--loose-falcon-debug-mask=2 --mysqld=--secure-file-priv= --mysqld=--max-allowed-packet=16Mb --mysqld=--loose-innodb-status-file=1 --master_port=19306 rpl_alter --slave_port=19308
090309 23:12:48 [Warning] Forcing shutdown of 2 plugins
MySQL Version 5.1.33
Using dynamic switching of binlog format
Skipping ndbcluster
Setting mysqld to support SSL connections
Binaries are debug compiled
Using MTR_BUILD_THREAD      = 0
Using MASTER_MYPORT         = 19306
Using MASTER_MYPORT1        = 19307
Using SLAVE_MYPORT          = 19308
Using SLAVE_MYPORT1         = 19309
Using SLAVE_MYPORT2         = 19310
Using IM_PORT               = 9313
Using IM_MYSQLD1_PORT       = 9314
Using IM_MYSQLD2_PORT       = 9315
 - adding combinations
Killing Possible Leftover Processes
Removing Stale Files
Creating Directories
Installing Master Database
Installing Slave1 Database
=======================================================

TEST                           RESULT         TIME (ms)
-------------------------------------------------------

Servers started, exiting
# 23:12:55 Establishing replication, mode row ...
# 23:12:56 Starting
# 23:12:56  gentest.pl \
# 23:12:56  --gendata=conf/bug40258.zz \
# 23:12:56  --rpl_mode=row \
# 23:12:56  --threads=10 \
# 23:12:56  --queries=2000 \
# 23:12:56  --duration=3600 \
# 23:12:56  --dsn1=dbi:mysql:host=127.0.0.1:port=19306:user=root:database=test \
# 23:12:56  --grammar=conf/bug40258.yy
# 23:12:57 Creating table table10_int_autoinc .
# 23:12:57 Reporters: ErrorLog, Backtrace
# 23:12:57 Validators: ErrorMessageCorruption, ReplicationSlaveStatus
# 23:12:57 Starting 10 processes, 2000 queries each, duration 3600 seconds.
# 23:12:57 Query: UPDATE `table10_int_autoinc` SET `char` = 'been' WHERE `char` > e ORDER BY `char` failed: 1054 Unknown column '%s' in 'where clause'.Further errors of this kind will be suppressed.
# 23:12:57 Query: UPDATE `table10_int_autoinc` SET `char` = c WHERE `char_key` = 'my' ORDER BY `char_key` failed: 1054 Unknown column '%s' in 'field list'. Further errors of this kind will be suppressed.
# 23:12:57 Query: UPDATE `table10_int_autoinc` SET `char` = 'he' WHERE `char` < f ORDER BY `char_key` failed: 1054 Unknown column '%s' in 'where clause'. Further errors of this kind will be suppressed.
# 23:12:57 Query: UPDATE `table10_int_autoinc` SET `char` = k WHERE `char` > 'been' ORDER BY `char` failed: 1054 Unknown column '%s' in 'field list'. Further errors of this kind will be suppressed.
# 23:12:58 Started periodic reporting process...
# 23:12:58 Query: DELETE FROM `table10_int_autoinc` WHERE `char_key` = o LIMIT 1 failed: 1054 Unknown column '%s' in 'where clause'. Further errors of this kind will be suppressed.
# 23:12:58 Query: INSERT INTO `table10_int_autoinc` ( `char_key` ) VALUES (g ) failed: 1054 Unknown column '%s' in 'field list'. Further errors of this kind will be suppressed.
# 23:12:58 Query: INSERT INTO `table10_int_autoinc` ( `char` ) VALUES ( d )failed: 1054 Unknown column '%s' in 'field list'. Further errors of this kind will be suppressed.
# 23:12:58 Query: UPDATE `table10_int_autoinc` SET `char_key` = 'would' WHERE `char_key` = l ORDER BY `char` failed: 1054 Unknown column '%s' in 'whereclause'. Further errors of this kind will be suppressed.
# 23:12:59 Query: UPDATE `table10_int_autoinc` SET `char` = j WHERE `char` = m ORDER BY `char` failed: 1054 Unknown column '%s' in 'where clause'. Further errors of this kind will be suppressed.
[10 Mar 2009 7:14] Philip Stoev
Guangbao Ni, this test uses random queries. Therefore, messages such as "unknown column", "column specified twice", and transaction errors are to be expected due to the random nature of the test.

Please ignore all those errors and focus on the replication-related errors, that is, on messages that the slave has stopped or that the master and the slave have diverged.
[10 Mar 2009 7:55] Guangbao Ni
Hello Philip Stoev,

Besides above error messages, many "1213 deadlock found when trying to get lock; try restarting transaction" error messages are outputed.
But at last, the test case can be excuted succesfully from my test.

So I want to know whether you still get the problem with newest source codes of 5.1/6.0-bugteam tree.

Thanks,

/Guangbao
[13 Mar 2009 11:05] Philip Stoev
This bug is no longer repeatable on 5.1-bugteam and 6.0-main.