Bug #40188 Replication failure on RBR + Innodb + implicit commit + truncate/drop table
Submitted: 20 Oct 2008 19:51 Modified: 21 May 2009 2:21
Reporter: Philip Stoev Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Locking Severity:S2 (Serious)
Version:6.0 OS:Any
Assigned to: Davi Arnaut CPU Architecture:Any

[20 Oct 2008 19:51] Philip Stoev
Description:
When executing a concurrent non-trivial replication workload, row-based replication fails with the following message:

081020 22:45:45 [ERROR] Slave SQL: Error 'Table 'test.v' doesn't exist' on opening tables, Error_code: 1146
081020 22:45:45 [Warning] Slave: Table 'test.v' doesn't exist Error_code: 1146

The workload needs to contain the following:

* SET AUTOCOMMIT = ON|OFF;
* START TRANSACTION|COMMIT|ROLLBACK;
* SET BINLOG_FORMAT = 'ROW'; (though no changes between binlog formats)
* CREATE|DROP|TRUNCATE TABLE;
* INSERT|UPDATE|DELETE;

As noted above, this is not exactly a trivial workload, however if a database layer adds the SET AUTOCOMMIT and SET BINLOG_FORMAT statements, the situation becomes plausible.

How to repeat:
A test case will be uploaded shortly.
[20 Oct 2008 19:52] Philip Stoev
YY file

Attachment: bug40188.yy (application/octet-stream, text), 773 bytes.

[20 Oct 2008 19:52] Philip Stoev
ZZ file

Attachment: bug40188.zz (text/plain), 125 bytes.

[20 Oct 2008 20:00] Philip Stoev
To reproduce with the Random Query Generator, clone the mysql-test-extra-6.0 tree and then execute:

$ cd mysql-test-extra-6.0/mysql-test/gentest
$ perl runall.pl \
  --basedir=/path/to/mysql-5.1/ \
  --rpl_mode=row \
  --gendata=bug40188.zz \
  --grammar=bug40188.yy \
  --mysqld=--innodb_lock_wait_timeout=1 \
  --queries=100000 \
  --duration=1200 \
  --engine=Innodb

This will start a concurrent test and will abort as soon as the slave thread stops.
[20 Oct 2008 20:08] Valeriy Kravchuk
Thank you for a problem report. 

How many threads should I start? I get different errors (Unknown table 'a' with 5 threads, Duplicate entry '5' for key PRIMARY with 2 threads etc).
[20 Oct 2008 20:11] Valeriy Kravchuk
Verified just as described:

openxs@suse:/home2/openxs/bzr/mysql-test-extra-6.0/mysql-test/gentest> perl runall.pl --basedir=/home2/openxs/dbs/5.1 --rpl_mode=row --gendata=/home2/openxs/bug40188.zz --grammar=/home2/openxs/bug40188.yy --mysqld=--innodb_lock_wait_timeout=1 --queries=1000000 --duration=1200 --engine=InnoDB
# 03:17:48 Please see http://inside.mysql.com/wiki/Category:RandomQueryGenerator for more information on this test framework.
# 03:17:48 Starting
# 03:17:48 # runall.pl \
# 03:17:48 # --basedir=/home2/openxs/dbs/5.1 \
# 03:17:48 # --rpl_mode=row \
# 03:17:48 # --gendata=/home2/openxs/bug40188.zz \
# 03:17:48 # --grammar=/home2/openxs/bug40188.yy \
# 03:17:48 # --mysqld=--innodb_lock_wait_timeout=1 \
# 03:17:48 # --queries=1000000 \
# 03:17:48 # --duration=1200 \
# 03:17:48 # --engine=InnoDB
# 03:17:48 Running perl mysql-test-run.pl --start-and-exit --mysqld=--innodb --mysqld="--innodb_lock_wait_timeout=1" --mysqld=--default-storage-engine=InnoDB --mysqld=--relay-log=master-relay-bin --mysqld=--loose-innodb --mysqld=--loose-falcon-debug-mask=2 --mysqld=--secure-file-priv= --mysqld=--max-allowed-packet=16Mb --master_port=19306 rpl_alter --slave_port=19308 .
Logging: mysql-test-run.pl --start-and-exit --mysqld=--innodb --mysqld=--innodb_lock_wait_timeout=1 --mysqld=--default-storage-engine=InnoDB --mysqld=--relay-log=master-relay-bin --mysqld=--loose-innodb --mysqld=--loose-falcon-debug-mask=2 --mysqld=--secure-file-priv= --mysqld=--max-allowed-packet=16Mb --master_port=19306 rpl_alter --slave_port=19308
MySQL Version 5.1.30
Using dynamic switching of binlog format
Using default engine 'InnoDB'
Using ndbcluster when necessary, mysqld supports it
Setting mysqld to support SSL connections
Binaries are debug compiled
Using MTR_BUILD_THREAD      = 0
Using MASTER_MYPORT         = 19306
Using MASTER_MYPORT1        = 19307
Using SLAVE_MYPORT          = 19308
Using SLAVE_MYPORT1         = 19309
Using SLAVE_MYPORT2         = 19310
Using NDBCLUSTER_PORT       = 9311
Using NDBCLUSTER_PORT_SLAVE = 9312
Using IM_PORT               = 9313
Using IM_MYSQLD1_PORT       = 9314
Using IM_MYSQLD2_PORT       = 9315
 - adding combinations
Killing Possible Leftover Processes
Removing Stale Files
Creating Directories
Installing Master Database
Installing Slave1 Database
=======================================================

TEST                           RESULT         TIME (ms)
-------------------------------------------------------

Servers started, exiting
# 03:17:54 Establishing replication, mode row ...
# 03:17:54 Starting
# 03:17:54 # gentest.pl \
# 03:17:54 # --gendata=/home2/openxs/bug40188.zz \
# 03:17:54 # --engine=InnoDB \
# 03:17:54 # --rpl_mode=row \
# 03:17:54 # --threads=10 \
# 03:17:54 # --queries=1000000 \
# 03:17:54 # --duration=1200 \
# 03:17:54 # --dsn1=dbi:mysql:host=127.0.0.1:port=19306:user=root:database=test \
# 03:17:54 # --grammar=/home2/openxs/bug40188.yy
# 03:17:54 Creating table table1_innodb_int_autoinc .
# 03:17:55 Reporters: ErrorLog, Backtrace
# 03:17:55 Validators: FalconErrors, ErrorMessageCorruption, ReplicationSlaveStatus
# 03:17:55 Starting 10 processes, 1000000 queries each, duration 1200 seconds.
# 03:17:55 Started periodic reporting process...
# 03:17:57 Query: CREATE TABLE IF NOT EXISTS d SELECT * FROM `table1_innodb_int_autoinc` ; TRUNCATE TABLE x failed: 1205 Lock wait timeout exceeded; try restarting transaction
# 03:18:01 Query: CREATE TABLE IF NOT EXISTS z SELECT * FROM `e` ; TRUNCATE TABLE n failed: 1205 Lock wait timeout exceeded; try restarting transaction
# 03:18:02 Query: CREATE TABLE IF NOT EXISTS x SELECT * FROM `table1_innodb_int_autoinc` ; DROP TABLE IF EXISTS k failed: 1205 Lock wait timeout exceeded; try restarting transaction
# 03:18:02 Query: CREATE TABLE IF NOT EXISTS v SELECT * FROM `table1_innodb_int_autoinc` ; TRUNCATE TABLE m failed: 1205 Lock wait timeout exceeded; try restarting transaction
# 03:18:04 Query: CREATE TABLE IF NOT EXISTS z SELECT * FROM `table1_innodb_int_autoinc` ; TRUNCATE TABLE i failed: 1205 Lock wait timeout exceeded; try restarting transaction
# 03:18:04 Query: CREATE TABLE IF NOT EXISTS q SELECT * FROM `table1_innodb_int_autoinc` ; TRUNCATE TABLE h failed: 1205 Lock wait timeout exceeded; try restarting transaction
# 03:18:05 Slave thread has stopped with error: Error 'Table 'test.e' doesn't exist' on opening tables
# 03:18:05 Slave thread has stopped with error: Error 'Table 'test.e' doesn't exist' on opening tables
# 03:18:05 Slave thread has stopped with error: Error 'Table 'test.e' doesn't exist' on opening tables
# 03:18:05 Killing periodic reporting process with pid 7319...
# 03:18:05 Slave thread has stopped with error: Error 'Table 'test.e' doesn't exist' on opening tables
# 03:18:05 Slave thread has stopped with error: Error 'Table 'test.e' doesn't exist' on opening tables
# 03:18:05 Slave thread has stopped with error: Error 'Table 'test.e' doesn't exist' on opening tables
# 03:18:05 Slave thread has stopped with error: Error 'Table 'test.e' doesn't exist' on opening tables
# 03:18:05 Slave thread has stopped with error: Error 'Table 'test.e' doesn't exist' on opening tables
# 03:18:05 Slave thread has stopped with error: Error 'Table 'test.e' doesn't exist' on opening tables
# 03:18:05 Slave thread has stopped with error: Error 'Table 'test.e' doesn't exist' on opening tables
# 03:18:10 Killing child process with pid 7328...
# 03:18:10 Killing child process with pid 7323...
# 03:18:10 Killing child process with pid 7320...
# 03:18:10 Killing child process with pid 7335...
# 03:18:10 Killing child process with pid 7325...
# 03:18:10 Killing child process with pid 7333...
# 03:18:10 Killing child process with pid 7334...
# 03:18:10 Killing child process with pid 7331...
# 03:18:10 Killing child process with pid 7322...
# 03:18:10 Test completed with failure status 103.
[20 Oct 2008 20:15] Philip Stoev
The error from this bug report happened at 10 threads for me. Can you please paste the entire text of the other errors you are getting along with how many threads are needed to produce them?
[20 Apr 2009 16:58] Philip Stoev
6.0-main is still affected by the original test case:

# 19:58:06 Slave SQL thread has stopped with error: Error 'Table 'test.m' doesn't exist' on query. Default database: 'test'. Query: 'TRUNCATE TABLE m'
[4 May 2009 14:43] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/73313

2772 Davi Arnaut	2009-05-04
      Bug#40188: Replication failure on RBR + Innodb + implicit commit + truncate/drop table
      
      The problem is that in some cases transaction-long metadata locks
      could be released before the transaction was committed. This could
      happen when a active transaction was ended by a "START TRANSACTION"
      or "BEGIN" statement, in which case the metadata locks would be
      released before the actual commit of the active transaction.
      
      The solution is to defer the release of metadata locks to after the
      transaction has been implicitly committed. No test case is provided
      as the effort to provide one is too disproportional to the size of
      the fix.
     @ sql/transaction.cc
        Commit before transaction-long metadata locks are released.
[6 May 2009 18:52] Konstantin Osipov
The bug is tagged for 5.1, even though it's repeatable only against 6.0.
Please re-triage.
[6 May 2009 20:40] Davi Arnaut
Queued to 6.0-runtime
[16 May 2009 8:41] Bugs System
Pushed into 6.0.12-alpha (revid:alik@sun.com-20090516083402-0avycdy7w6dnn0tv) (version source revid:serg@mysql.com-20090512164111-jz224w5781lhe7wd) (merge vers: 6.0.12-alpha) (pib:6)
[21 May 2009 2:21] Paul DuBois
Noted in 6.0.12 changelog.

If a transaction was implicitly committed by a START TRANSACTION or
BEGIN statement, metadata locks held by the transaction incorrectly
could be released before the commit actually occurred.