Bug #47151 Replication failure on highly concurrent DROP DATABASE and other DDL
Submitted: 5 Sep 2009 16:21 Modified: 7 Dec 2009 13:00
Reporter: Philip Stoev
Status: Closed
Category:MySQL Server: Replication Severity:S3 (Non-critical)
Version:6.0-codebase OS:Any
Assigned to: Daogang Qu CPU Architecture:Any

[5 Sep 2009 16:21] Philip Stoev
Description:
When executing a workload that is especially heavy on DROP DATABASE, the slave will abort with the following message:

Last_SQL_Error: Error 'Unknown database 'testdb_N'' on query. Default database: 'test'. Query: 'CREATE TABLE IF NOT EXISTS testdb_N . t1_base_2_N LIKE test . table0_int_autoinc'

In addition, mysqlbinlog will refuse to dump the entire master binary log with the following message:

ERROR: Error in Log_event::read_log_event(): 'Found invalid event in binary log', data_len: 59, event_type: 19

Those issues are also present in 5.1; however, there they can be expected due to the lack of metadata locking. In 5.4, however, those issues should never happen. It seems that a window of opportunity is left for replication failure even with MDL.

How to repeat:
To reproduce with the RQG, please run the attached grammar as follows:

$ perl runall.pl \
  --basedir=/build/bzr/mysql-next-bugfixing \
  --gendata=conf/WL5004_data.zz \
  --threads=20 \
  --rpl_mode=row \
  --queries=1K \
  --duration=20 \
  --grammar=/path/to/grammar/file.yy

Unfortunately the grammar could not be simplified to the bare essentials, so it may contain items that are not strictly required to reproduce the bug. On the other hand, it seems that commands on the master such as SHOW TABLES, which should not normally have an impact on replication, do play a role in triggering this bug. This may, however, be due only to timing oddities and not to the SHOW statements themselves.

Since the slave error message can vary, please continue to run this test with --rpl_mode=row and --rpl_mode=statement until replication is always successful and mysqlbinlog can dump the entire binary log.
[5 Sep 2009 16:23] Philip Stoev
Grammar for bug 47151

Attachment: bug47151.yy (application/octet-stream, text), 23.85 KiB.

[26 Nov 2009 16:36] Philip Stoev
bug47151-large.yy

Attachment: bug47151-large.yy (application/octet-stream, text), 30.88 KiB.

[26 Nov 2009 16:38] Philip Stoev
I just uploaded an MDL-targeted RQG grammar that is particularly productive when it comes to replication failures. To run:

 perl runall.pl \
  --gendata=conf/WL5004_data.zz \
  --rpl_mode=row \
  --duration=60 \
  --queries=100K \
  --basedir=/build/bzr/6.0-codebase-bugfixing \
  --mysqld=--log-output=file \
  --grammar=conf/bug47151-large.yy \
  --mem
[26 Nov 2009 16:56] Philip Stoev
Requesting a re-triage. Since this bug was filed, it was discovered that various replication issues happen at lower concurrencies.

Also, the purpose of the MDL locking is to prevent such issues completely, regardless of the concurrency level and the realism of the scenario. Therefore, please give this bug a higher tag and let's use it to figure out the root issue.
[3 Dec 2009 12:45] Philip Stoev
I meant schema DDL is not protected, so failures on DROP schema are to be expected.
[6 Dec 2009 12:01] Daogang Qu
DROP DATABASE constructs a list of tables to drop, by performing a
read on the filesystem directory.

Obviously, this has a race: if between the scan and the actual
DROP the contents of the directory are changed, DROP DATABASE will
not drop everything, or will try to drop something that might
no longer be there.
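
To make the race above concrete, here is a minimal Python simulation. It is only an illustration and not MySQL server code: the dict stands in for the database directory, and the function names (scan_tables, drop_listed_tables, simulate_race) are hypothetical. The concurrent DDL is replayed at a fixed point between the scan and the drop so that the outcome is deterministic.

```python
# Illustrative simulation only; this is not MySQL server code.
# The dict stands in for the database directory on disk.

def scan_tables(directory):
    """Step 1 of the simplified DROP DATABASE: snapshot the table list."""
    return sorted(directory)


def drop_listed_tables(directory, listed):
    """Step 2: drop what the snapshot named; collect names that went missing."""
    missing = []
    for name in listed:
        if name in directory:
            del directory[name]
        else:
            missing.append(name)
    return missing


def simulate_race():
    directory = {"t1": "data", "t2": "data"}
    listed = scan_tables(directory)   # DROP DATABASE scans the directory

    # Concurrent DDL lands between the scan and the drop.
    del directory["t2"]               # e.g. t2 is renamed into another database
    directory["t3"] = "data"          # e.g. a table is renamed into this database

    missing = drop_listed_tables(directory, listed)
    leftover = sorted(directory)
    return listed, missing, leftover


if __name__ == "__main__":
    listed, missing, leftover = simulate_race()
    print("listed at scan time:", listed)
    print("already gone at drop time:", missing)
    print("left behind after drop:", leftover)
```

In this run the snapshot names t1 and t2, t2 is already gone when the drop reaches it, and t3 survives because it was never listed.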

Offending operations include: CREATE/DROP TRIGGER, ALTER TABLE
db1.t1 RENAME db2.t2, other operations that move directory files
around.

The only solution for the problem is to make sure that DDL
operations take a "scoped" lock, which MySQL doesn't do.

I.e. DROP TABLE or RENAME TABLE needs to take not only an exclusive
lock on the table itself (which it currently does), but also an
intention exclusive lock on the database name.
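
The scoped locking idea can be sketched as a toy lock manager in Python. This is a sketch under stated assumptions, not MySQL's MDL implementation: the class and function names are hypothetical, locks are granted or refused immediately instead of waiting, and it is assumed that DROP DATABASE would take an exclusive (X) lock on the database name, which would conflict with the intention exclusive (IX) locks held by table-level DDL.

```python
# Illustrative sketch only; names and structure are hypothetical,
# not MySQL's metadata locking API.

# (held mode, requested mode) -> may both be held by different owners?
COMPATIBLE = {
    ("IX", "IX"): True,   # two statements may work inside the same database
    ("IX", "X"): False,
    ("X", "IX"): False,
    ("X", "X"): False,
}


class LockManager:
    def __init__(self):
        self.held = {}  # resource name -> list of (owner, mode)

    def try_acquire(self, owner, resource, mode):
        """Grant the lock only if it is compatible with every other owner's lock."""
        for other, other_mode in self.held.get(resource, []):
            if other != owner and not COMPATIBLE[(other_mode, mode)]:
                return False
        self.held.setdefault(resource, []).append((owner, mode))
        return True

    def release_all(self, owner):
        """Release every lock held by this owner."""
        for resource in list(self.held):
            self.held[resource] = [h for h in self.held[resource] if h[0] != owner]
            if not self.held[resource]:
                del self.held[resource]


def lock_for_table_ddl(mgr, owner, db, table):
    """DROP TABLE / RENAME TABLE: IX on the database name, then X on the table."""
    if not mgr.try_acquire(owner, db, "IX"):
        return False
    if not mgr.try_acquire(owner, f"{db}.{table}", "X"):
        mgr.release_all(owner)  # give up everything this owner holds
        return False
    return True


def lock_for_drop_database(mgr, owner, db):
    """DROP DATABASE (assumed): X on the database name."""
    return mgr.try_acquire(owner, db, "X")
```

With this scheme, two sessions can run DDL on different tables of db1 at the same time, but a DROP DATABASE db1 request is refused until both have released their locks, and table-level DDL is refused while DROP DATABASE holds its lock.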
[7 Dec 2009 5:02] Daogang Qu
Hi Philip,
According to the above root cause, bug#47151 should be closed. Do you agree?
[7 Dec 2009 5:39] MySQL Verification Team
Daogang, I can't comment on the MDL stuff, but are you saying it's acceptable to have binlog corruption noted in the bug description too?
[7 Dec 2009 7:35] Daogang Qu
The problem in the bug description will disappear after the above root cause is resolved.
[7 Dec 2009 13:00] Philip Stoev
I am closing this bug. Will open a new one if a replication failure is observed in MDL-controlled operations.