MySQL Bugs: #36101: Row-based replication drops events when using multiple tables

Bug #36101	Row-based replication drops events when using multiple tables
Submitted:	15 Apr 2008 20:21	Modified:	30 Apr 2008 7:25
Reporter:	Eric Day	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Row Based Replication ( RBR )	Severity:	S2 (Serious)
Version:	5.1.23, 6.0.4	OS:	Any
Assigned to:		CPU Architecture:	Any
Tags:	Drop, Event, lost, replication, row, row-based

Description:
I've discovered a bug where if you have two servers running with row-based replication between them, simultaneous updates to the master on different MyISAM tables will cause events to be lost on the slave. I've reproduced this bug on the latest 5.1.* and 6.* versions. many of the details are best described by running the attached test scripts, but here is a summary:

I created two tables on the master (test.a and test.b) and inserted 10,000 rows into each at the same time. After the replication queues flushed, I grabbed a row count from each (master and slave):

MASTER:
mysql> SELECT COUNT(*) FROM test.a;
+----------+
| COUNT(*) |
+----------+
|    10000 |
+----------+
1 row in set (0.00 sec)
mysql> SELECT COUNT(*) FROM test.b;
+----------+
| COUNT(*) |
+----------+
|    10000 |
+----------+
1 row in set (0.00 sec)

SLAVE:
mysql> SELECT COUNT(*) FROM test.a;
+----------+
| COUNT(*) |
+----------+
|     7881 |
+----------+
1 row in set (0.00 sec)
mysql> SELECT COUNT(*) FROM test.b;
+----------+
| COUNT(*) |
+----------+
|     7451 |
+----------+
1 row in set (0.00 sec)

When examining the rows closely, you can see (random) lapses in the AUTO_INCREMENTED keys on the slave, while they are continuous on the master. I checked out the latest source from bit-keeper and built a debugging version of the server, but due to the introduced debugging delay, I was not able to reproduce. This seems to imply that it is time sensitive. Also, when testing with InnoDB tables, I was not able to reproduce. This may be due to the fact that InnoDB can't insert quite as fast as MyISAM (at least on the setup I'm using).

When examining the binlog and relay log on the slave, it appears the relay log has all events (so they are getting pulled from master by slave IO thread just fine), but the slave binlog has missing events. This implies the bug lies somewhere in the slave SQL thread.

How to repeat:
I've scripted up a test to reproduce the bug. It ran it on both Linux and Mac OSX, but should work for any unix variant. There are two scripts:

repl_bug_ins.sh - This just generates the SQL commands to created a table and insert a variable number of rows.

repl_bug.sh - This is the main driver script and needs the first three variables configured. They should be self explanatory from the comments. This will configure and start up a master and slave server, setup replication, perform the parallel inserts (using repl_bug_ins.sh), wait for the servers to sync, and then print some debugging info to show that the bug was reproduced.

Suggested fix:
As I mentioned before, it seems to be a time sensitive bug in the slave SQL thread, so that's where I would begin looking. I'm currently getting up to speed on the internals of the MySQL source code (for other projects, feature additions, and to help with bug fixes such as this) so I should have more details later.

Script to reproduce the bug

Attachment: repl_bug.sh (application/x-sh, text), 4.05 KiB.

Create/Insert script that repl_bug.sh needs

Attachment: repl_bug_ins.sh (application/x-sh, text), 236 bytes.

Thank you for the report.

I can not repeat described behavior in my environment with current development sources. Could you please try with current verion 5.1.24 and inform us if bug still exists.

I just re-ran the tests on my system and it appears to be fixed in 5.1.24. I looked through the changelog and realized it was probably Bug #29020 (sorry I didn't find that the first time around). Thanks for your time!