Bug #20677 | Race condition between table discovery and insert | ||
---|---|---|---|
Submitted: | 24 Jun 2006 17:28 | Modified: | 26 Dec 2006 16:33 |
Reporter: | Kristian Nielsen | Email Updates: | |
Status: | Verified | Impact on me: | |
Category: | MySQL Cluster: Replication | Severity: | S2 (Serious) |
Version: | mysql-5.1 | OS: | Linux (Linux) |
Assigned to: | CPU Architecture: | Any | |
Tags: | 5.1.12 |
[24 Jun 2006 17:28]
Kristian Nielsen
[24 Jun 2006 20:17]
Kristian Nielsen
What happens here is that server1 is supposed to get NDB events from the operations (create, insert) on server2, and inject them into the local binlog. But from the failure, occasionally the insert events do not arrive (or arrive too late perhaps), so the local binlog contains only the CREATE event (from server 2) and the local DROP event, not the INSERT events. The comments in the test seem to imply that the DELETE on server1 should make sure that all events are recived first, but clearly that does not always happen. This could indicate a bug, or it could be that the events are in fact received, but are not yet injected to the local binlog due to local server1 thread scheduling. I would like some cluster expert's opinion on whether this is a real bug or not, though somehow it does not seem right that a DELETE should cause cluster to wait for an acknowledge from binlog injection. And if no such wait for acknowledge, it seems that this is not a bug but just a race in the test case? For now, I will try to make a patch that removes the race in the test case by explicitly waiting for the INSERT events to be injected before checking binlog.
[24 Jun 2006 21:12]
Kristian Nielsen
I tried extending the test case to loop, waiting for the INSERT event. This shows that in the failure case, the INSERT events are not just delayed, they are in fact getting lost! Found something interesting in the master1.err log. In the normal (success) case, it contains this: 060625 0:06:14 [Note] NDB Binlog: Node: 1, subscribe from node 5, Subscriber bitmask 020 060625 0:06:14 [Note] NDB Binlog: Node: 1, subscribe from node 4, Subscriber bitmask 030 060625 0:06:14 [Note] NDB Binlog: Node: 2, subscribe from node 5, Subscriber bitmask 020 060625 0:06:14 [Note] NDB Binlog: Node: 2, subscribe from node 4, Subscriber bitmask 030 060625 0:06:14 [Note] NDB Binlog: CREATE TABLE Event: REPL$test/t2 060625 0:06:14 [Note] NDB Binlog: logging ./test/t2 But in the failure case, it contains instead this: 060624 23:48:56 [Note] NDB Binlog: CREATE TABLE Event: REPL$test/t2 060624 23:48:56 [Note] NDB Binlog: logging ./test/t2 060624 23:48:56 [Note] NDB Binlog: Node: 1, subscribe from node 5, Subscriber bitmask 020 060624 23:48:56 [Note] NDB Binlog: Node: 1, subscribe from node 4, Subscriber bitmask 030 060624 23:48:56 [Note] NDB Binlog: Node: 2, subscribe from node 5, Subscriber bitmask 020 060624 23:48:56 [Note] NDB Binlog: Node: 2, subscribe from node 4, Subscriber bitmask 030 Note that in the failure case, the subscription seems to happen *after* table creation, and in the success case *before* the table creation. This was verified in several success and several failure runs. So it looks like a late subscription causes the insert (but not create) events to get lost.
[25 Jun 2006 7:06]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/8195
[25 Jun 2006 7:12]
Kristian Nielsen
I have committed a patch that starts the test with a dummy table CREATE and DROP, to ensure that event subscription is set up properly. This fixes the test failues (I was able to run the test 300 times in a loop with no failures). Tomas, I need your help to decide whether this is a valid fix (so it is just the testcase that is broken), or whether there is a real bug to be fixed.
[26 Jun 2006 17:19]
Kristian Nielsen
Quoting Tomas: "The real problem is that the insert is allowed before the discovery of the table has completed totally... if only occurs in this special case when you start up a mysqld at the same time as you do a create table..." I will attach the master.err and master1.err log files from when the failure occurs. For now, I have pushed to 5.1.12 the workaround with extra create/drop table at the start of the test. This fixes the test failure. The bug is therefore no longer a P1/Showstopper. It only occurs if one mysqld issues create table simultaneously with the start of another mysqld.
[26 Jun 2006 17:19]
Kristian Nielsen
master error log
Attachment: master.err (application/octet-stream, text), 4.74 KiB.
[26 Jun 2006 17:20]
Kristian Nielsen
master1 error log
Attachment: master1.err (application/octet-stream, text), 3.65 KiB.