MySQL Bugs: #31340: Cluster replication failure on simple table with auto

Bug #31340	Cluster replication failure on simple table with auto_increment
Submitted:	2 Oct 2007 12:22	Modified:	27 Oct 2010 20:06
Reporter:	Hartmut Holzgraefe	Email Updates:
Status:	Can't repeat	Impact on me:	None
Category:	MySQL Cluster: Replication	Severity:	S2 (Serious)
Version:	mysql-5.1	OS:	Any
Assigned to:	Hartmut Holzgraefe	CPU Architecture:	Any
Tags:	5.1.22

Description:
Setup: master and slave cluster running 2 nodes, 2 replicas, default settings.

On the master a minimal auto_increment table is created and populated with test data, doubling the number of rows in that table with every transaction.

So both the master and slave table should always have a row count that is a power of two.

When going from 32768 to 65536 rows on the master the slave fails to catch up, it doesn't report any error but the number of rows on the slave tables usually ends up in the 36-37k range with the largest auto_increment id in that table at around 44k. Both should be either 32768 or 65536, depending on whether the last replicated transaction was committed successfully on the slave or rolled back

Another oddity is that the master usually (but not always) gets an
  
  ERROR 1297 (HY000): Got temporary error 4010 'Node failure caused abort of transaction' from NDBCLUSTER

on the last transaction, the number of rows doubled by the last transaction though nonetheless so the transaction was actually commited and *not* aborted as claimed ...

I did about 4-5 test runs, only on the first one i didn't get the node failure message (caused by SendBuffer overflow). *Maybe* it was just not reported by the mysqld, although i'm pretty sure both nodes were still up and running at that point, i can't really verify this anymore though ...

How to repeat:
Create a simple master-slave cluster replication setup (2 nodes, 2 replicas, row based replication, all parameters using default settings)

On the master do:

  CREATE TABLE t1 (i INT PRIMARY KEY AUTO_INCREMENT) ENGINE=ndb;

  INSERT INTO t1 VALUES (0); -- first data row using auto_increment

then repeat 

  INSERT INTO t1 SELECT 0 FROM t1; -- double the row count using auto_inc

until the number of rows affected reaches 32768, so that a total of 65536 rows is in the test table.

A "SELECT COUNT(*) FROM t1;" on the master will always show 65536, even when the "node failure, transaction aborted" message appears.

The slave on the other had will only partially apply the last transaction and will so get seriously out of sync with the master. To make things worse there slave does not seem to report the problem in any way so it could go unnoticed for quite a while ...

This is not just a COUNT(*) glitch, a "SELECT * FROM t1 ORDER BY i;" shows the same number of rows and the row gaps in the 32769-65536 range are clearly visible.

Suggested fix:
This may be related to auto_increment, so i will try to create a non-autoincrement test case, too.

We recently saw similar issue. While doing a small insert (16 rows), our slave did not receive the updates, but replication continued normally (no error in the mysql client, and no errors in show slave status). The rows appear as expected on the master. They do not appear on slave. We are using auto increment primary keys.

Need to re-verify on latest.

Can't reproduce this one with current releases