Bug #31340 Cluster replication failure on simple table with auto_increment
Submitted: 2 Oct 2007 12:22 Modified: 27 Oct 2010 20:06
Reporter: Hartmut Holzgraefe Email Updates:
Status: Can't repeat Impact on me:
None 
Category:MySQL Cluster: Replication Severity:S2 (Serious)
Version:mysql-5.1 OS:Any
Assigned to: Hartmut Holzgraefe CPU Architecture:Any
Tags: 5.1.22

[2 Oct 2007 12:22] Hartmut Holzgraefe
Description:
Setup: master and slave cluster running 2 nodes, 2 replicas, default settings.

On the master a minimal auto_increment table is created and populated with test data, doubling the number of rows in that table with every transaction.

So both the master and slave table should always have a row count that is a power of two.

When going from 32768 to 65536 rows on the master the slave fails to catch up, it doesn't report any error but the number of rows on the slave tables usually ends up in the 36-37k range with the largest auto_increment id in that table at around 44k. Both should be either 32768 or 65536, depending on whether the last replicated transaction was committed successfully on the slave or rolled back

Another oddity is that the master usually (but not always) gets an
  
  ERROR 1297 (HY000): Got temporary error 4010 'Node failure caused abort of transaction' from NDBCLUSTER

on the last transaction, the number of rows doubled by the last transaction though nonetheless so the transaction was actually commited and *not* aborted as claimed ...

I did about 4-5 test runs, only on the first one i didn't get the node failure message (caused by SendBuffer overflow). *Maybe* it was just not reported by the mysqld, although i'm pretty sure both nodes were still up and running at that point, i can't really verify this anymore though ...

How to repeat:
Create a simple master-slave cluster replication setup (2 nodes, 2 replicas, row based replication, all parameters using default settings)

On the master do:

  CREATE TABLE t1 (i INT PRIMARY KEY AUTO_INCREMENT) ENGINE=ndb;

  INSERT INTO t1 VALUES (0); -- first data row using auto_increment

then repeat 

  INSERT INTO t1 SELECT 0 FROM t1; -- double the row count using auto_inc

until the number of rows affected reaches 32768, so that a total of 65536 rows is in the test table.

A "SELECT COUNT(*) FROM t1;" on the master will always show 65536, even when the "node failure, transaction aborted" message appears.

The slave on the other had will only partially apply the last transaction and will so get seriously out of sync with the master. To make things worse there slave does not seem to report the problem in any way so it could go unnoticed for quite a while ...

This is not just a COUNT(*) glitch, a "SELECT * FROM t1 ORDER BY i;" shows the same number of rows and the row gaps in the 32769-65536 range are clearly visible.

Suggested fix:
This may be related to auto_increment, so i will try to create a non-autoincrement test case, too.
[9 Apr 2008 22:40] Andrew Fritz
We recently saw similar issue. While doing a small insert (16 rows), our slave did not receive the updates, but replication continued normally (no error in the mysql client, and no errors in show slave status). The rows appear as expected on the master. They do not appear on slave. We are using auto increment primary keys.
[19 May 2009 12:39] Jonathan Miller
Need to re-verify on latest.
[27 Oct 2010 20:06] Hartmut Holzgraefe
Can't reproduce this one with current releases