Description:
Setup: master and slave cluster running 2 nodes, 2 replicas, default settings.
On the master a minimal auto_increment table is created and populated with test data, doubling the number of rows in that table with every transaction.
So both the master and slave table should always have a row count that is a power of two.
When going from 32768 to 65536 rows on the master the slave fails to catch up, it doesn't report any error but the number of rows on the slave tables usually ends up in the 36-37k range with the largest auto_increment id in that table at around 44k. Both should be either 32768 or 65536, depending on whether the last replicated transaction was committed successfully on the slave or rolled back
Another oddity is that the master usually (but not always) gets an
ERROR 1297 (HY000): Got temporary error 4010 'Node failure caused abort of transaction' from NDBCLUSTER
on the last transaction, the number of rows doubled by the last transaction though nonetheless so the transaction was actually commited and *not* aborted as claimed ...
I did about 4-5 test runs, only on the first one i didn't get the node failure message (caused by SendBuffer overflow). *Maybe* it was just not reported by the mysqld, although i'm pretty sure both nodes were still up and running at that point, i can't really verify this anymore though ...
How to repeat:
Create a simple master-slave cluster replication setup (2 nodes, 2 replicas, row based replication, all parameters using default settings)
On the master do:
CREATE TABLE t1 (i INT PRIMARY KEY AUTO_INCREMENT) ENGINE=ndb;
INSERT INTO t1 VALUES (0); -- first data row using auto_increment
then repeat
INSERT INTO t1 SELECT 0 FROM t1; -- double the row count using auto_inc
until the number of rows affected reaches 32768, so that a total of 65536 rows is in the test table.
A "SELECT COUNT(*) FROM t1;" on the master will always show 65536, even when the "node failure, transaction aborted" message appears.
The slave on the other had will only partially apply the last transaction and will so get seriously out of sync with the master. To make things worse there slave does not seem to report the problem in any way so it could go unnoticed for quite a while ...
This is not just a COUNT(*) glitch, a "SELECT * FROM t1 ORDER BY i;" shows the same number of rows and the row gaps in the 32769-65536 range are clearly visible.
Suggested fix:
This may be related to auto_increment, so i will try to create a non-autoincrement test case, too.