MySQL Bugs: #46762: Missing gap-event causes the slave to not stop resulting in data inconsistency

Bug #46762	Missing gap-event causes the slave to not stop resulting in data inconsistency
Submitted:	17 Aug 2009 17:32	Modified:	23 Sep 2009 10:39
Reporter:	Premraj Nallasivampillai
Status:	Closed
Category:	MySQL Server: Documentation	Severity:	S2 (Serious)
Version:	MySQL Cluster 7.0.6	OS:	Linux (RedHat EL 4 update 6 - 32bit)
Assigned to:	Jon Stephens	CPU Architecture:	Any
Tags:	cluster, geo-replication

Description:
During Cluster-to-Cluster async (geo) replication, the master SQL node was disconnected from the network while records were still being inserted into NDB tables via the SQL node. This did not cause the replication slave to stop, due to the absence of a gap event in replication. The master SQL node was then reconnected to the network. The slave IO and SQL threads were running throughout the disconnect and reconnect.

The end result was that the slave NDB database was missing some inserted records and was therefore inconsistent with the master.

OS: Redhat EL 4 update 6 - 32bit
Platform: Intel dual-core
MySQL: Cluster 7.0.6
Used ndbd instead of ndbmtd

How to repeat:
Here is a simple scenario that shows the data loss/inconsistency between master and slave:

=> Three clusters are configured for circular replication: cluster1 ->
cluster2 -> cluster3 -> cluster1.
=> Each cluster has two machines, and each machine runs all three node
types (mgm_node, data_node and sql_node).
=> SQL nodes S1, S3 and S5, one from each of the three clusters, are
configured for circular replication.
=> At this point, circular replication is running fine.
=> My single-threaded client application then started 1000 transactions
(one record per transaction) directly on S1. In the middle of these
transactions, I unplugged the network cable from the machine where S1
is running. The client application stopped, and I plugged the network
cable back in.
=> I see that not all of the records inserted into cluster1 are
replicated to the other clusters. (Say, cluster1 has 213 records while
cluster2 and cluster3 have 205 records; sometimes the difference is larger.)

binlogs, relay logs, my.cnf, etc.

Attachment: mysqld_logs.zip (application/x-zip-compressed, text), 338.84 KiB.

Do you also see this replication problem when using MySQL Server with InnoDB/MyISAM, or is it cluster related?

This cannot happen in a non-cluster setup since the
master is the node/process that stores the rows and will
still binlog them even if it is not connected to the slave.

In a cluster, if the master SQL node is disconnected, rows can
still be inserted into the cluster, but these are not binlogged.
When the master SQL node reconnects to the cluster,
a gap event is supposed to be inserted to inform the slave
to stop, since it is probably out of sync. In an HA setup
one usually has two master SQL nodes and monitors them using
some external clusterware. If one master SQL node fails, this
is detected externally and the slave fails over to the other
master SQL node (using a different binlog/position).
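
The "different binlog/position" step can be pictured as picking, from the
new master's binlog index, the first epoch the slave has not yet applied.
The following is a minimal Python sketch of that selection only. It assumes
the rows have already been read from mysql.ndb_binlog_index on the new
master and the latest applied epoch from mysql.ndb_apply_status on the
slave, and that "first epoch strictly greater than the applied one" is the
right rule for the version in use. The BinlogIndexRow type, the function
name and the sample values are hypothetical and not part of any MySQL API.

```python
from typing import NamedTuple, Optional, Sequence


class BinlogIndexRow(NamedTuple):
    """One (epoch, file, position) row from the new master (hypothetical container)."""
    epoch: int
    file: str
    position: int


def failover_start(rows: Sequence[BinlogIndexRow],
                   latest_applied_epoch: int) -> Optional[BinlogIndexRow]:
    """Return the row with the smallest epoch greater than the slave's
    latest applied epoch, or None if the new master has nothing newer."""
    newer = [r for r in rows if r.epoch > latest_applied_epoch]
    return min(newer, key=lambda r: r.epoch) if newer else None


# Illustrative values only; real epochs and positions come from the servers.
rows = [
    BinlogIndexRow(epoch=100, file="binlog.000001", position=4),
    BinlogIndexRow(epoch=101, file="binlog.000001", position=350),
    BinlogIndexRow(epoch=102, file="binlog.000002", position=4),
]
start = failover_start(rows, latest_applied_epoch=100)
```

If the function returns None, the new master holds no epoch newer than what
the slave has applied, so there is no position to resume from yet.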

I did not find any attached NDB logs from the cluster.
We need to analyze the cluster logs to see if the
SQL node was actually detached from the cluster.
This should be seen in the cluster log if there was
a TCP disconnect, also one can see possible heartbeat
failures, when SQL nodes are not responding.
I will keep analyzing by testing a setup myself, but
we need to check your logs as well.

Please notice these warnings found in the mysqld logs from
all the clusters:

090812 14:10:38 [Warning] NDB: server id set to zero will cause any other mysqld with bin log to log with wrong server id

090812 14:15:42 [Warning] Neither --relay-log nor --relay-log-index were used; so replication may break when this MySQL server acts as a slave and has his hostname changed!! Please use '--relay-log=db7-relay-bin' to avoid this problem.

Replication does not seem to be set up properly.

cluster logs

Attachment: cluster-1.zip (application/x-zip-compressed, text), 363.79 KiB.

The configuration shortcomings are not relevant to this issue.

Please add all logs from the same test run; the oldest cluster logs
seem to be from a run on 2009-08-13, but the mysqld log says 090812.
We need to correlate the moment the cluster detected that the mysqld
was disconnected (or missed heartbeats) with what the mysqld did at
that same moment.
It is important to always attach all logs, to save time in analysis!
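
The date mismatch above is easier to spot once both stamps are in the same
form. A small Python sketch follows, assuming the mysqld error log uses the
"YYMMDD HH:MM:SS" prefix seen in the warnings quoted earlier and that the
cluster log uses a "YYYY-MM-DD HH:MM:SS" prefix. The function names, the
five-minute tolerance and the cluster time of day are illustrative
assumptions; only the 2009-08-13 date comes from the logs.

```python
from datetime import datetime, timedelta


def parse_mysqld_ts(line: str) -> datetime:
    """Parse the leading 'YYMMDD HH:MM:SS' stamp of a mysqld error-log line."""
    return datetime.strptime(line[:15], "%y%m%d %H:%M:%S")


def parse_cluster_ts(line: str) -> datetime:
    """Parse a leading 'YYYY-MM-DD HH:MM:SS' stamp (assumed cluster-log format)."""
    return datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")


def within(a: datetime, b: datetime, tolerance: timedelta) -> bool:
    """True if the two timestamps are no further apart than the tolerance."""
    return abs(a - b) <= tolerance


# Warning line copied from the mysqld log quoted in this report.
mysqld_line = ("090812 14:10:38 [Warning] NDB: server id set to zero will "
               "cause any other mysqld with bin log to log with wrong server id")
mysqld_ts = parse_mysqld_ts(mysqld_line)

# Only the date is known from the report; the time of day is a placeholder.
cluster_ts = parse_cluster_ts("2009-08-13 00:00:00")

same_run = within(mysqld_ts, cluster_ts, timedelta(minutes=5))
```

With these inputs same_run is False, which matches the observation that the
two sets of logs come from different runs.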

This does not appear to be a bug, but the documentation seems
a bit unclear about when the gap event is received and what action
to take in response. Changing category and assigning to docs.

It does not seem that any feedback is requested anymore.
Changing to 'verified' state.

Thank you for your bug report. This issue has been addressed in the documentation. The updated documentation will appear on our website shortly, and will be included in the next release of the relevant products.