Bug #21494 Master Cluster MySQLD is point of failure that can lead to mismatched slave data
Submitted: 7 Aug 2006 22:16    Modified: 9 Oct 2007 15:43
Reporter: Jonathan Miller      Status: Closed
Category: MySQL Cluster: Replication    Severity: S2 (Serious)
Version: 5.1.12                OS: Linux
Assigned to: Tomas Ulin        CPU Architecture: Any

[7 Aug 2006 22:16] Jonathan Miller
Description:
Hi, here is the setup.

I have a cluster of 4 systems. On two of them I run my data nodes; on the third I run ndb_mgmd and a mysqld for some transactions, and I also allow others to connect using the NDB API for transactions. On the fourth computer I run a mysqld whose only purpose is to replicate to the slave.

One day, "Billy Bob" walks behind the fourth server and accidentally trips over its network cable, unplugging it. A minute or two passes before he notices it is unplugged and plugs it back in. Not thinking anything of it, he does not tell anyone.

The slave does not hit "slave-net-timeout" for 3600 seconds, so it shows no errors and replication continues. The problem is that we are now missing two minutes of transactions that happened on the other mysqld and through the NDB API.

Therefore, the slave is not an exact replica of the master, and no one knows it.

NOTE: This could also happen simply by restarting the mysqld on the fourth host while the cluster is taking transactions on the other mysqld and through the NDB API.
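
The 3600 seconds mentioned above is the slave's slave-net-timeout setting. As a minimal sketch, assuming hypothetical host names and credentials (Python with mysql-connector), one could lower it so a dead link to the master is at least noticed quickly; this does not close the gap in the master's binlog, which is the real problem here:

  # Minimal sketch, hypothetical host and credentials.
  import mysql.connector

  slave = mysql.connector.connect(host="host6", user="repl_admin", password="secret")
  cur = slave.cursor()

  # Show the current timeout (defaults to 3600 seconds).
  cur.execute("SHOW GLOBAL VARIABLES LIKE 'slave_net_timeout'")
  print(cur.fetchone())

  # Time out a silent master link after one minute instead of one hour.
  cur.execute("SET GLOBAL slave_net_timeout = 60")
  slave.close()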

How to repeat:
Use the setup from above and shut down the network card for 2 minutes, or just restart the mysqld on the fourth system.
[9 Aug 2006 12:10] Jonathan Miller
Per Lars' request, the setup that has been tested is as follows.

3 Host Master Cluster
Host #1 NDBD, NDB_MGMD, MySQLD
Host #2 NDBD
Host #3 MySQLD *** Master for Replication ****

3 Host Slave Cluster
Host #4 NDBD, NDB_MGMD
Host #5 NDBD
Host #6 MySQLD *** Slave for Replication ****

Start the TPC-B load against host #1, then shut down the network card on host #3. By logging into ndb_mgm you will see that host #3 is no longer part of the cluster; looking at the slave through host #6, all looks normal. After about 2 or 3 minutes, re-enable the card on host #3. Once the load completes, count the records of each table. Here is what I got the last time I did it:

             Master     Slave 
Account      100,000    37,890
Branch        10,000         0
Teller        20,000         0
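
For reference, a minimal sketch of the consistency check above, assuming hypothetical hosts and credentials, a "tpcb" schema, and the table names from the report (Python with mysql-connector):

  # Minimal sketch: compare TPC-B row counts on the master and slave clusters.
  import mysql.connector

  TABLES = ["account", "branch", "teller"]

  def row_counts(host):
      conn = mysql.connector.connect(host=host, user="repl_admin",
                                     password="secret", database="tpcb")
      cur = conn.cursor()
      counts = {}
      for t in TABLES:
          cur.execute("SELECT COUNT(*) FROM " + t)  # identifiers cannot be bound as parameters
          counts[t] = cur.fetchone()[0]
      conn.close()
      return counts

  master = row_counts("host1")   # any SQL node attached to the master cluster
  slave = row_counts("host6")    # the replication slave
  for t in TABLES:
      flag = "OK" if master[t] == slave[t] else "MISMATCH"
      print(f"{t:10} master={master[t]:>8} slave={slave[t]:>8} {flag}")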
[25 Aug 2006 20:58] Lars Thalmann
As I see it, the problem is that there is currently no good way to
make the slave notice that the log contains a gap due to the fact that
a mysqld has been down for a while.

Normally the master mysqld should be monitored, so that cluster
replication can fail-over to another replication channel, but if this
is not done, then the binlog might contain a gap.

There are some possible solutions for this bug (the "SUMA
subscription" is what mysqld uses to get the internal cluster change
log which it injects into its binlog):

1) Stable SUMA subscription.  Make the mysqld SUMA subscription
   withstand restart of mysqld.  The restarted mysqld needs to "remember"
   the last event binlogged, so that it can resume SUMA subscription on
   the correct epoch.

   The negative with this solution is that it might take too long to
   implement.  Also it is a bit unclear how mysqld would store epoch
   information.

2) Cluster awareness of mysqld failure.  If the mysqld server is
   restarted, then the SUMA subscription needs to be started from scratch
   and the DBA gets informed about the failure, so that he can (manually
   or automatically) switch replication to a different replication
   channel.

   The negative with this solution is that the DBA might still just
   let the slave continue to replicate ignoring the failed mysqld.  Then
   the log will contain gaps and the slave will have too few updates.

3) Slave gap awareness.  Make it possible for the slave to notice that
   there is a gap in the binary log (because the SUMA subscription was
   lost for a while).  If a gap event is received, the slave stops with
   an error message.  It is then up to the DBA to (manually or
   automatically) fail over to a different replication channel.

It seems that if 1 is not feasible, then 3 is the solution to
go for.

To build a replication framework where two replication channels can
have gaps and the slave cluster is able to switch between these
channels to get "a full log", I think we need the gap event anyway,
so this seems like the solution to aim for in the long run.
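
As an illustration of how option 3 might be acted on operationally (this is not the shipped fix), a watchdog could poll the slave and fail over to a standby channel once the SQL thread stops on the gap/incident error. Host names, credentials, the error text matched, and the hard-coded binlog coordinates below are assumptions only; a real NDB fail-over would look the correct starting position up via ndb_apply_status/ndb_binlog_index on the standby master.

  # Minimal sketch: fail over to a standby replication channel when the
  # slave SQL thread stops on a gap/incident error.
  import time
  import mysql.connector

  def slave_status(cur):
      cur.execute("SHOW SLAVE STATUS")
      row = cur.fetchone()
      return dict(zip([d[0] for d in cur.description], row)) if row else {}

  conn = mysql.connector.connect(host="host6", user="repl_admin", password="secret")
  cur = conn.cursor()

  while True:
      st = slave_status(cur)
      stopped = st.get("Slave_SQL_Running") == "No"
      if stopped and "incident" in (st.get("Last_Error") or "").lower():
          # Gap detected: switch to the standby master (second channel).
          # The coordinates here are placeholders for illustration.
          cur.execute("STOP SLAVE")
          cur.execute("CHANGE MASTER TO MASTER_HOST='standby-master', "
                      "MASTER_USER='repl', MASTER_PASSWORD='secret', "
                      "MASTER_LOG_FILE='standby-bin.000001', MASTER_LOG_POS=4")
          cur.execute("START SLAVE")
          break
      time.sleep(10)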
[3 Apr 2007 12:17] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/23665

ChangeSet@1.2543, 2007-04-03 14:31:46+02:00, tomas@whalegate.ndb.mysql.com +3 -0
  Bug #21494 Master Cluster MySQLD is point of failure that can lead to mismatch slave data
  - insert gap event on cluster connect
[3 Apr 2007 12:35] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/23667

ChangeSet@1.2544, 2007-04-03 14:49:57+02:00, tomas@whalegate.ndb.mysql.com +2 -0
    Bug #21494 Master Cluster MySQLD is point of failure that can lead to mismatch slave data
    - insert gap event on cluster connect
[7 Apr 2007 7:01] Bugs System
Pushed into 5.1.18-beta
[10 Apr 2007 12:20] Jon Stephens
Thank you for your bug report. This issue has been committed to our source repository of that product and will be incorporated into the next release.

If necessary, you can access the source repository and build the latest available version, including the bug fix. More information about accessing the source trees is available at

    http://dev.mysql.com/doc/en/installing-source.html

Documented fix in 5.1.18 and telco-6.2.1 changelogs; documented applicable info from WL#3464 in Cluster Replication section of 5.1 Manual.
[15 Sep 2007 17:06] MySQL Verification Team
If I cause the running master mysqld node to disconnect and reconnect to the cluster, by severing the network link, it will add a LOST_EVENTS entry to the binlog as expected.

However, when the master mysqld node crashes or has a normal restart, it will not create the LOST_EVENTS entry in the binlog.  This entry should be added to the binlog at each startup.  Without it, the slave will not know that the master may have missed entries while offline.  The slave will then reconnect to the master and resume replication while missing log entries.
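
A minimal sketch of the check described above, assuming a hypothetical host and credentials, and assuming the LOST_EVENTS text appears in the SHOW BINLOG EVENTS output for the incident entry:

  # Minimal sketch: scan the master's current binlog for a LOST_EVENTS entry.
  import mysql.connector

  conn = mysql.connector.connect(host="host3", user="repl_admin", password="secret")
  cur = conn.cursor()

  cur.execute("SHOW MASTER STATUS")
  binlog_file = cur.fetchone()[0]          # binlog currently being written

  cur.execute("SHOW BINLOG EVENTS IN '%s'" % binlog_file)
  hits = [row for row in cur.fetchall() if "LOST_EVENTS" in str(row)]

  if hits:
      print("Gap marker found in", binlog_file)
      for row in hits:
          print(row)
  else:
      print("No LOST_EVENTS entry in", binlog_file,
            "- a restarted mysqld may be hiding a gap")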
[9 Oct 2007 15:43] MySQL Verification Team
A more recent report of this issue is being handled in Bug #31484.