Bug #21494 | Master Cluster MySQLD is point of failure that can lead to mismatched slave data | ||
---|---|---|---|
Submitted: | 7 Aug 2006 22:16 | Modified: | 9 Oct 2007 15:43 |
Reporter: | Jonathan Miller | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Cluster: Replication | Severity: | S2 (Serious) |
Version: | 5.1.12 | OS: | Linux (Linux) |
Assigned to: | Tomas Ulin | CPU Architecture: | Any |
[7 Aug 2006 22:16]
Jonathan Miller
[9 Aug 2006 12:10]
Jonathan Miller
Per Lars' request, the configuration that has been tested is as follows.

3-host master cluster:
Host #1: NDBD, NDB_MGMD, MySQLD
Host #2: NDBD
Host #3: MySQLD *** Master for Replication ***

3-host slave cluster:
Host #4: NDBD, NDB_MGMD
Host #5: NDBD
Host #6: MySQLD *** Slave for Replication ***

Start TPC-B loading against host #1, then shut down the network card on host #3. Logging into ndb_mgm shows that host #3 is no longer part of the cluster. Looking at the slave through host #6, all looks normal. After about 2 or 3 minutes, enable the card on host #3. Once the load completes, count the records in each table. Here is what I got the last time I did it:

Table | Master | Slave |
---|---|---|
Account | 100,000 | 37,890 |
Branch | 10,000 | 0 |
Teller | 20,000 | 0 |
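The final counting step can be scripted. The following is a minimal sketch in Python that uses the row counts reported above as literal input. The function name `missing_rows` is hypothetical, and in a real run the counts would come from `SELECT COUNT(*)` on each table in the master and slave clusters.

```python
# Row counts taken from the report above; in a real run they would come from
# SELECT COUNT(*) against each table on the master and slave clusters.
master_counts = {"Account": 100_000, "Branch": 10_000, "Teller": 20_000}
slave_counts = {"Account": 37_890, "Branch": 0, "Teller": 0}


def missing_rows(master, slave):
    """Return {table: rows present on the master but absent on the slave}."""
    return {
        table: count - slave.get(table, 0)
        for table, count in master.items()
        if count != slave.get(table, 0)
    }


if __name__ == "__main__":
    for table, diff in missing_rows(master_counts, slave_counts).items():
        print(f"{table}: slave is missing {diff} rows")
```

An empty result means the two clusters agree on every table. Any non-empty result after the load has finished indicates that the slave skipped changes.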
[25 Aug 2006 20:58]
Lars Thalmann
As I see it, the problem is that there is currently no good way to make the slave notice that the log contains a gap because a mysqld has been down for a while. Normally the master mysqld should be monitored, so that cluster replication can fail over to another replication channel, but if this is not done, then the binlog might contain a gap.

There are some possible solutions for this bug (the "SUMA subscription" is what mysqld uses to get the internal cluster change log, which it injects into its binlog):

1) Stable SUMA subscription. Make the mysqld SUMA subscription withstand a restart of mysqld. The restarted mysqld needs to "remember" the last event binlogged, so that it can resume the SUMA subscription at the correct epoch. The drawback of this solution is that it might take too long to implement. It is also a bit unclear how mysqld would store epoch information.

2) Cluster awareness of mysqld failure. If the mysqld server is restarted, then the SUMA subscription needs to be started from scratch and the DBA gets informed about the failure, and can then (manually or automatically) switch replication to a different replication channel. The drawback of this solution is that the DBA might still just let the slave continue to replicate, ignoring the failed mysqld. Then the log will contain gaps and the slave will have too few updates.

3) Slave gap awareness. Make it possible for the slave to notice that there is a gap in the binary log (because the SUMA subscription was lost for a while). If a gap event is received, the slave stops with an error message. It is then up to the DBA to (manually or automatically) fail over to a different replication channel.

It seems that if 1 is not feasible, then 3 is the solution to go for.
To build a replication framework in which two replication channels can each have gaps and the slave cluster can switch between these channels to get "a full log", I think we need the gap event anyway, so this seems like the solution to aim for in the long run.
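The slave-side behaviour of solution 3, together with switching between two channels, can be sketched as a toy model in Python. Everything here is an assumption made for illustration and not the server's implementation: the `Event` type, the `"GAP"` kind, the rule that a gap marker carries the epoch at which the master mysqld resumed logging, and the function names `apply_channel` and `replicate`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    kind: str   # "ROWS" for ordinary changes, "GAP" for a lost-events marker
    epoch: int  # epoch of the change, or the epoch at which logging resumed


class GapDetected(Exception):
    """Raised when the slave reads a gap marker and must stop applying."""

    def __init__(self, last_applied):
        super().__init__(f"gap in binlog after epoch {last_applied}")
        self.last_applied = last_applied


def apply_channel(events, last_applied):
    """Apply events newer than last_applied; stop with an error on a gap."""
    for ev in events:
        if ev.epoch <= last_applied:
            continue  # already applied, possibly via another channel
        if ev.kind == "GAP":
            raise GapDetected(last_applied)
        last_applied = ev.epoch  # stand-in for applying the row changes
    return last_applied


def replicate(channels, last_applied=0):
    """Try each channel in turn, failing over whenever a gap is hit."""
    for events in channels:
        try:
            return apply_channel(events, last_applied)
        except GapDetected as gap:
            last_applied = gap.last_applied  # resume from here on next channel
    raise GapDetected(last_applied)  # every channel had a gap at this point
```

In this model a channel whose gap lies entirely behind the slave's current epoch is still usable, which is what lets two gapped channels together yield "a full log". If every channel has a gap at the same point, the slave stops and reports the last epoch it applied.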
[3 Apr 2007 12:17]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from:

http://lists.mysql.com/commits/23665

ChangeSet@1.2543, 2007-04-03 14:31:46+02:00, tomas@whalegate.ndb.mysql.com +3 -0
Bug #21494 Master Cluster MySQLD is point of failure that can lead to mismatch slave data
- insert gap event on cluster connect
[3 Apr 2007 12:35]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from:

http://lists.mysql.com/commits/23667

ChangeSet@1.2544, 2007-04-03 14:49:57+02:00, tomas@whalegate.ndb.mysql.com +2 -0
Bug #21494 Master Cluster MySQLD is point of failure that can lead to mismatch slave data
- insert gap event on cluster connect
[7 Apr 2007 7:01]
Bugs System
Pushed into 5.1.18-beta
[10 Apr 2007 12:20]
Jon Stephens
Thank you for your bug report. This issue has been committed to our source repository of that product and will be incorporated into the next release.

If necessary, you can access the source repository and build the latest available version, including the bug fix. More information about accessing the source trees is available at http://dev.mysql.com/doc/en/installing-source.html

Documented fix in 5.1.18 and telco-6.2.1 changelogs; documented applicable info from WL#3464 in Cluster Replication section of 5.1 Manual.
[15 Sep 2007 17:06]
MySQL Verification Team
If I cause the running master mysqld node to disconnect from and reconnect to the cluster by severing the network link, it adds a LOST_EVENTS entry to the binlog as expected. However, when the master mysqld node crashes or has a normal restart, it does not create the LOST_EVENTS entry in the binlog. This entry should be added to the binlog at each startup. Without it, the slave will not know that the master may have missed entries while offline. The slave will then reconnect to the master and resume replication while missing log entries.
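The requested behaviour, a marker written at every startup as well as on every reconnect, can be illustrated with a toy model in Python. This is a sketch under stated assumptions, not the server's code: the class `BinlogInjector`, its methods, and the use of a plain list as the persistent binlog are all hypothetical. Only the name LOST_EVENTS comes from the report above.

```python
GAP_MARKER = "LOST_EVENTS"


class BinlogInjector:
    """Toy model of a mysqld process that writes cluster changes to a binlog.

    A new instance stands for a fresh process (crash or normal restart);
    the binlog list stands for the log that persists across restarts.
    """

    def __init__(self, binlog):
        self.binlog = binlog
        self.subscribed = False

    def subscribe(self):
        """Called at startup and on every reconnect to the cluster."""
        # Changes made while this process was not subscribed are unknown to
        # it, so record a marker before any further row events are logged.
        self.binlog.append(GAP_MARKER)
        self.subscribed = True

    def unsubscribe(self):
        """Called when the connection to the cluster is lost."""
        self.subscribed = False

    def on_epoch(self, epoch):
        """Log a cluster change, but only while subscribed."""
        if self.subscribed:
            self.binlog.append(epoch)
```

Because the marker is written inside `subscribe` itself, a restart and a network reconnect go through the same path and neither can resume logging without first recording that events may have been lost.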
[9 Oct 2007 15:43]
MySQL Verification Team
Recent report being handled in Bug #31484