Bug #46561 stopping half of the cluster kills the whole cluster
Submitted: 5 Aug 2009 8:51 Modified: 20 Sep 2009 8:27
Reporter: Bogdan Kecman
Status: No Feedback
Category: MySQL Cluster: Cluster (NDB) storage engine    Severity: S2 (Serious)
Version: mysql-5.1-telco-6.2    OS: Any
Assigned to: Assigned Account    CPU Architecture: Any
Tags: 6.2, 6.3, 7.0

[5 Aug 2009 8:51] Bogdan Kecman
Description:
Shutting down half of the data nodes often causes the whole cluster to crash (with a missed-heartbeat error).

How to repeat:
1. Set up a cluster with as many data nodes as you can (more nodes give a greater chance of a crash).
2. Stop half of the data nodes at the same time (see the sketch below).

The remaining half of the cluster will crash with a missed-heartbeat error.
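
A minimal sketch of step 2, assuming a hypothetical eight-node cluster in which data nodes 2-5 form one "half"; issuing the STOP commands in parallel from the management client approximates stopping the nodes at the same time:

  # stop one half of the data nodes near-simultaneously
  for node_id in 2 3 4 5; do
      ndb_mgm -e "$node_id STOP" &
  done
  wait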

Suggested fix:
It seems to me that the chance of the cluster crashing in this scenario is directly proportional to the traffic coming from the API nodes and inversely proportional to the CPU speed of the data nodes.

Without any traffic and with "real" computers I was unable to reproduce this problem at all.

With traffic and with "real" computers I was able to reproduce the problem ~10% of the time.

With traffic and with the data nodes running in virtual machines on a single server I managed to reproduce the problem 70% of the time.

This makes me believe that when half of the cluster goes down, the surviving half struggles with the load: at that moment each surviving node must take over as "master" for all the table fragments whose pair node went offline, and from then on it must handle twice the usual traffic. That initial takeover probably consumes so much CPU time that nodes miss heartbeats, and since half of the cluster is already down, the first node that fails the heartbeat check brings down the whole cluster.

Perhaps recognising this scenario and temporarily increasing the heartbeat timeout when it occurs would be a possible solution.
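
As an interim workaround sketch (not the fix in the patch below), the data-node heartbeat interval can be raised in config.ini; HeartbeatIntervalDbDb is the existing parameter that controls it, and the value used here is only an illustration. A data node is declared dead after missing four consecutive heartbeats, so the default of 1500 ms tolerates roughly 6 seconds of starvation:

  [ndbd default]
  # Default is 1500 (ms); 5000 gives the surviving nodes roughly
  # 20 seconds of missed-heartbeat headroom to absorb the takeover.
  HeartbeatIntervalDbDb=5000
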
[7 Aug 2009 13:03] Jonas Oreland
Patch for 6.3.24

Attachment: bug46561-6.3.24.patch (text/x-patch), 9.93 KiB.

[20 Aug 2009 8:27] Jonas Oreland
I can't find your logs?
[20 Sep 2009 23:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".