Bug #70963 NDB node failure with no network/system load
Submitted: 20 Nov 2013 15:35  Modified: 29 Mar 2016 12:21
Reporter: Brian Hobson
Status: Not a Bug  Impact on me: None
Category: MySQL Cluster: Cluster (NDB) storage engine  Severity: S1 (Critical)
Version: 7.2.14  OS: Linux (RHEL 5.8 x86_64)
Assigned to: MySQL Verification Team  CPU Architecture: Any
Tags: arbitration error, ndb, unpartitioned cluster

[20 Nov 2013 15:35] Brian Hobson
Description:
I performed a clean install of our NDB cluster and let it run overnight.  The following morning I found that one or both data nodes had died or been forced down due to an error.

- On the first data node, I see the following error:

System error, node killed during node restart by other node (Internal error, programming error or missing error message)
Error 2303
Node 4 killed this node because GCP stop was detected

- On the second data node, I see the following:

Node lost connection to other nodes and can not form a unpartitioned cluster, please investigate if there are error(s) on other node(s) (Arbitration error)
Error:2305
Arbitrator decided to shutdown this node

"Lost connection" sounds like network congestion/failure.  However, this is a lab setup with almost no load on the system or network devices.

Attached is the ndb_error package.

How to repeat:
Not sure; it recurs fairly often, however.
[20 Nov 2013 16:55] MySQL Verification Team
Thank you for the report.

But version 7.2.7 is very old and many bugs have been fixed since then. Please
upgrade to the current version, 7.2.14, and let us know if the issue still
exists.

Also, please see the following manual page regarding "Node <nodeid> killed this node because GCP stop was detected":

http://dev.mysql.com/doc/refman/5.5/en/mysql-cluster-ndbd-definition.html
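The heartbeat intervals and GCP timeouts that manual page covers are set in the `[ndbd default]` section of the management node's config.ini. A minimal sketch of the relevant parameters (the values shown are illustrative, not tuning recommendations for this cluster):

```ini
[ndbd default]
# Interval in ms between heartbeats exchanged by data nodes; a peer
# that misses four heartbeats in a row is declared dead.
HeartbeatIntervalDbDb=1500

# Time in ms allowed for completing an epoch before a GCP stop is
# detected; raising it can mask slow disks but delays failure detection.
TimeBetweenEpochsTimeout=4000
```

Both parameters are real NDB data-node options; loosening them trades faster failure detection for tolerance of slow I/O or scheduling stalls, which is why they should be changed deliberately rather than as a first resort.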

Thanks,
Umesh
[3 Dec 2013 15:39] Brian Hobson
Hi Umesh,

Thanks for your suggestions.  I have since upgraded to 7.2.14 and have run more tests.  I am still seeing an issue where an ndbd node goes down due to missed heartbeats.  I am also seeing various nodes report disconnect/connect messages in the management node log just prior to the ndbd node going down.  The specific log file in question is 'ndb_2_cluster.log.1' at around 04:46:48.  At that time, nodes begin to disconnect, reconnect, and miss heartbeats.  It is odd, since this cluster resides on an isolated network which should have no load at all (except for cron, etc.).

Do you have any idea what could be causing this type of behavior?  Is there any way for me to further troubleshoot this issue?  I will upload an updated error report.
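One way to spot-check the pattern described above is to scan the management node's cluster log for disconnect and heartbeat events around the incident window (the log filename here matches the one mentioned above; the location under the management node's DataDir is an assumption for your setup):

```shell
# Print, with line numbers, every disconnect or missed-heartbeat event
# recorded in the management node's cluster log.
grep -nE 'Disconnected|missed heartbeat' ndb_2_cluster.log
```

Correlating the timestamps on these lines with cron entries in /var/log/cron can confirm or rule out periodic jobs as the trigger.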

Thanks,
Brian
[3 Dec 2013 15:40] Brian Hobson
updated version to 7.2.14 (updated ndb in the cluster)
[29 Mar 2016 12:21] MySQL Verification Team
Hi,

Thanks for your report, but this is not a bug; the cluster is simply sized improperly for your needs. As MCCGE is a real-time database, timings need to be followed, so we would rather shut a node down than have it not work as expected.

The 7.4 release is a bit more forgiving than older ones with regard to this misconfiguration, so you might want to try it, but the best way to size your cluster is to contact MySQL Support; they will help you configure the cluster for exactly what you need.

kind regards
Bogdan Kecman