MySQL Bugs: #63300: MySQL Cluster keeps crashing

Bug #63300	MySQL Cluster keeps crashing
Submitted:	17 Nov 2011 5:53	Modified:	20 Dec 2011 18:38
Reporter:	Srikrishnan Chitoor
Status:	No Feedback
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S2 (Serious)
Version:	mysql-5.1.56 ndb-7.1.15	OS:	Linux (Cent OS 5.6 - 32 Bit)
Assigned to:		CPU Architecture:	Any
Tags:	2305, crash, error

Description:
MySQL Cluster keeps crashing at random, mostly with "Error: 2305".

How to repeat:
Cannot predictably repeat. There appears to be no pattern: the cluster can run for days without a crash, and then crash twice in a single day.

It also does not seem to be load related.

Suggested fix:
Since I have set StopOnError to "false", the cluster comes back up after each crash, but there is always a brief outage when this happens.

question 1: is this 7.1.15 or 7.1.15a?
  7.1.15 contained a very serious bug.

  if you're using 7.1.15 (without the "a"), please retry with 7.1.15a

else

question 2: the error report contains nothing but your config.ini
(maybe it failed to retrieve the other files)

we're also interested in
ndb_*_cluster.log*
ndb_*_error.log
ndb_*_trace.log.*

/Jonas
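
The three filename patterns requested above can be gathered with a short script. This is only a sketch: the patterns are quoted from the comment, but `find_ndb_logs`, `bundle_ndb_logs`, and the directory argument are hypothetical names, and the location of the data directory on each host is an assumption that has to be checked against the node's own configuration.

```python
import glob
import os
import tarfile

# Filename patterns quoted from the comment above.
LOG_PATTERNS = ["ndb_*_cluster.log*", "ndb_*_error.log", "ndb_*_trace.log.*"]


def find_ndb_logs(data_dir):
    """Return a sorted list of files in data_dir that match the requested patterns."""
    matches = set()
    for pattern in LOG_PATTERNS:
        matches.update(glob.glob(os.path.join(data_dir, pattern)))
    return sorted(matches)


def bundle_ndb_logs(data_dir, archive_path):
    """Pack the matching files into a gzip-compressed tar archive.

    Returns the base names of the files that were added.
    """
    files = find_ndb_logs(data_dir)
    with tarfile.open(archive_path, "w:gz") as archive:
        for path in files:
            # Store only the file name so the archive has no host-specific paths.
            archive.add(path, arcname=os.path.basename(path))
    return [os.path.basename(path) for path in files]
```

Running `bundle_ndb_logs` once per host (management node and each data node) would give one archive per node to attach to the report; an empty return value means no file in that directory matched.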

Thanks for the prompt reply.

I installed 7.1.15a. Please see the output of the "rpm -qa|grep -i mysql" command below:

** START
MySQL-Cluster-gpl-storage-7.1.15a-1.rhel5
MySQL-Cluster-gpl-server-7.1.15a-1.rhel5
MySQL-Cluster-gpl-client-7.1.15a-1.rhel5
MySQL-Cluster-gpl-shared-7.1.15a-1.rhel5
MySQL-Cluster-gpl-devel-7.1.15a-1.rhel5
** END

However, when I run ndb_mgm on the management node and issue a "show", it reports:

mysql-5.1.56 ndb-7.1.15, Nodegroup: 0, Master

I have also attached the full trace and cluster log files to this report.

ndb_*_cluster.log* is still missing...

Added the cluster log from the NDB management server. The data and MySQL nodes do not have any logs matching *cluster*.log.

Looking at the cluster log, you can see sporadic missed heartbeats
that sometimes lead to nodes being voted out of the cluster.
Sometimes the arbitrator is voted out of the cluster,
which turns node failures into cluster failures.
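
One way to see how sporadic the missed heartbeats are is to tally them per node pair. The sketch below assumes that the cluster log reports them in lines shaped like "Node 3: Node 4 missed heartbeat 2"; that line shape, the regular expression, and the sample lines are assumptions for illustration and are not taken from the logs attached to this report.

```python
import re
from collections import Counter

# Assumed shape of a missed-heartbeat line; adjust it to match the real log.
MISSED_RE = re.compile(r"Node (\d+): Node (\d+) missed heartbeat (\d+)")


def count_missed_heartbeats(lines):
    """Count missed-heartbeat warnings per (reporting node, silent node) pair."""
    counts = Counter()
    for line in lines:
        match = MISSED_RE.search(line)
        if match:
            reporter, silent = int(match.group(1)), int(match.group(2))
            counts[(reporter, silent)] += 1
    return counts


# Illustrative lines only; they do not come from the attached cluster log.
sample = [
    "2011-11-17 05:53:01 [MgmtSrvr] WARNING  -- Node 3: Node 4 missed heartbeat 2",
    "2011-11-17 05:53:03 [MgmtSrvr] WARNING  -- Node 3: Node 4 missed heartbeat 3",
    "2011-11-17 05:53:04 [MgmtSrvr] INFO     -- Node 3: Communication to Node 4 opened",
    "2011-11-17 09:10:12 [MgmtSrvr] WARNING  -- Node 4: Node 3 missed heartbeat 2",
]

for (reporter, silent), n in sorted(count_missed_heartbeats(sample).items()):
    print(f"node {reporter} reported {n} missed heartbeat(s) from node {silent}")
```

Passing an open file object for the management server's cluster log in place of `sample` would produce the same tally for the real log, provided the line shape matches.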

It seems that your platform is not real-time enough
or that you run other tasks on them, which sometimes
gives unpredictable response-times to data-nodes.

I suggest you try with
HeartbeatIntervalDbDb=5000
HeartbeatIntervalDbApi=5000

This means that failure detection will be somewhat slower
(e.g. if a machine is hard-rebooted without the processes being killed first),
but the cluster should be much more resilient to temporary latency spikes.

/Jonas
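
The trade-off described above (slower failure detection in exchange for tolerance of latency spikes) can be put into rough numbers. This is a sketch under stated assumptions: the number of consecutive missed heartbeats after which a node is declared dead differs between NDB versions, so `MISSED_BEFORE_DEAD = 4` is a placeholder to be checked against the documentation for the version in use, and the 1500 ms figure is a hypothetical earlier setting, not the reporter's actual configuration.

```python
# Assumption: a node is declared dead after this many consecutive missed
# heartbeats. Check the documentation for the NDB version in use.
MISSED_BEFORE_DEAD = 4


def worst_case_detection_ms(heartbeat_interval_ms, missed_before_dead=MISSED_BEFORE_DEAD):
    """Rough upper bound on the time needed to detect a dead node.

    The node can fail right after sending a heartbeat, so up to one extra
    interval may pass before the first heartbeat is counted as missed.
    """
    return (missed_before_dead + 1) * heartbeat_interval_ms


def tolerated_stall_ms(heartbeat_interval_ms, missed_before_dead=MISSED_BEFORE_DEAD):
    """Rough length of a stall that a node can survive without being declared dead."""
    return missed_before_dead * heartbeat_interval_ms


# 5000 ms is the value suggested above; 1500 ms is a hypothetical earlier value.
for interval in (1500, 5000):
    print(
        f"interval {interval} ms: detection within ~{worst_case_detection_ms(interval)} ms, "
        f"stalls up to ~{tolerated_stall_ms(interval)} ms tolerated"
    )
```

Under these assumptions, raising the interval lengthens both figures by the same factor, which matches the comment above: a hard-rebooted machine takes longer to be noticed, while a data node that stalls briefly is less likely to be voted out.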

Setting status to: Waiting on feedback

I have changed the configuration and restarted the services. So far (36 hours after the change), there have been no issues. I will observe for a week and give feedback.

No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".