Bug #63300 MySQL Cluster keeps crashing
Submitted: 17 Nov 2011 5:53
Modified: 20 Dec 2011 18:38
Reporter: Srikrishnan Chitoor
Status: No Feedback
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S2 (Serious)
Version: mysql-5.1.56 ndb-7.1.15
OS: Linux (Cent OS 5.6 - 32 Bit)
Assigned to:
CPU Architecture: Any
Tags: 2305, crash, error

[17 Nov 2011 5:53] Srikrishnan Chitoor
Description:
MySQL Cluster keeps crashing at random, mostly with "Error: 2305".

How to repeat:
Cannot repeat predictably. There appears to be no pattern. There may be no crash for days, and then suddenly there might be 2 crashes in one day.

It also does not seem to be load related.

Suggested fix:
Since I have set StopOnError to "false", the cluster keeps coming back up after each crash, but there is always a brief outage when this happens.
[17 Nov 2011 6:04] Jonas Oreland
question 1: is this 7.1.15 or 7.1.15a
  7.1.15 contained a very serious bug....

  if you're using 7.1.15 (without the "a"), please retry with 7.1.15a

else

question 2: the error report contains nothing but your config.ini
(maybe it failed to retrieve other files)

we're also interested in
ndb_*_cluster.log*
ndb_*_error.log
ndb_*_trace.log.*

/Jonas
[17 Nov 2011 6:17] Srikrishnan Chitoor
Thanks for the prompt reply.

I installed 7.1.15a. Please see the output of the "rpm -qa|grep -i mysql" command below:

** START
MySQL-Cluster-gpl-storage-7.1.15a-1.rhel5
MySQL-Cluster-gpl-server-7.1.15a-1.rhel5
MySQL-Cluster-gpl-client-7.1.15a-1.rhel5
MySQL-Cluster-gpl-shared-7.1.15a-1.rhel5
MySQL-Cluster-gpl-devel-7.1.15a-1.rhel5
** END

However, when I run ndb_mgm on the Management node and do a "show", it shows

mysql-5.1.56 ndb-7.1.15, Nodegroup: 0, Master

I have also attached the full trace and cluster files in here.
[17 Nov 2011 7:18] Jonas Oreland
ndb_*_cluster.log* is still missing...
[17 Nov 2011 7:54] Srikrishnan Chitoor
Added Cluster log from NDB Management server. The Data/MySQL nodes do not have any logs like *cluster*.log
[17 Nov 2011 8:31] Jonas Oreland
Looking at the cluster log, you can see sporadic missed heartbeats
that sometimes lead to nodes being voted out of the cluster;
sometimes the arbitrator is voted out of the cluster,
making node failures become cluster failures.

It seems that your platform is not real-time enough,
or that you run other tasks on the machines, which sometimes
gives unpredictable response times to the data nodes.

I suggest you try with
HeartbeatIntervalDbDb=5000
HeartbeatIntervalDbApi=5000

This means that failure detection will be somewhat slower
(e.g. if a machine is hard-rebooted without killing the processes first),
but the cluster should be much more resilient to temporary latency spikes.
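
As a rough sketch of where these settings would go: both are data node parameters, so they would normally sit in the [ndbd default] section of config.ini on the management node. The section placement and the comments here are assumptions and are not taken from the attached config.ini; the values are in milliseconds.

```ini
[ndbd default]
# Assumed placement: applies to all data nodes.
# Interval (ms) for heartbeats exchanged between data nodes
HeartbeatIntervalDbDb=5000
# Interval (ms) for heartbeats between data nodes and API/SQL nodes
HeartbeatIntervalDbApi=5000
```

In the 7.x series the management server caches its configuration, so an edited config.ini is typically only picked up when ndb_mgmd is restarted with --reload, followed by a rolling restart of the data nodes.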

/Jonas

Setting status to: Waiting on feedback
[19 Nov 2011 2:14] Srikrishnan Chitoor
I have changed the configuration and restarted the services. So far (36 hours after the change), there has been no issue. I will observe for a week and give feedback.
[21 Dec 2011 7:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".