Bug #22099 Cluster crash
Submitted: 7 Sep 2006 23:03
Modified: 12 Sep 2006 2:07
Reporter: Jason Downing
Status: Can't repeat
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S2 (Serious)
Version: 5.1.11
OS: Linux (debian 2.6.17)
Assigned to:
CPU Architecture: Any
Tags: cluster crash, error 2305, forced node shutdown

[7 Sep 2006 23:03] Jason Downing
Description:
Cluster crashed for no apparent reason. Trace logs and config attached.

How to repeat:
Unknown
[7 Sep 2006 23:05] Jason Downing
All tracelogs, errorlogs, clusterlog and config

Attachment: cluster crash.zip (application/zip, text), 107.99 KiB.

[8 Sep 2006 2:27] MySQL Verification Team
Changing to the related category: Cluster.
[8 Sep 2006 8:35] Hartmut Holzgraefe
Not enough information was provided for us to be able to handle this bug. Please re-read the instructions at http://bugs.mysql.com/how-to-report.php

I can see in the logs that the arbitrator decided to shut down both nodes at the same time, but there is nothing in the logs indicating any reason for this.

If you can provide more information, feel free to add it to this bug and change the status back to 'Open'.

Thank you for your interest in MySQL.
[8 Sep 2006 9:04] Jonas Oreland
Hi,

One additional comment.
Both nodes die due to heartbeat failure.
This might mean that you run some big cron job in the middle of the
night that, e.g., makes a disk backup or similar.
This can cause the db-nodes to get swapped out, which in turn causes heartbeat failures.

/Jonas
[12 Sep 2006 2:07] Jason Downing
Hi Jonas,

Thanks for the info about the missed heartbeats. I've considered this carefully for a few days, and this is my conclusion: 

First, there is no cron job running on either data node or on the management node. There may be one on the SQL nodes; I'm not entirely sure, and I can investigate if you like. Both data nodes are dedicated machines running 2.6.17 debian, and the only packages are ntpd, ntp-simple, vsftpd, and whatever debian comes with as standard.

The database was under heavy load.

It seems to me that if the data node was only performing mysql functions, then it must be these functions that are causing the machine to miss heartbeats.

The data nodes have 512 MB of RAM each, and data/index memory are set to 250 MB/50 MB. This should leave roughly 200 MB free on each machine, which only has to handle minor tasks.

My conclusion is that the problem is in mysql, and that unless it is fixed, mysql will never perform properly under heavy load. I suggest that mysql is too busy with database tasks to respond to the heartbeats, and that the best fix is to have mysql assign a higher priority to heartbeat responses.

I would also like to reiterate my previous suggestion that missed heartbeats should be logged on the data nodes in the ndb_out.log files.

I am happy to provide whatever I can to resolve this problem. If you would like me to set up my test cluster, give you access and load it heavily so you can investigate, please give me the word and I will.

Thanks, Jason
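
The memory budget described in the comment above (512 MB of RAM, 250 MB for data, 50 MB for index) can be checked with a few lines of arithmetic. This is a minimal sketch, not a statement of how the data node actually allocates memory: the function name `free_memory_mb` and the `other_overhead_mb` parameter are hypothetical, and the 100 MB overhead in the second call is an invented figure used only to show how quickly the headroom shrinks if the data node process and the operating system need memory beyond the data and index allocations.

```python
def free_memory_mb(total_ram_mb, data_memory_mb, index_memory_mb, other_overhead_mb=0):
    """Return the RAM (in MB) left after the data and index allocations.

    other_overhead_mb is a hypothetical allowance for everything else
    (process buffers, the OS itself); the thread gives no figure for it.
    """
    return total_ram_mb - data_memory_mb - index_memory_mb - other_overhead_mb


# Figures taken from the report: 512 MB RAM, 250 MB data, 50 MB index.
headroom = free_memory_mb(512, 250, 50)
print(headroom)

# Same figures with an assumed (not measured) 100 MB of additional overhead.
headroom_with_overhead = free_memory_mb(512, 250, 50, other_overhead_mb=100)
print(headroom_with_overhead)
```

The first result is the upper bound on free memory; any real overhead reduces it, which is relevant to the swapping scenario raised earlier in the thread.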