MySQL Bugs: #22099: Cluster crash

Bug #22099	Cluster crash
Submitted:	7 Sep 2006 23:03	Modified:	12 Sep 2006 2:07
Reporter:	Jason Downing	Email Updates:
Status:	Can't repeat	Impact on me:	None
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S2 (Serious)
Version:	5.1.11	OS:	Linux (debian 2.6.17)
Assigned to:		CPU Architecture:	Any
Tags:	cluster crash, error 2305, forced node shutdown

Description:
Cluster crashed for no apparent reason. Trace logs and config attached.

How to repeat:
Unknown

All tracelogs, errorlogs, clusterlog and config

Attachment: cluster crash.zip (application/zip, text), 107.99 KiB.

Changing to the related category: Cluster.

Not enough information was provided for us to be able to handle this bug. Please re-read the instructions at http://bugs.mysql.com/how-to-report.php

I can see in the logs that the arbitrator decided to shut down both nodes at the same time, but there is nothing in the logs indicating any reason for this.

If you can provide more information, feel free to add it to this bug and change the status back to 'Open'.

Thank you for your interest in MySQL.

Hi,

One additional comment.
Both nodes die due to heartbeat failure.
This might mean that you run some big cron job in the middle of the
  night that, e.g., makes a disk backup or similar.
  This can cause the db nodes to get swapped out, which in turn causes heartbeat failures.

/Jonas
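
To illustrate the failure mode described above, here is a minimal sketch of how a stall (for example, a process being swapped out) turns into a heartbeat failure. It is not the NDB implementation. The function names, the 1500 ms interval, and the threshold of 4 missed intervals are illustrative assumptions, not values read from the attached config; in a real cluster the interval between data nodes is governed by the HeartbeatIntervalDbDb setting.

```python
def missed_intervals(stall_ms, interval_ms):
    """Whole heartbeat intervals that elapse while a node is unresponsive."""
    return stall_ms // interval_ms


def declared_dead(stall_ms, interval_ms, max_missed):
    """True if the stall spans at least max_missed heartbeat intervals.

    max_missed is a hypothetical threshold chosen for illustration.
    """
    return missed_intervals(stall_ms, interval_ms) >= max_missed


if __name__ == "__main__":
    INTERVAL_MS = 1500  # assumed heartbeat interval
    MAX_MISSED = 4      # assumed number of missed intervals before failure
    for stall in (1000, 5000, 6000, 10000):
        dead = declared_dead(stall, INTERVAL_MS, MAX_MISSED)
        print(f"stall of {stall} ms -> declared dead: {dead}")
```

Under these assumed numbers, a pause of only a few seconds on a data node is enough for its peers to treat it as failed, which is why swapping is a plausible cause.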

Hi Jonas,

Thanks for the info about the missed heartbeats. I've considered this carefully for a few days, and this is my conclusion: 

First, there is no cron job running on either data node or on the management node. There may be one on the SQL nodes; I'm not entirely sure, and I can investigate if you like. Both data nodes are dedicated machines running 2.6.17 debian, and the only packages are ntpd, ntp-simple, vsftpd, and whatever debian comes with standard.

The database was under heavy load.

It seems to me that if the data node was only performing mysql functions, then it must be these functions that are causing the machine to miss heartbeats.

The data nodes have 512 MB of RAM each, and data/index memory are set to 250/50 MB. This should leave roughly 200 MB free on the machine (512 - 250 - 50 = 212 MB), to handle only minor tasks.
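
As a rough check of the figures above, the nominal headroom can be computed as follows. This is only a sketch: the function name is hypothetical, the values are assumed to be in MB, and the result is an upper bound, since the operating system and the data node process itself typically use memory beyond the data and index settings.

```python
def headroom_mb(total_ram_mb, data_memory_mb, index_memory_mb):
    """Nominal RAM left after data and index memory are reserved (in MB)."""
    return total_ram_mb - data_memory_mb - index_memory_mb


if __name__ == "__main__":
    # Figures taken from the report: 512 MB RAM, 250 MB data, 50 MB index.
    print(headroom_mb(512, 250, 50))
```

If the real headroom after other allocations is much smaller than this figure, swapping under heavy load becomes more likely.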

My conclusion is that the problem is in mysql, and that unless it is fixed, mysql will never perform properly under heavy load. I suggest that mysql is too busy with database tasks to respond to the heartbeats, and that the best fix is to have mysql assign a higher priority to heartbeat responses.

I would also like to reiterate my previous suggestion that missed heartbeats should be logged on the data nodes in the ndb_out.log files.

I am happy to provide whatever I can to resolve this problem. If you would like me to set up my test cluster, give you access and load it heavily so you can investigate, please give me the word and I will.

Thanks, Jason