Bug #21470 Cluster crash for no apparent reason
Submitted: 7 Aug 2006 4:57
Modified: 26 Oct 2006 17:33
Reporter: Jason Downing
Status: No Feedback
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S3 (Non-critical)
Version: 5.0.22
OS: Linux (Debian 2.4.26-386)
Assigned to:
CPU Architecture: Any
Tags: cluster crash

[7 Aug 2006 4:57] Jason Downing
Description:
My 5.0.22 cluster crashed for no apparent reason. I checked the logs but could not find an explanation for it. The cluster was not under any load, except for a regular query we do every second to check our application. The query is a very simple select count(*) from a table.

I have attached the tracelogs and all other relevant files. If there is anything missing let me know and I'll get it for you.

How to repeat:
Not really sure, just run the system I guess.
[7 Aug 2006 4:59] Jason Downing
Data directory minus fs from data node 2

Attachment: data2.tar.gz (application/x-gzip, text), 157.63 KiB.

[7 Aug 2006 5:00] Jason Downing
Data directory minus fs from data node 3

Attachment: data3.tar.gz (application/x-gzip, text), 161.55 KiB.

[7 Aug 2006 5:00] Jason Downing
mysql-cluster directory including config.ini

Attachment: mysql-cluster.tar.gz (application/x-gzip, text), 10.27 KiB.

[7 Aug 2006 12:14] MySQL Verification Team
Changing to Cluster's Category.
[7 Aug 2006 12:24] Jonas Oreland
Hi

The nodes die because they lose their connections to each other (and to ndb_mgmd).

There are lots of missed heartbeats and some mgm printouts in the logs.
This could indicate a network problem, or a temporary peak in load on the machine
(e.g. caused by external backup software).

Could this be the case?

/Jonas
[7 Aug 2006 22:32] Jason Downing
Thanks for the info. I will investigate the network first. The switch is not a high grade unit, so perhaps it is causing the problem. I will get a better one and try that. We do have some regular jobs as well so I will see if one was running when the crash happened.

Is there a way I can decipher the tracelogs myself to find these things out?
[8 Aug 2006 7:54] Jonas Oreland
Hi

Another thing that can cause spurious crashes is swapping.
If your ndbd does not fit in physical RAM, a significant delay is introduced
when it is swapped back in. This can cause the other node(s) to think that a node has
died.

As for deciphering the trace logs: that's a hard task, and we currently don't have
any tool to help.

/Jonas
[8 Aug 2006 23:05] Jason Downing
Here is the output from free, approx the same on both data nodes:

             total       used       free     shared    buffers     cached
Mem:        451528     446976       4552          0      54188      99996
-/+ buffers/cache:     292792     158736
Swap:      1349420     961988     387432

The sticker on the ram says it is 512MB.

Here is my config:

[NDBD DEFAULT]
NoOfReplicas=2
DataMemory=250M
IndexMemory=50M
MaxNoOfAttributes=3000
MaxNoOfConcurrentOperations=1000000
StartFailureTimeout=1000000
StartPartialTimeout=200000
LogLevelStartup=15
LogLevelShutdown=15
LogLevelStatistic=15
LogLevelCheckpoint=15
LogLevelNodeRestart=15
LogLevelConnection=15
LogLevelError=15
LogLevelInfo=15
StopOnError=N

[NDB_MGMD]
hostname=192.168.0.17
datadir=/var/lib/mysql-cluster
Id=1

[NDBD]
hostname=192.168.0.10
datadir=/usr/local/mysql/data
Id=2

[NDBD]
hostname=192.168.0.11
datadir=/usr/local/mysql/data
Id=3

#[MYSQLD]
#hostname=192.168.0.13
#Id=4

[MYSQLD]
hostname=192.168.0.14
Id=4

[MYSQLD]
hostname=192.168.0.15
Id=5

Do you think it would be swapping? The data nodes are not running any other applications.
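
For reference, the arithmetic behind the swapping question can be sketched roughly as below. It uses only the figures from the free output and the config.ini above. The function name memory_summary is made up for illustration, the free values are assumed to be in KiB, and the sum deliberately ignores whatever MaxNoOfConcurrentOperations and other ndbd buffers add on top, so it is a lower bound rather than a real sizing of ndbd.

```python
def memory_summary(total_ram_kib, data_memory_mib, index_memory_mib, swap_used_kib):
    """Compare configured NDB memory with physical RAM (illustrative helper only)."""
    # DataMemory + IndexMemory from config.ini, converted from MiB to KiB.
    configured_kib = (data_memory_mib + index_memory_mib) * 1024
    return {
        "configured_kib": configured_kib,
        # RAM left for the OS and every other ndbd buffer.
        "headroom_kib": total_ram_kib - configured_kib,
        # Share of physical RAM taken by DataMemory + IndexMemory alone.
        "configured_share": configured_kib / total_ram_kib,
        # True when more is swapped out than the machine has RAM.
        "swap_exceeds_ram": swap_used_kib > total_ram_kib,
    }


# Values copied from the free output and [NDBD DEFAULT] section in this report.
summary = memory_summary(
    total_ram_kib=451528,
    data_memory_mib=250,
    index_memory_mib=50,
    swap_used_kib=961988,
)
for key, value in summary.items():
    print(key, value)
```

With these numbers, DataMemory and IndexMemory alone take roughly two thirds of the physical RAM, and the amount swapped out is larger than the RAM itself, which is at least consistent with swapping being a factor.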

Since I have all of the logging turned up to maximum, could I make the comment that it would be very useful if the missed heartbeats were logged? If they were I may have figured this out for myself. Might I suggest it be included in the next release?

Thanks, Jason
[15 Aug 2006 22:38] Jason Downing
I have had another crash. The cluster was running properly for 6 days, then a crash. I have attached the trace logs. I am using a new switch. Could you tell me if there are still missed heartbeats? Could you also comment on my previous post about the ram swapping? Thanks, Jason
[15 Aug 2006 22:40] Jason Downing
Another crash

Attachment: 16-8 crash.zip (application/zip, text), 116.43 KiB.

[26 Sep 2006 17:33] Jonas Oreland
Hi

Sorry for the late reply.
1) Yes, it can be swapping; having 1M concurrent operations consumes some memory.
   You can try setting "LockPagesInMainMemory: Yes" to avoid swapping.
   If ndbd then fails to lock pages, it will output a warning in ndb_X_out.log

2) I did not find the real cause of this failure...

3) Missed heartbeats can be detected from lines like the following (from your cluster log):
2006-07-31 13:44:02 [MgmSrvr] WARNING  -- Node 3: Node 5 missed heartbeat 2
2006-07-31 13:44:04 [MgmSrvr] WARNING  -- Node 2: Node 5 missed heartbeat 2
2006-07-31 13:44:04 [MgmSrvr] WARNING  -- Node 3: Node 5 missed heartbeat 3
2006-07-31 13:44:05 [MgmSrvr] WARNING  -- Node 2: Node 5 missed heartbeat 3
2006-07-31 13:44:06 [MgmSrvr] WARNING  -- Node 3: Node 5 missed heartbeat 4
2006-07-31 13:44:06 [MgmSrvr] ALERT    -- Node 3: Node 5 declared dead due to missed heartbeat
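
A minimal sketch of how such lines could be picked out of a cluster log automatically. The regular expressions are modelled only on the six sample lines above, so other log versions may word these events differently, and the function name summarize_heartbeats is made up for illustration.

```python
import re
from collections import Counter

# Patterns derived from the sample lines in this report (an assumption about the format).
MISSED = re.compile(r"Node (\d+): Node (\d+) missed heartbeat (\d+)")
DEAD = re.compile(r"Node (\d+): Node (\d+) declared dead due to missed heartbeat")


def summarize_heartbeats(lines):
    """Return (missed-heartbeat count per node, set of nodes declared dead)."""
    missed = Counter()
    dead = set()
    for line in lines:
        m = MISSED.search(line)
        if m:
            # group(2) is the node whose heartbeat was missed.
            missed[int(m.group(2))] += 1
            continue
        d = DEAD.search(line)
        if d:
            dead.add(int(d.group(2)))
    return missed, dead


sample = [
    "2006-07-31 13:44:02 [MgmSrvr] WARNING  -- Node 3: Node 5 missed heartbeat 2",
    "2006-07-31 13:44:04 [MgmSrvr] WARNING  -- Node 2: Node 5 missed heartbeat 2",
    "2006-07-31 13:44:04 [MgmSrvr] WARNING  -- Node 3: Node 5 missed heartbeat 3",
    "2006-07-31 13:44:05 [MgmSrvr] WARNING  -- Node 2: Node 5 missed heartbeat 3",
    "2006-07-31 13:44:06 [MgmSrvr] WARNING  -- Node 3: Node 5 missed heartbeat 4",
    "2006-07-31 13:44:06 [MgmSrvr] ALERT    -- Node 3: Node 5 declared dead due to missed heartbeat",
]
missed, dead = summarize_heartbeats(sample)
print("missed heartbeats per node:", dict(missed))
print("nodes declared dead:", sorted(dead))
```

To run it against a real log, pass an open file object instead of the sample list; the function only needs an iterable of lines.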

Please try with LockPagesInMainMemory (which should be set in your config.ini)

/Jonas
[26 Oct 2006 23:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".