MySQL Bugs: #21470: Cluster crash for no apparent reason

Bug #21470	Cluster crash for no apparent reason
Submitted:	7 Aug 2006 4:57	Modified:	26 Oct 2006 17:33
Reporter:	Jason Downing
Status:	No Feedback
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S3 (Non-critical)
Version:	5.0.22	OS:	Linux (Debian 2.4.26-386)
Assigned to:		CPU Architecture:	Any
Tags:	cluster crash

Description:
My 5.0.22 cluster crashed for no apparent reason. I checked the logs but could not find an explanation for it. The cluster was not under any load, except for a regular query we do every second to check our application. The query is a very simple select count(*) from a table.

I have attached the tracelogs and all other relevant files. If there is anything missing let me know and I'll get it for you.

How to repeat:
Not really sure, just run the system I guess.

Data directory minus fs from data node 2

Attachment: data2.tar.gz (application/x-gzip, text), 157.63 KiB.

Data directory minus fs from data node 3

Attachment: data3.tar.gz (application/x-gzip, text), 161.55 KiB.

mysql-cluster directory including config.ini

Attachment: mysql-cluster.tar.gz (application/x-gzip, text), 10.27 KiB.

Changing to Cluster's Category.

Hi

The nodes die because they lose connection to each other (and to ndb_mgmd).

There are lots of missed heartbeats and some mgm printouts in the logs.
This could indicate a network problem,
  or a temporary peak in load on the computer
  (e.g. caused by some external backup software).

Could this be the case?

/Jonas

Thanks for the info. I will investigate the network first. The switch is not a high grade unit, so perhaps it is causing the problem. I will get a better one and try that. We do have some regular jobs as well so I will see if one was running when the crash happened.

Is there a way I can decipher the tracelogs myself to find these things out?

Hi

Another thing that can cause spurious crashes is swapping.
If your ndbd does not fit in physical RAM, a significant delay is introduced
  whenever it has to be swapped in. This can cause the other node(s) to think that a node has
  died.

Regarding deciphering the trace logs: that is a hard task, and we currently don't have
  any tool to help.

/Jonas

Here is the output from free, approx the same on both data nodes:

             total       used       free     shared    buffers     cached
Mem:        451528     446976       4552          0      54188      99996
-/+ buffers/cache:     292792     158736
Swap:      1349420     961988     387432

The sticker on the ram says it is 512MB.

Here is my config:

[NDBD DEFAULT]
NoOfReplicas=2
DataMemory=250M
IndexMemory=50M
MaxNoOfAttributes=3000
MaxNoOfConcurrentOperations=1000000
StartFailureTimeout=1000000
StartPartialTimeout=200000
LogLevelStartup=15
LogLevelShutdown=15
LogLevelStatistic=15
LogLevelCheckpoint=15
LogLevelNodeRestart=15
LogLevelConnection=15
LogLevelError=15
LogLevelInfo=15
StopOnError=N

[NDB_MGMD]
hostname=192.168.0.17
datadir=/var/lib/mysql-cluster
Id=1

[NDBD]
hostname=192.168.0.10
datadir=/usr/local/mysql/data
Id=2

[NDBD]
hostname=192.168.0.11
datadir=/usr/local/mysql/data
Id=3

#[MYSQLD]
#hostname=192.168.0.13
#Id=4

[MYSQLD]
hostname=192.168.0.14
Id=4

[MYSQLD]
hostname=192.168.0.15
Id=5

Do you think it would be swapping? The data nodes are not running any other applications.
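The numbers above can be put side by side with a few lines of arithmetic. This is only a rough sketch in Python: the constants are copied from the `free` output and config.ini in this report, `free` is assumed to be reporting KiB (its default), and the function names are made up for illustration. It does not try to estimate what ndbd allocates beyond DataMemory and IndexMemory.

```python
# Figures copied from the `free` output and config.ini in this report.
# `free` reports KiB by default; DataMemory/IndexMemory are given in MiB.
KIB_PER_MIB = 1024

MEM_TOTAL_KIB = 451528     # "Mem: total"
APP_USED_KIB = 292792      # "-/+ buffers/cache: used"
SWAP_USED_KIB = 961988     # "Swap: used"
DATA_MEMORY_MIB = 250      # DataMemory=250M
INDEX_MEMORY_MIB = 50      # IndexMemory=50M


def kib_to_mib(kib):
    """Convert a KiB figure to MiB."""
    return kib / KIB_PER_MIB


def memory_summary():
    """Return the headline numbers as a dict of MiB values."""
    return {
        "ram_total_mib": kib_to_mib(MEM_TOTAL_KIB),
        "ram_used_by_processes_mib": kib_to_mib(APP_USED_KIB),
        "swap_used_mib": kib_to_mib(SWAP_USED_KIB),
        "configured_data_plus_index_mib": DATA_MEMORY_MIB + INDEX_MEMORY_MIB,
    }


if __name__ == "__main__":
    for name, value in memory_summary().items():
        print(f"{name}: {value:.0f}")
```

Read this way, roughly 939 MiB of swap is in use on a machine with about 441 MiB of usable RAM, while DataMemory plus IndexMemory alone come to 300 MiB before any other ndbd buffers are counted.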

Since I have all of the logging turned up to maximum, could I make the comment that it would be very useful if the missed heartbeats were logged? If they were I may have figured this out for myself. Might I suggest it be included in the next release?

Thanks, Jason

I have had another crash. The cluster was running properly for 6 days, then a crash. I have attached the trace logs. I am using a new switch. Could you tell me if there are still missed heartbeats? Could you also comment on my previous post about the ram swapping? Thanks, Jason

Another crash

Attachment: 16-8 crash.zip (application/zip, text), 116.43 KiB.

Hi

Sorry for the late reply.
1) Yes, it can be swapping; having 1M concurrent operations consumes some memory.
   You can try setting "LockPagesInMainMemory: Yes" to avoid swapping.
   If ndbd then fails to lock pages, it will print a warning in ndb_X_out.log.

2) I did not find the real cause of this failure...

3) Missed heartbeats can be detected from lines like the following (from your cluster log):
2006-07-31 13:44:02 [MgmSrvr] WARNING  -- Node 3: Node 5 missed heartbeat 2
2006-07-31 13:44:04 [MgmSrvr] WARNING  -- Node 2: Node 5 missed heartbeat 2
2006-07-31 13:44:04 [MgmSrvr] WARNING  -- Node 3: Node 5 missed heartbeat 3
2006-07-31 13:44:05 [MgmSrvr] WARNING  -- Node 2: Node 5 missed heartbeat 3
2006-07-31 13:44:06 [MgmSrvr] WARNING  -- Node 3: Node 5 missed heartbeat 4
2006-07-31 13:44:06 [MgmSrvr] ALERT    -- Node 3: Node 5 declared dead due to missed heartbeat
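For anyone who wants to pull these lines out of a large cluster log, here is a minimal Python sketch. The regular expressions are written against the exact line format quoted above and nothing else, so they are an assumption about the log format rather than a documented interface. The function name `scan_heartbeats` is hypothetical.

```python
import re
from collections import defaultdict

# Patterns modelled on the cluster log lines quoted in this report.
TIMESTAMP = r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})"
MISSED = re.compile(
    r"^" + TIMESTAMP + r" \[MgmSrvr\] WARNING\s+-- "
    r"Node (?P<reporter>\d+): Node (?P<node>\d+) missed heartbeat (?P<count>\d+)"
)
DEAD = re.compile(
    r"^" + TIMESTAMP + r" \[MgmSrvr\] ALERT\s+-- "
    r"Node (?P<reporter>\d+): Node (?P<node>\d+) declared dead due to missed heartbeat"
)


def scan_heartbeats(lines):
    """Return (max_missed, dead).

    max_missed maps a node id to the highest missed-heartbeat count
    reported for it; dead is a list of (timestamp, reporter, node)
    tuples for every "declared dead" alert.
    """
    max_missed = defaultdict(int)
    dead = []
    for line in lines:
        m = MISSED.match(line)
        if m:
            node = int(m.group("node"))
            max_missed[node] = max(max_missed[node], int(m.group("count")))
            continue
        m = DEAD.match(line)
        if m:
            dead.append((m.group("ts"), int(m.group("reporter")), int(m.group("node"))))
    return dict(max_missed), dead
```

It can be fed an open file, for example `scan_heartbeats(open(path))`, where `path` points at the management server's cluster log; the exact file name and location depend on your installation.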

Please try with LockPagesInMainMemory (which should go in your config.ini).

/Jonas
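Before restarting the data nodes, it may be worth confirming that the setting really ended up in the [NDBD DEFAULT] section of config.ini. The following is a small sketch using Python's standard `configparser`. It assumes the file is plain INI as posted above, and `ndbd_default_setting` is a hypothetical helper name. `strict=False` is needed because the file repeats the [NDBD] and [MYSQLD] section names.

```python
import configparser


def ndbd_default_setting(config_text, name="LockPagesInMainMemory"):
    """Return the value of `name` in [NDBD DEFAULT], or None if it is absent."""
    # strict=False tolerates the repeated [NDBD] / [MYSQLD] sections;
    # interpolation=None keeps values containing '%' from being interpreted.
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read_string(config_text)
    for section in parser.sections():
        if section.upper() == "NDBD DEFAULT":
            # Option lookup is case-insensitive with the default optionxform.
            return parser.get(section, name, fallback=None)
    return None


if __name__ == "__main__":
    with open("config.ini") as fh:  # path is an assumption; adjust as needed
        print(ndbd_default_setting(fh.read()))
```

With the config.ini posted earlier in this report the helper returns None, since the parameter is not set there yet.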

No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".