Bug #21470 | Cluster crash for no apparent reason | ||
---|---|---|---|
Submitted: | 7 Aug 2006 4:57 | Modified: | 26 Oct 2006 17:33 |
Reporter: | Jason Downing | Email Updates: | |
Status: | No Feedback | Impact on me: | |
Category: | MySQL Cluster: Cluster (NDB) storage engine | Severity: | S3 (Non-critical) |
Version: | 5.0.22 | OS: | Linux (Debian 2.4.26-386) |
Assigned to: | | CPU Architecture: | Any |
Tags: | cluster crash |
[7 Aug 2006 4:57]
Jason Downing
[7 Aug 2006 4:59]
Jason Downing
Data directory minus fs from data node 2
Attachment: data2.tar.gz (application/x-gzip, text), 157.63 KiB.
[7 Aug 2006 5:00]
Jason Downing
Data directory minus fs from data node 3
Attachment: data3.tar.gz (application/x-gzip, text), 161.55 KiB.
[7 Aug 2006 5:00]
Jason Downing
mysql-cluster directory including config.ini
Attachment: mysql-cluster.tar.gz (application/x-gzip, text), 10.27 KiB.
[7 Aug 2006 12:14]
MySQL Verification Team
Changing to Cluster's Category.
[7 Aug 2006 12:24]
Jonas Oreland
Hi, The nodes die because they lose connection to each other (and to ndb_mgmd). There are lots of missed heartbeats and some mgm printouts in the logs. This could indicate a network problem, or a temporary peak in load on the computer (e.g. caused by some external backup software). Can this be the case? /Jonas
[7 Aug 2006 22:32]
Jason Downing
Thanks for the info. I will investigate the network first. The switch is not a high grade unit, so perhaps it is causing the problem. I will get a better one and try that. We do have some regular jobs as well so I will see if one was running when the crash happened. Is there a way I can decipher the tracelogs myself to find these things out?
[8 Aug 2006 7:54]
Jonas Oreland
Hi, Another thing that can cause spurious crashes is swapping. If your ndbd does not fit in physical RAM, a significant delay is introduced when it is swapped back in. This can cause other node(s) to think that a node has died. Regarding deciphering the trace logs... that's a hard task, and we currently don't have any tool to help. /Jonas
[8 Aug 2006 23:05]
Jason Downing
Here is the output from free, approx the same on both data nodes:

                 total       used       free     shared    buffers     cached
    Mem:        451528     446976       4552          0      54188      99996
    -/+ buffers/cache:     292792     158736
    Swap:      1349420     961988     387432

The sticker on the ram says it is 512MB. Here is my config:

    [NDBD DEFAULT]
    NoOfReplicas=2
    DataMemory=250M
    IndexMemory=50M
    MaxNoOfAttributes=3000
    MaxNoOfConcurrentOperations=1000000
    StartFailureTimeout=1000000
    StartPartialTimeout=200000
    LogLevelStartup=15
    LogLevelShutdown=15
    LogLevelStatistic=15
    LogLevelCheckpoint=15
    LogLevelNodeRestart=15
    LogLevelConnection=15
    LogLevelError=15
    LogLevelInfo=15
    StopOnError=N

    [NDB_MGMD]
    hostname=192.168.0.17
    datadir=/var/lib/mysql-cluster
    Id=1

    [NDBD]
    hostname=192.168.0.10
    datadir=/usr/local/mysql/data
    Id=2

    [NDBD]
    hostname=192.168.0.11
    datadir=/usr/local/mysql/data
    Id=3

    #[MYSQLD]
    #hostname=192.168.0.13
    #Id=4

    [MYSQLD]
    hostname=192.168.0.14
    Id=4

    [MYSQLD]
    hostname=192.168.0.15
    Id=5

Do you think it would be swapping? The data nodes are not running any other applications.

Since I have all of the logging turned up to maximum, could I make the comment that it would be very useful if the missed heartbeats were logged? If they were I may have figured this out for myself. Might I suggest it be included in the next release?

Thanks, Jason
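A rough way to check the swapping question against the numbers posted above is to compare physical RAM with the configured DataMemory and IndexMemory. The following is only a sketch: the helper names `parse_free` and `configured_ndb_kib` are made up for illustration, the figures are copied from the `free` output and config.ini in this comment, and the calculation ignores every other ndbd allocation (for example the buffers sized by MaxNoOfConcurrentOperations), so real usage is higher than what it shows.

```python
# Figures copied from the report; `free` reports KiB by default.
FREE_OUTPUT = """\
             total       used       free     shared    buffers     cached
Mem:        451528     446976       4552          0      54188      99996
-/+ buffers/cache:     292792     158736
Swap:      1349420     961988     387432
"""


def parse_free(text):
    """Extract a few KiB figures from classic (procps 3.x style) `free` output."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "Mem:":
            stats["mem_total"] = int(fields[1])
        elif fields[0] == "-/+":
            # Line looks like: "-/+ buffers/cache: <used> <free>"
            stats["app_used"] = int(fields[2])
        elif fields[0] == "Swap:":
            stats["swap_used"] = int(fields[2])
    return stats


def configured_ndb_kib(data_memory_mib, index_memory_mib):
    """DataMemory + IndexMemory in KiB (hypothetical helper; ignores other buffers)."""
    return (data_memory_mib + index_memory_mib) * 1024


stats = parse_free(FREE_OUTPUT)
configured = configured_ndb_kib(250, 50)      # DataMemory=250M, IndexMemory=50M
headroom = stats["mem_total"] - configured    # RAM left for everything else

print("physical RAM (KiB):", stats["mem_total"])
print("DataMemory + IndexMemory (KiB):", configured)
print("RAM left for OS and other ndbd buffers (KiB):", headroom)
print("swap in use (KiB):", stats["swap_used"])
```

With these inputs, DataMemory plus IndexMemory alone take about two thirds of physical RAM, and the swap in use is roughly twice the size of physical RAM, which is at least consistent with the data nodes being swapped.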
[15 Aug 2006 22:38]
Jason Downing
I have had another crash. The cluster was running properly for 6 days, then a crash. I have attached the trace logs. I am using a new switch. Could you tell me if there are still missed heartbeats? Could you also comment on my previous post about the ram swapping? Thanks, Jason
[15 Aug 2006 22:40]
Jason Downing
Another crash
Attachment: 16-8 crash.zip (application/zip, text), 116.43 KiB.
[26 Sep 2006 17:33]
Jonas Oreland
Hi,

Sorry for the late reply.

1) Yes, it can be swapping; having 1M concurrent operations consumes some memory. You can try using "LockPagesInMainMemory: Yes" to avoid swapping. If ndbd then fails to lock pages, it will output a warning in ndb_X_out.log.
2) I did not find the real cause of this failure...
3) Missed heartbeats can be detected by the following (from your cluster log):

    2006-07-31 13:44:02 [MgmSrvr] WARNING -- Node 3: Node 5 missed heartbeat 2
    2006-07-31 13:44:04 [MgmSrvr] WARNING -- Node 2: Node 5 missed heartbeat 2
    2006-07-31 13:44:04 [MgmSrvr] WARNING -- Node 3: Node 5 missed heartbeat 3
    2006-07-31 13:44:05 [MgmSrvr] WARNING -- Node 2: Node 5 missed heartbeat 3
    2006-07-31 13:44:06 [MgmSrvr] WARNING -- Node 3: Node 5 missed heartbeat 4
    2006-07-31 13:44:06 [MgmSrvr] ALERT -- Node 3: Node 5 declared dead due to missed heartbeat

Please try with LockPagesInMainMemory (which should be in your config.ini).

/Jonas
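Since no tool exists for reading the trace logs, one practical option is to scan the cluster log for the heartbeat messages quoted above. The sketch below assumes the log lines have exactly the shape shown in this comment; the function name `summarize` and the regular expressions are illustrative and not part of any MySQL tooling, and other versions may format these messages differently.

```python
import re
from collections import Counter

# Sample lines copied from the cluster log excerpt in this report.
SAMPLE_LOG = """\
2006-07-31 13:44:02 [MgmSrvr] WARNING -- Node 3: Node 5 missed heartbeat 2
2006-07-31 13:44:04 [MgmSrvr] WARNING -- Node 2: Node 5 missed heartbeat 2
2006-07-31 13:44:04 [MgmSrvr] WARNING -- Node 3: Node 5 missed heartbeat 3
2006-07-31 13:44:05 [MgmSrvr] WARNING -- Node 2: Node 5 missed heartbeat 3
2006-07-31 13:44:06 [MgmSrvr] WARNING -- Node 3: Node 5 missed heartbeat 4
2006-07-31 13:44:06 [MgmSrvr] ALERT -- Node 3: Node 5 declared dead due to missed heartbeat
"""

# Assumed line formats, based only on the excerpt above.
MISSED = re.compile(
    r"^(\S+ \S+) \[MgmSrvr\] WARNING\s+-- Node (\d+): Node (\d+) missed heartbeat (\d+)$"
)
DEAD = re.compile(
    r"^(\S+ \S+) \[MgmSrvr\] ALERT\s+-- Node (\d+): Node (\d+) declared dead due to missed heartbeat"
)


def summarize(log_text):
    """Count missed-heartbeat warnings per node and collect 'declared dead' events."""
    missed = Counter()   # node id -> number of missed-heartbeat warnings about it
    dead = []            # (timestamp, reporting node, dead node)
    for line in log_text.splitlines():
        line = line.strip()
        m = MISSED.match(line)
        if m:
            missed[int(m.group(3))] += 1
            continue
        d = DEAD.match(line)
        if d:
            dead.append((d.group(1), int(d.group(2)), int(d.group(3))))
    return missed, dead


missed, dead = summarize(SAMPLE_LOG)
for node, count in sorted(missed.items()):
    print(f"node {node}: {count} missed-heartbeat warnings")
for timestamp, reporter, node in dead:
    print(f"{timestamp}: node {reporter} declared node {node} dead")
```

To run it against a real cluster log, read the file contents into a string and pass that to `summarize` in place of `SAMPLE_LOG`.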
[26 Oct 2006 23:00]
Bugs System
No feedback was provided for this bug for over a month, so it is being suspended automatically. If you are able to provide the information that was originally requested, please do so and change the status of the bug back to "Open".