Description:
I have MySQL Cluster 6.2.15 running in production.
Previously I hit a serious error, “Node 2 killed this node because GCP stop was detected”, with Disk Data in use.
After following the guide at http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster-ndbd-definition.html#mysql-cluster-nd..., the problem seemed to be resolved by increasing DiskPageBufferMemory.
However, the GCP stop error occurred again on 4 August 2009 at 06:46:43, for the first time since February.
A few months later, the cluster received a series of “Signal 11 received; Segmentation fault” errors, which brought the cluster down and required ndbd --initial to start it again.
Three weeks before the crash, I had altered the schema, which included increasing varchar sizes and adding new tables. After the alteration, I also performed a cluster restart. The sequence of events is as follows.
* Please note that log timestamps on the management node are 19 minutes behind those on the data nodes, because the clocks of the data nodes and the management node are not synchronized.
1. 2009-08-04 06:46:43: GCP stop detected on node 3
2. 2009-08-04 08:50:22: Node 3 manually restarted.
3. 2009-11-12 00:xx:xx: Altered tables and added new tables.
4. 2009-11-12 00:20:47: Restarted the cluster
5. 2009-12-04 14:21:50: Node 2 crashed
ndb_2_out.log
2009-12-04 14:21:50: [ndbd] WARNING -- Ndb kernel is stuck in: Performing Send
ndb_1_cluster.log
2009-12-04 14:03:08 [MgmSrvr] WARNING -- Node 3: Node 2 missed heartbeat 2
2009-12-04 14:03:10 [MgmSrvr] WARNING -- Node 3: Node 2 missed heartbeat 3
2009-12-04 14:03:11 [MgmSrvr] ALERT -- Node 1: Node 2 Disconnected
2009-12-04 14:03:11 [MgmSrvr] WARNING -- Node 3: Node 2 missed heartbeat 4
2009-12-04 14:03:11 [MgmSrvr] ALERT -- Node 3: Node 2 declared dead due to missed heartbeat
6. 2009-12-04 15:23:09: Tried to start node 2
7. 2009-12-04 15:24:28: Node 2 failed to start with Signal 11 received; Segmentation fault
8. 2009-12-04 15:27:21: Tried to start node 2 again several times until 17:19:50, but each attempt still ended with Signal 11 received; Segmentation fault
9. 2009-12-04 17:19:50: Node 2 started successfully this time
10. 2009-12-04 21:27:46: Node 3 crashed
ndb_3_out.log
2009-12-04 21:27:46 [ndbd] INFO -- Node 3 killed this node because GCP stop was detected
11. 2009-12-04 23:02:30: Tried to start node 3 several times until 00:26:33, but each attempt still ended with Signal 11 received; Segmentation fault
12. 2009-12-05 00:26:33: Node 3 started successfully this time
13. 2009-12-06 12:00:04: Node 2 crashed
ndb_2_out.log
2009-12-06 12:00:04 [ndbd] WARNING -- Ndb kernel is stuck in: Performing Send
2009-12-06 12:00:04 [ndbd] INFO -- Watchdog: User time: 42250 System time: 21784
2009-12-06 12:35:53 [ndbd] ALERT -- Node 2: Forced node shutdown completed. Initiated by signal 9.
ndb_1_cluster.log
2009-12-06 12:17:02 [MgmSrvr] ALERT -- Node 1: Node 2 Disconnected
14. 2009-12-06 23:00:xx: Tried to start node 2 by issuing ndbd, but this caused the node 2 server to hang. (The server is at a remote site, so someone had to go there and physically reset the machine.)
15. 2009-12-06 23:xx:xx: Checked the management node using ndb_mgm, which showed no nodes connected at all. (Node 3 was also disconnected.)
16. 2009-12-06 23:xx:xx: The service could not access the MySQL API on node 3, even though ps showed that ndbd was still running.
17. 2009-12-06 23:16:xx: Killed the ndbd process on node 3 and shut down mysql properly.
18. 2009-12-06 23:19:19: Started ndbd on node 3, but got Signal 11 received; Segmentation fault
19. 2009-12-06 23:56:40: Node 2 server physically restarted
20. 2009-12-06 23:56:40: Tried to start both node 2 and node 3 simultaneously several times until 2009-12-07 02:27:57, but still got Signal 11 received; Segmentation fault
21. After that we decided to start the cluster with --initial and restore from a previous backup. The cluster has worked fine since then.
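To line up ndb_1_cluster.log with the data node logs in the timeline above, the management node timestamps can be shifted onto the data node clock. This is only a rough sketch in Python: it assumes the offset is exactly 19 minutes (I only know it to the minute) and that the management node is the one running behind. The function name is made up for illustration.

```python
from datetime import datetime, timedelta

# Assumption: the management node clock is 19 minutes behind the data nodes.
# The real offset is only known to the minute, so results are approximate.
MGM_CLOCK_OFFSET = timedelta(minutes=19)
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def mgm_to_data_node_time(mgm_timestamp: str) -> datetime:
    """Shift an ndb_1_cluster.log timestamp onto the data node clock."""
    return datetime.strptime(mgm_timestamp, LOG_TIME_FORMAT) + MGM_CLOCK_OFFSET


# "Node 2 declared dead" as logged by the management node (step 5)
declared_dead = mgm_to_data_node_time("2009-12-04 14:03:11")
# "Ndb kernel is stuck in: Performing Send" as logged by node 2 (step 5)
watchdog_warning = datetime.strptime("2009-12-04 14:21:50", LOG_TIME_FORMAT)

print(declared_dead)                          # 2009-12-04 14:22:11
print(abs(declared_dead - watchdog_warning))  # 0:00:21
```

With the shift applied, the two events in step 5 fall within about 21 seconds of each other, which matches the stated clock difference.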
I noticed something odd: a crashed node only managed to start again about 3 hours after the crash. Is the master node doing something during this period that causes the other node to receive a segmentation fault?
2009-12-04 14:21:50: Node 2 crashed
2009-12-04 17:19:50: Node 2 successfully started after many tries
2009-12-04 21:27:46: Node 3 crashed
2009-12-05 00:26:33: Node 3 successfully started after many tries
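For reference, the two recovery windows listed above can be computed directly from the timestamps. A small Python sketch follows; the helper name is only illustrative, and all four timestamps come from the data node logs, so the management node clock offset does not affect the result.

```python
from datetime import datetime, timedelta

LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def time_until_restart(crashed_at: str, started_at: str) -> timedelta:
    """Elapsed time between a node crash and its first successful start."""
    crashed = datetime.strptime(crashed_at, LOG_TIME_FORMAT)
    started = datetime.strptime(started_at, LOG_TIME_FORMAT)
    return started - crashed


node2_gap = time_until_restart("2009-12-04 14:21:50", "2009-12-04 17:19:50")
node3_gap = time_until_restart("2009-12-04 21:27:46", "2009-12-05 00:26:33")

print("Node 2:", node2_gap)  # Node 2: 2:58:00
print("Node 3:", node3_gap)  # Node 3: 2:58:47
```

Both gaps come out at just under 3 hours and differ by less than a minute, which is why I suspect something time-bound is happening on the surviving node.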
1. Why does the GCP stop error still occur even after increasing DiskPageBufferMemory?
2. Why does the cluster return a segmentation fault when trying to start a node? (This refers to the cases where startup eventually succeeded after the crash.)
Please kindly suggest a fix for the situation above. Thanks a million.
How to repeat:
Cannot repeat, as it happens randomly.
Suggested fix:
Before the last crash, starting ndbd again was enough to recover.
After the last crash, the database had to be restored from backup.