Bug #49585 Signal 11 received; Segmentation fault error
Submitted: 10 Dec 2009 12:58
Modified: 9 Mar 2016 16:45
Reporter: WAI MUN ALEX LOO
Status: Duplicate
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S1 (Critical)
Version: mysql-5.1-telco-6.2
OS: Linux (Red Hat 5.1)
Assigned to:
CPU Architecture: Any
Tags: 6.2.15

[10 Dec 2009 12:58] WAI MUN ALEX LOO
Description:
I have MySQL Cluster 6.2.15 running in production.

Previously I faced a serious error, “Node 2 killed this node because GCP stop was detected”, with Disk Data in use.
After following the guide at http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster-ndbd-definition.html#mysql-cluster-nd..., the problem seemed to be resolved by increasing DiskPageBufferMemory.
However, the GCP stop error occurred again on 4 August 2009 at 06:46:43, for the first time since February.

A few months later, the cluster received a series of "Signal 11 received; Segmentation fault" errors, which brought the cluster down and required ndbd --initial to start it again.
Three weeks before the crash, I had made schema changes, which included increasing VARCHAR sizes and adding new tables. After the changes, I also performed a cluster restart. The sequence of events is as follows.

* Please note that the log timestamps on the management node are about 19 minutes behind those on the data nodes, because the clocks on the data nodes and the management node are not synchronized.
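
To compare the two sets of logs, the management node timestamps can be shifted onto the data node clock. The following is only a minimal sketch: it assumes the offset is a constant 19 minutes and that every cluster log line starts with a "YYYY-MM-DD HH:MM:SS" timestamp. The function name to_data_node_time is made up for this illustration.

```python
from datetime import datetime, timedelta

# Assumption: the management node clock is a constant 19 minutes behind the data nodes.
MGM_CLOCK_OFFSET = timedelta(minutes=19)
TS_FORMAT = "%Y-%m-%d %H:%M:%S"


def to_data_node_time(mgm_log_line: str) -> datetime:
    """Parse the leading timestamp of a management log line and shift it to data node time."""
    stamp = mgm_log_line[:19]  # "YYYY-MM-DD HH:MM:SS" is 19 characters long
    return datetime.strptime(stamp, TS_FORMAT) + MGM_CLOCK_OFFSET


line = "2009-12-04 14:03:11 [MgmSrvr] ALERT    -- Node 1: Node 2 Disconnected"
print(to_data_node_time(line))  # 2009-12-04 14:22:11
```

With this shift, the "Node 2 Disconnected" entry from ndb_1_cluster.log lands within seconds of the 14:21:50 watchdog warning in ndb_2_out.log.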

1. 2009-08-04 06:46:43: GCP stop detected on node 3
2. 2009-08-04 08:50:22: Node 3 manually restarted.
3. 2009-11-12 00:xx:xx: Altered tables and added new tables.
4. 2009-11-12 00:20:47: Restarted the cluster
5. 2009-12-04 14:21:50: Node 2 crashed 

ndb_2_out.log
2009-12-04 14:21:50: [ndbd] WARNING  -- Ndb kernel is stuck in: Performing Send

ndb_1_cluster.log
2009-12-04 14:03:08 [MgmSrvr] WARNING  -- Node 3: Node 2 missed heartbeat 2
2009-12-04 14:03:10 [MgmSrvr] WARNING  -- Node 3: Node 2 missed heartbeat 3
2009-12-04 14:03:11 [MgmSrvr] ALERT    -- Node 1: Node 2 Disconnected
2009-12-04 14:03:11 [MgmSrvr] WARNING  -- Node 3: Node 2 missed heartbeat 4
2009-12-04 14:03:11 [MgmSrvr] ALERT    -- Node 3: Node 2 declared dead due to missed heartbeat

6. 2009-12-04 15:23:09: Tried to start node 2
7. 2009-12-04 15:24:28: Node 2 failed to start because of Signal 11 received; Segmentation fault
8. 2009-12-04 15:27:21: Tried to start node 2 several more times until 17:19:50, but each attempt failed with Signal 11 received; Segmentation fault
9. 2009-12-04 17:19:50: Node 2 started successfully this time
10. 2009-12-04 21:27:46: Node 3 crashed

ndb_3_out.log
2009-12-04 21:27:46 [ndbd] INFO     -- Node 3 killed this node because GCP stop was detected

11. 2009-12-04 23:02:30: Tried to start node 3 several times until 00:26:33, but each attempt failed with Signal 11 received; Segmentation fault
12. 2009-12-05 00:26:33: Node 3 started successfully this time
13. 2009-12-06 12:00:04: Node 2 crashed

ndb_2_out.log
2009-12-06 12:00:04 [ndbd] WARNING  -- Ndb kernel is stuck in: Performing Send
2009-12-06 12:00:04 [ndbd] INFO     -- Watchdog: User time: 42250  System time: 21784
2009-12-06 12:35:53 [ndbd] ALERT    -- Node 2: Forced node shutdown completed. Initiated by signal 9.

ndb_1_cluster.log
2009-12-06 12:17:02 [MgmSrvr] ALERT    -- Node 1: Node 2 Disconnected

14. 2009-12-06 23:00:xx: Tried to start node 2 by issuing ndbd, but this caused the node 2 server to hang. (The server is at a different location, so someone had to go on site and physically reset the machine.)
15. 2009-12-06 23:xx:xx: Checked the management node using ndb_mgm, which showed that no nodes were connected at all. (Node 3 was also disconnected.)
16. 2009-12-06 23:xx:xx: Services were unable to access the node 3 MySQL API, even though the ps command showed that ndbd was still running.
17. 2009-12-06 23:16:xx: Killed the ndbd process on node 3 and shut down mysql properly.
18. 2009-12-06 23:19:19: Started ndbd on node 3, but got Signal 11 received; Segmentation fault
19. 2009-12-06 23:56:40: Node 2 server physically restarted
20. 2009-12-06 23:56:40: Tried to start both node 2 and node 3 simultaneously several times until 2009-12-07 02:27:57, but each attempt failed with Signal 11 received; Segmentation fault
21. After that, we decided to start the cluster with --initial and restore from a previous backup. The cluster has worked fine since then.

I noticed something odd: a crashed node only managed to start again about 3 hours after the crash. Is the master node doing something during this period that causes the other node to receive a segmentation fault?

2009-12-04 14:21:50: Node 2 crashed
2009-12-04 17:19:50: Node 2 successfully started after many tries

2009-12-04 21:27:46: Node 3 crashed
2009-12-05 00:26:33: Node 3 successfully started after many tries
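
The gap between crash and successful start can be worked out from the four timestamps above. This is just a quick check of the arithmetic; the helper name downtime is made up for this illustration.

```python
from datetime import datetime, timedelta

TS_FORMAT = "%Y-%m-%d %H:%M:%S"


def downtime(crashed: str, started: str) -> timedelta:
    """Return the time between a node crash and its next successful start."""
    return datetime.strptime(started, TS_FORMAT) - datetime.strptime(crashed, TS_FORMAT)


node2 = downtime("2009-12-04 14:21:50", "2009-12-04 17:19:50")
node3 = downtime("2009-12-04 21:27:46", "2009-12-05 00:26:33")
print(node2)  # 2:58:00
print(node3)  # 2:58:47
```

Both recoveries took just under 3 hours, and the two gaps differ by less than a minute.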

1. Why did the GCP stop error still occur even after DiskPageBufferMemory was increased?
2. Why did the data nodes return a segmentation fault when trying to start? (This includes the cases where startup eventually succeeded after the crash.)

Please suggest a fix for the situation above. Many thanks.

How to repeat:
Cannot repeat, as it happens randomly.

Suggested fix:
Before the last crash, it was only necessary to restart ndbd.
After the last crash, the database had to be restored.
[10 Dec 2009 13:09] WAI MUN ALEX LOO
Management node logs

Attachment: management node_logs.zip (application/zip, text), 309.83 KiB.

[10 Dec 2009 13:18] WAI MUN ALEX LOO
Adding the machine specifications:

1 Management node 
Processors: DualCore Intel® Xeon® Processor 3.00 GHz
RAM: 2GB

2 Data nodes
RAM: 8GB
Processors: Intel Xeon 5160 DC (3.00 GHz/1333 FSB)*4MB (1 x 4MB)

I have also uploaded the data node logs to your FTP server under these names:
49585_data_node_2_logs.zip
49585_data_node_3_logs.zip
[3 Feb 2010 4:05] WAI MUN ALEX LOO
Any updates on this?
[9 Mar 2016 16:43] Gustaf Thorslund
Posted by developer:
 
This seems to be a duplicate of Bug #45154 / Bug #11753672, or at least closely related to it.

The issue with GCP stop after changes to VARCHAR size could also be related to VARCHAR occupying its full declared size on disk (that is, the same as CHAR(n)). So if the VARCHAR size was increased, it would cause a higher load on the system.
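
As a rough illustration of the effect, the sketch below estimates the extra on-disk bytes when a fixed-width column is widened. All figures are hypothetical, since the report does not give the actual column sizes, character set, or row counts. Only the character data is modeled; any per-column length bytes and page overhead are ignored.

```python
def fixed_width_bytes(max_chars: int, bytes_per_char: int) -> int:
    """On-disk bytes for a column stored at its full declared width, like CHAR(n)."""
    return max_chars * bytes_per_char


# Hypothetical example: a single-byte character set column widened from 64 to 255 characters.
old_width = fixed_width_bytes(64, 1)
new_width = fixed_width_bytes(255, 1)
rows = 1_000_000  # hypothetical row count

extra_bytes = (new_width - old_width) * rows
print(extra_bytes)  # 191000000
```

Under these assumed figures, one widened column adds roughly 191 MB of disk data that has to be written and buffered, regardless of how long the stored strings actually are.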

/Gustaf