Bug #16733 Signal 11 (Segmentation Fault) causes cluster shutdown
Submitted: 23 Jan 2006 18:05 Modified: 6 Mar 2006 8:56
Reporter: Tim Heath Email Updates:
Status: No Feedback Impact on me:
None 
Category:MySQL Cluster: Cluster (NDB) storage engine Severity:S2 (Serious)
Version:5.0.16 OS:Linux (SuSE Linux 9.3)
Assigned to: Assigned Account CPU Architecture:Any

[23 Jan 2006 18:05] Tim Heath
Description:
Running test against cluster involving mixture of Update and Select queries-- approximately 200-300 queries per second. After ~48 hours of continuous testing, ndbd process received Segmentation fault, causing cluster to shutdown. The following message was written to the ndbd server's error log:

Time: Monday 23 January 2006 - 00:55:02
Status: Temporary error, restart node
Message: Error OS signal received (Internal error, programming error or missing error message, please report a bug)
Error: 6000 
Error data: Signal 11 received; Segmentation fault
Error object: main.cpp
Program: /usr/sbin/ndbd
Pid: 23123
Trace: /var/lib/mysql-cluster/ndb_3_trace.log.17
Version: Version 5.0.16
***EOM***

Configuration:
4-server cluster: 2 ndbd nodes, 3 API nodes, 2 management nodes
All servers are Sun v20z running SuSE Linux 9.3
MySQL Cluster v5.0.16

How to repeat:
Unable to repeat at will. This same test has run for 1 week without failing.
[23 Jan 2006 18:20] Jonas Oreland
Hi,

Can you upload /var/lib/mysql-cluster/ndb_3_trace.log.17
 and possibly you config.ini aswell
[23 Jan 2006 19:06] Tim Heath
Files uploaded as requested
[23 Jan 2006 19:25] Jonas Oreland
The tracefile showed that the error actually came from node 2.
Can you sent the tracefile from that, and the cluster log aswell
[23 Jan 2006 20:11] Tim Heath
cluster log (arbitrator)

Attachment: cluster.log.node4.gz (application/gzip, text), 34.18 KiB.

[23 Jan 2006 20:12] Tim Heath
Additional files uploaded
[25 Jan 2006 10:35] Jonas Oreland
Looking at clusterlog I see the lost heartbeats.
This indicates a serious problem...
This has happened to others when e.i. there is a cron schedulted os backup or
  similar that will temporary swap out ndbd process.
Or there is _too_ high load.

The signal 11, is ofcourse incorrect, but the cluster was daying anyway...

Can you investigate further?
[25 Jan 2006 15:53] Tim Heath
Jonas, thanks for the information. When I looked at both cluster logs (from the two management nodes), it looked like both ndbd nodes were experiencing missed heartbeats and that all nodes in the cluster were involved-- i.e., the two ndbd nodes were missing heartbeats from each other but were also missing heartbeats from the management nodes and the mysqld/API nodes. So, if I'm reading this correctly, it appears that this was not related to a problem on a single server (?).

There were no backups or other scheduled jobs running at the time of the failure. All of our cluster servers are running on a fiber gigE switch, and the switch did not log any failures during this period. I can't find anything in the syslog entries (/var/log/messages) on any of the boxes that would point to a server problem.

Would you say that this points to overworked ndbd nodes? If so, could you recommend any ndbd tuning adjustments? 

Thanks again for your help
[6 Feb 2006 8:56] Jonas Oreland
>Would you say that this points to overworked ndbd nodes? If so, could you
>recommend any ndbd tuning adjustments? 

It's hard to say, it could be a bug aswell.

I would recommend, keeping track of load on machine.
Using "vmstat" to see general load on machine and "top" to see load on ndbd process.

If the load is constantly high, then there might be an overload problem.
If not it might be a bug, or a specific sql query that causing it.

/Jonas
[7 Mar 2006 0:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".