MySQL Bugs: #16733: Signal 11 (Segmentation Fault) causes cluster shutdown

Bug #16733	Signal 11 (Segmentation Fault) causes cluster shutdown
Submitted:	23 Jan 2006 18:05	Modified:	6 Mar 2006 8:56
Reporter:	Tim Heath	Email Updates:
Status:	No Feedback	Impact on me:	None
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S2 (Serious)
Version:	5.0.16	OS:	Linux (SuSE Linux 9.3)
Assigned to:	Assigned Account	CPU Architecture:	Any

Description:
Running test against cluster involving mixture of Update and Select queries-- approximately 200-300 queries per second. After ~48 hours of continuous testing, ndbd process received Segmentation fault, causing cluster to shutdown. The following message was written to the ndbd server's error log:

Time: Monday 23 January 2006 - 00:55:02
Status: Temporary error, restart node
Message: Error OS signal received (Internal error, programming error or missing error message, please report a bug)
Error: 6000 
Error data: Signal 11 received; Segmentation fault
Error object: main.cpp
Program: /usr/sbin/ndbd
Pid: 23123
Trace: /var/lib/mysql-cluster/ndb_3_trace.log.17
Version: Version 5.0.16
***EOM***

Configuration:
4-server cluster: 2 ndbd nodes, 3 API nodes, 2 management nodes
All servers are Sun v20z running SuSE Linux 9.3
MySQL Cluster v5.0.16

How to repeat:
Unable to repeat at will. This same test has run for 1 week without failing.

Hi,

Can you upload /var/lib/mysql-cluster/ndb_3_trace.log.17
 and possibly you config.ini aswell

Files uploaded as requested

The tracefile showed that the error actually came from node 2.
Can you sent the tracefile from that, and the cluster log aswell

cluster log (arbitrator)

Attachment: cluster.log.node4.gz (application/gzip, text), 34.18 KiB.

Additional files uploaded

Looking at clusterlog I see the lost heartbeats.
This indicates a serious problem...
This has happened to others when e.i. there is a cron schedulted os backup or
  similar that will temporary swap out ndbd process.
Or there is _too_ high load.

The signal 11, is ofcourse incorrect, but the cluster was daying anyway...

Can you investigate further?

Jonas, thanks for the information. When I looked at both cluster logs (from the two management nodes), it looked like both ndbd nodes were experiencing missed heartbeats and that all nodes in the cluster were involved-- i.e., the two ndbd nodes were missing heartbeats from each other but were also missing heartbeats from the management nodes and the mysqld/API nodes. So, if I'm reading this correctly, it appears that this was not related to a problem on a single server (?).

There were no backups or other scheduled jobs running at the time of the failure. All of our cluster servers are running on a fiber gigE switch, and the switch did not log any failures during this period. I can't find anything in the syslog entries (/var/log/messages) on any of the boxes that would point to a server problem.

Would you say that this points to overworked ndbd nodes? If so, could you recommend any ndbd tuning adjustments? 

Thanks again for your help

>Would you say that this points to overworked ndbd nodes? If so, could you
>recommend any ndbd tuning adjustments? 

It's hard to say, it could be a bug aswell.

I would recommend, keeping track of load on machine.
Using "vmstat" to see general load on machine and "top" to see load on ndbd process.

If the load is constantly high, then there might be an overload problem.
If not it might be a bug, or a specific sql query that causing it.

/Jonas

No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".