Bug #52039	cluster restart through various node failures triggered by ndb_mgmd failure
Submitted:	13 Mar 2010 23:28	Modified:	6 Jun 2010 11:52
Reporter:	Robert Klikics
Status:	No Feedback
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S1 (Critical)
Version:	mysql-5.1-telco-7.0	OS:	Linux (Debian 5.0)
Assigned to:	Assigned Account	CPU Architecture:	Any
Tags:	node ndb_mgmd failure telco-7.0.9b

Description:
About 30 minutes ago, our complete cluster was killed by several node failures, seemingly triggered by an ndb_mgmd failure. Starting with some log entries like 

2010-03-13 23:20:56 [MgmtSrvr] WARNING  -- Node 3: Node 5 missed heartbeat 2
2010-03-13 23:20:58 [MgmtSrvr] WARNING  -- Node 3: Node 5 missed heartbeat 3
2010-03-13 23:21:02 [MgmtSrvr] WARNING  -- Node 3: Node 5 missed heartbeat 2
2010-03-13 23:21:07 [MgmtSrvr] WARNING  -- Node 3: Node 5 missed heartbeat 2
2010-03-13 23:21:08 [MgmtSrvr] ALERT    -- Node 2: Node 10 Disconnected
2010-03-13 23:21:08 [MgmtSrvr] WARNING  -- Node 2: Node 10 missed heartbeat 2

all nodes were disconnected from the ndb_mgmd server. Network latency/throughput was OK (around 2%-5%, normal latency), there were no network errors on the cluster/API nodes, and load on the nodes was about 25%. The ndb_mgmd seemed to be stuck and was not responding (e.g. ndb_mgm -e show did not work and hung).
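
To see at a glance which nodes were affected in a cluster log like the excerpt above, the entries can be tallied with a small script. This is only a rough sketch: the regular expression is derived from the six lines quoted in this report and is an assumption about the log format, not an official parser, and the name summarize is made up for illustration.

```python
import re
from collections import Counter

# Assumed line format, taken from the log excerpt quoted in this report:
# "2010-03-13 23:20:56 [MgmtSrvr] WARNING  -- Node 3: Node 5 missed heartbeat 2"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"\[MgmtSrvr\] (?P<level>\w+)\s+-- "
    r"Node (?P<reporter>\d+): Node (?P<subject>\d+) (?P<event>.+)$"
)


def summarize(lines):
    """Count missed-heartbeat warnings per node and collect disconnect events.

    Hypothetical helper; lines that do not match the assumed format are skipped.
    """
    missed = Counter()
    disconnected = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m is None:
            continue
        subject = int(m.group("subject"))
        event = m.group("event")
        if event.startswith("missed heartbeat"):
            missed[subject] += 1
        elif event == "Disconnected":
            disconnected.append((m.group("ts"), subject))
    return missed, disconnected


# The log lines quoted in the description above.
SAMPLE = """\
2010-03-13 23:20:56 [MgmtSrvr] WARNING  -- Node 3: Node 5 missed heartbeat 2
2010-03-13 23:20:58 [MgmtSrvr] WARNING  -- Node 3: Node 5 missed heartbeat 3
2010-03-13 23:21:02 [MgmtSrvr] WARNING  -- Node 3: Node 5 missed heartbeat 2
2010-03-13 23:21:07 [MgmtSrvr] WARNING  -- Node 3: Node 5 missed heartbeat 2
2010-03-13 23:21:08 [MgmtSrvr] ALERT    -- Node 2: Node 10 Disconnected
2010-03-13 23:21:08 [MgmtSrvr] WARNING  -- Node 2: Node 10 missed heartbeat 2
"""

missed, disconnected = summarize(SAMPLE.splitlines())
for node, count in sorted(missed.items()):
    print(f"node {node}: {count} missed-heartbeat warnings")
for ts, node in disconnected:
    print(f"node {node} disconnected at {ts}")
```

Run against the full cluster log rather than this excerpt, a tally like this may help show whether the warnings cluster around one node or are spread across all of them.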

After a restart of the ndb_mgmd, the nodes tried to reconnect to it, but some of the data nodes segfaulted, so we had to restart the data nodes completely.

An ndb_error_reporter report is available at the following URL:
http://85.25.144.101/files/ndb_error_report_20100313235936.tar.bz2

Thanks in advance
Martin

How to repeat:
No idea at this moment.

Hi,

I've analyzed the logs.
I don't know how you measured that latency was up "2-5%",
but from what I can see, it sure looks like some kind
of (transient) network failure.

I did find an incorrectly formatted error message,
an ndbrequire for node 3, which turned out to be Bug #48852,
fixed in 7.0.10.

Unless you manage to reproduce this or provide some other information,
I think this bug will not move forward.

Setting status to "Need-feedback"

/Jonas

No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".