Bug #52039 cluster restart through various node failures triggered by ndb_mgmd failure
Submitted: 13 Mar 2010 23:28
Modified: 6 Jun 2010 11:52
Reporter: Robert Klikics
Status: No Feedback
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S1 (Critical)
Version: mysql-5.1-telco-7.0
OS: Linux (Debian 5.0)
Assigned to: Assigned Account
CPU Architecture: Any
Tags: node ndb_mgmd failure telco-7.0.9b

[13 Mar 2010 23:28] Robert Klikics
Description:
About 30 minutes ago, our complete cluster was brought down by several node failures, seemingly triggered by an ndb_mgmd failure. Starting with log entries like

2010-03-13 23:20:56 [MgmtSrvr] WARNING  -- Node 3: Node 5 missed heartbeat 2
2010-03-13 23:20:58 [MgmtSrvr] WARNING  -- Node 3: Node 5 missed heartbeat 3
2010-03-13 23:21:02 [MgmtSrvr] WARNING  -- Node 3: Node 5 missed heartbeat 2
2010-03-13 23:21:07 [MgmtSrvr] WARNING  -- Node 3: Node 5 missed heartbeat 2
2010-03-13 23:21:08 [MgmtSrvr] ALERT    -- Node 2: Node 10 Disconnected
2010-03-13 23:21:08 [MgmtSrvr] WARNING  -- Node 2: Node 10 missed heartbeat 2

all nodes were disconnected from the ndb_mgmd server. Network latency/throughput was OK (roughly 2%-5%, normal latency), there were no network errors on the cluster/API nodes, and load on the nodes was about 25%. The ndb_mgmd seemed to be stuck and was not responding (e.g. ndb_mgm -e show did not work and hung).
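
To get a quick overview of which node reported which peer, entries like the ones quoted above could be tallied with a short script. The following is only a minimal Python sketch, assuming every relevant line follows exactly the format shown above; the names LINE_RE, summarize and SAMPLE are made up for illustration and are not part of any MySQL Cluster tool.

```python
import re
from collections import Counter

# Assumed line format, taken from the cluster-log excerpt quoted above:
# "<date> <time> [MgmtSrvr] <LEVEL>  -- Node <reporter>: Node <peer> <event>"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"\[MgmtSrvr\] (?P<level>\w+)\s+-- "
    r"Node (?P<reporter>\d+): Node (?P<peer>\d+) "
    r"(?P<event>missed heartbeat \d+|Disconnected)$"
)

# The six entries quoted in this report, used as sample input.
SAMPLE = """\
2010-03-13 23:20:56 [MgmtSrvr] WARNING  -- Node 3: Node 5 missed heartbeat 2
2010-03-13 23:20:58 [MgmtSrvr] WARNING  -- Node 3: Node 5 missed heartbeat 3
2010-03-13 23:21:02 [MgmtSrvr] WARNING  -- Node 3: Node 5 missed heartbeat 2
2010-03-13 23:21:07 [MgmtSrvr] WARNING  -- Node 3: Node 5 missed heartbeat 2
2010-03-13 23:21:08 [MgmtSrvr] ALERT    -- Node 2: Node 10 Disconnected
2010-03-13 23:21:08 [MgmtSrvr] WARNING  -- Node 2: Node 10 missed heartbeat 2
"""


def summarize(lines):
    """Count missed-heartbeat and disconnect events per (reporter, peer) pair."""
    missed = Counter()
    disconnected = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # skip lines that do not match the assumed format
        key = (int(m["reporter"]), int(m["peer"]))
        if m["event"] == "Disconnected":
            disconnected[key] += 1
        else:
            missed[key] += 1
    return missed, disconnected


if __name__ == "__main__":
    missed, disconnected = summarize(SAMPLE.splitlines())
    for (reporter, peer), n in sorted(missed.items()):
        print(f"node {reporter} reported {n} missed heartbeat(s) from node {peer}")
    for (reporter, peer), n in sorted(disconnected.items()):
        print(f"node {reporter} reported {n} disconnect(s) of node {peer}")
```

For a real cluster log, the lines of the log file would be passed to summarize() instead of SAMPLE; lines in any other format are simply ignored.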

After a restart of the ndb_mgmd, the nodes tried to reconnect to it, but some of the data nodes segfaulted, so we had to restart the data nodes completely.

An ndb_error_reporter report is available at the following URL:
http://85.25.144.101/files/ndb_error_report_20100313235936.tar.bz2

Thanks in advance
Martin

How to repeat:
No idea at this moment.
[6 May 2010 11:52] Jonas Oreland
Hi,

I've analyzed the logs.
I don't know how you measured that latency was up "2-5%",
but from what I can see, it sure looks like some kind
of (transient) network failure.

I did find an incorrectly formatted error message,
ndbrequire for node 3, which turned out to be Bug #48852,
fixed in 7.0.10.

Unless you manage to reproduce this or present some other information,
I think this bug will not move forward.

Setting status to "Need-feedback"

/Jonas
[6 Jun 2010 23:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".