MySQL Bugs: #54131: cluster shutdown after one data node misses two heartbeats

Bug #54131	cluster shutdown after one data node misses two heartbeats
Submitted:	1 Jun 2010 9:16	Modified:	26 Dec 2010 15:46
Reporter:	William Strucke
Status:	Verified
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S1 (Critical)
Version:	mysql-5.1-telco-7.1	OS:	Linux (Cent OS 5.4)
Assigned to:		CPU Architecture:	Any
Tags:	forced node shutdown, missed heartbeat, mysql-5.1.44 ndb-7.1.3

Description:
The cluster has one management node, four data nodes in two node groups, and two SQL nodes. It works well, except that occasionally the management node reports that one data node has missed heartbeats, and the entire cluster then shuts down:

2010-06-01 03:29:53 [MgmtSrvr] INFO     -- Node 1: Local checkpoint 542 completed
2010-06-01 04:25:11 [MgmtSrvr] INFO     -- Node 1: Local checkpoint 543 started. Keep GCI = 2593229 oldest restorable GCI = 2593726
2010-06-01 04:37:19 [MgmtSrvr] WARNING  -- Node 1: Node 4 missed heartbeat 2
2010-06-01 04:37:20 [MgmtSrvr] WARNING  -- Node 1: Node 4 missed heartbeat 3
2010-06-01 04:37:21 [MgmtSrvr] ALERT    -- Node 40: Node 4 Disconnected
2010-06-01 04:37:22 [MgmtSrvr] INFO     -- Node 1: Communication to Node 42 closed
2010-06-01 04:37:22 [MgmtSrvr] INFO     -- Node 1: Communication to Node 43 closed
2010-06-01 04:37:22 [MgmtSrvr] INFO     -- Node 1: Communication to Node 47 closed
2010-06-01 04:37:22 [MgmtSrvr] INFO     -- Node 1: Communication to Node 48 closed
2010-06-01 04:37:22 [MgmtSrvr] INFO     -- Node 2: Communication to Node 42 closed
2010-06-01 04:37:22 [MgmtSrvr] INFO     -- Node 2: Communication to Node 43 closed
2010-06-01 04:37:22 [MgmtSrvr] INFO     -- Node 2: Communication to Node 47 closed
2010-06-01 04:37:22 [MgmtSrvr] INFO     -- Node 2: Communication to Node 48 closed
2010-06-01 04:37:22 [MgmtSrvr] INFO     -- Node 3: Communication to Node 42 closed
2010-06-01 04:37:22 [MgmtSrvr] INFO     -- Node 3: Communication to Node 43 closed
2010-06-01 04:37:22 [MgmtSrvr] INFO     -- Node 3: Communication to Node 47 closed
2010-06-01 04:37:22 [MgmtSrvr] INFO     -- Node 3: Communication to Node 48 closed
2010-06-01 04:37:22 [MgmtSrvr] ALERT    -- Node 40: Node 4 Disconnected
2010-06-01 04:37:22 [MgmtSrvr] ALERT    -- Node 3: Node 42 Disconnected
2010-06-01 04:37:22 [MgmtSrvr] ALERT    -- Node 3: Node 43 Disconnected
2010-06-01 04:37:22 [MgmtSrvr] ALERT    -- Node 3: Node 47 Disconnected
2010-06-01 04:37:22 [MgmtSrvr] ALERT    -- Node 3: Node 48 Disconnected
2010-06-01 04:37:22 [MgmtSrvr] ALERT    -- Node 1: Node 42 Disconnected
2010-06-01 04:37:22 [MgmtSrvr] ALERT    -- Node 1: Node 43 Disconnected
2010-06-01 04:37:22 [MgmtSrvr] ALERT    -- Node 1: Node 47 Disconnected
2010-06-01 04:37:22 [MgmtSrvr] ALERT    -- Node 1: Node 48 Disconnected
2010-06-01 04:37:22 [MgmtSrvr] ALERT    -- Node 40: Node 1 Disconnected
2010-06-01 04:37:22 [MgmtSrvr] ALERT    -- Node 40: Node 3 Disconnected
2010-06-01 04:37:22 [MgmtSrvr] ALERT    -- Node 3: Forced node shutdown completed. Initiated by signal 6. Caused by error 6000: 'Error OS signal received(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2010-06-01 04:37:22 [MgmtSrvr] ALERT    -- Node 40: Node 2 Disconnected
2010-06-01 04:37:22 [MgmtSrvr] ALERT    -- Node 2: Forced node shutdown completed. Initiated by signal 6. Caused by error 6000: 'Error OS signal received(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2010-06-01 04:37:22 [MgmtSrvr] ALERT    -- Node 1: Forced node shutdown completed. Initiated by signal 6. Caused by error 6000: 'Error OS signal received(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2010-06-01 04:37:22 [MgmtSrvr] ALERT    -- Node 4: Forced node shutdown completed. Initiated by signal 6. Caused by error 6000: 'Error OS signal received(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
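For anyone triaging a similar cluster log, the following is a minimal Python sketch that extracts the missed-heartbeat warnings and forced-shutdown alerts from lines in the format shown above. The function name `summarize` and the regular expressions are illustrative assumptions derived only from the log excerpt in this report; they are not part of any MySQL Cluster tool and may not match other log formats or versions.

```python
import re
from collections import defaultdict

# Assumed line layout, taken from the excerpt in this report:
# "<date> <time> [MgmtSrvr] <LEVEL> -- Node <reporter>: <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[MgmtSrvr\] "
    r"(?P<level>\w+)\s+-- Node (?P<reporter>\d+): (?P<msg>.*)$"
)
MISSED_RE = re.compile(r"^Node (?P<node>\d+) missed heartbeat (?P<count>\d+)$")
SHUTDOWN_RE = re.compile(r"^Forced node shutdown completed")


def summarize(lines):
    """Return (missed, shutdowns) for an iterable of cluster-log lines.

    missed:    {node_id: [(timestamp, heartbeat_count), ...]}
    shutdowns: [(timestamp, node_id), ...] for forced node shutdowns
    """
    missed = defaultdict(list)
    shutdowns = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip lines that do not follow the assumed layout
        msg = m.group("msg")
        hb = MISSED_RE.match(msg)
        if hb:
            missed[int(hb.group("node"))].append(
                (m.group("ts"), int(hb.group("count")))
            )
        elif SHUTDOWN_RE.match(msg):
            # The node named before the colon is the one that shut down.
            shutdowns.append((m.group("ts"), int(m.group("reporter"))))
    return dict(missed), shutdowns


# Four lines copied verbatim from the log in this report.
SAMPLE = """\
2010-06-01 04:37:19 [MgmtSrvr] WARNING  -- Node 1: Node 4 missed heartbeat 2
2010-06-01 04:37:20 [MgmtSrvr] WARNING  -- Node 1: Node 4 missed heartbeat 3
2010-06-01 04:37:21 [MgmtSrvr] ALERT    -- Node 40: Node 4 Disconnected
2010-06-01 04:37:22 [MgmtSrvr] ALERT    -- Node 4: Forced node shutdown completed. Initiated by signal 6. Caused by error 6000: 'Error OS signal received(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
"""

if __name__ == "__main__":
    missed, shutdowns = summarize(SAMPLE.splitlines())
    print("missed heartbeats:", missed)
    print("forced shutdowns:", shutdowns)
```

Running this over the full cluster log makes it easier to see which node missed heartbeats first and how quickly the forced shutdowns followed.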

How to repeat:
I do not know what causes this. The cluster runs fine for weeks, then the shutdown occurs at random.

We need the logs, e.g. collected using ndb_error_reporter.

ndb_error_report for a similar issue.

Attachment: ndb_error_report_20110408110653.tar.bz2.gz (application/x-gzip, text), 395.15 KiB.

Hi,
  I just found a crashed cluster with a similar error.  I uploaded the ndb_error_report file; don't laugh at the combined bz2 and gz compression, it was needed to get the file under 500KB.  The setup that crashed is a bit special: it is on EC2 and runs with NoOfReplicas = 3.  It was not especially loaded when it crashed.
