Bug #54131 cluster shutdown after one data node misses two heartbeats
Submitted: 1 Jun 2010 9:16 Modified: 26 Dec 2010 15:46
Reporter: William Strucke
Status: Verified
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S1 (Critical)
Version: mysql-5.1-telco-7.1
OS: Linux (CentOS 5.4)
Assigned to:
CPU Architecture: Any
Tags: forced node shutdown, missed heartbeat, mysql-5.1.44 ndb-7.1.3

[1 Jun 2010 9:16] William Strucke
Description:
The cluster has one management node, four data nodes in two node groups, and two SQL nodes. It works well, except that occasionally the management node reports that one data node has missed two heartbeats, and then the entire cluster shuts down:

2010-06-01 03:29:53 [MgmtSrvr] INFO     -- Node 1: Local checkpoint 542 completed
2010-06-01 04:25:11 [MgmtSrvr] INFO     -- Node 1: Local checkpoint 543 started. Keep GCI = 2593229 oldest restorable GCI = 2593726
2010-06-01 04:37:19 [MgmtSrvr] WARNING  -- Node 1: Node 4 missed heartbeat 2
2010-06-01 04:37:20 [MgmtSrvr] WARNING  -- Node 1: Node 4 missed heartbeat 3
2010-06-01 04:37:21 [MgmtSrvr] ALERT    -- Node 40: Node 4 Disconnected
2010-06-01 04:37:22 [MgmtSrvr] INFO     -- Node 1: Communication to Node 42 closed
2010-06-01 04:37:22 [MgmtSrvr] INFO     -- Node 1: Communication to Node 43 closed
2010-06-01 04:37:22 [MgmtSrvr] INFO     -- Node 1: Communication to Node 47 closed
2010-06-01 04:37:22 [MgmtSrvr] INFO     -- Node 1: Communication to Node 48 closed
2010-06-01 04:37:22 [MgmtSrvr] INFO     -- Node 2: Communication to Node 42 closed
2010-06-01 04:37:22 [MgmtSrvr] INFO     -- Node 2: Communication to Node 43 closed
2010-06-01 04:37:22 [MgmtSrvr] INFO     -- Node 2: Communication to Node 47 closed
2010-06-01 04:37:22 [MgmtSrvr] INFO     -- Node 2: Communication to Node 48 closed
2010-06-01 04:37:22 [MgmtSrvr] INFO     -- Node 3: Communication to Node 42 closed
2010-06-01 04:37:22 [MgmtSrvr] INFO     -- Node 3: Communication to Node 43 closed
2010-06-01 04:37:22 [MgmtSrvr] INFO     -- Node 3: Communication to Node 47 closed
2010-06-01 04:37:22 [MgmtSrvr] INFO     -- Node 3: Communication to Node 48 closed
2010-06-01 04:37:22 [MgmtSrvr] ALERT    -- Node 40: Node 4 Disconnected
2010-06-01 04:37:22 [MgmtSrvr] ALERT    -- Node 3: Node 42 Disconnected
2010-06-01 04:37:22 [MgmtSrvr] ALERT    -- Node 3: Node 43 Disconnected
2010-06-01 04:37:22 [MgmtSrvr] ALERT    -- Node 3: Node 47 Disconnected
2010-06-01 04:37:22 [MgmtSrvr] ALERT    -- Node 3: Node 48 Disconnected
2010-06-01 04:37:22 [MgmtSrvr] ALERT    -- Node 1: Node 42 Disconnected
2010-06-01 04:37:22 [MgmtSrvr] ALERT    -- Node 1: Node 43 Disconnected
2010-06-01 04:37:22 [MgmtSrvr] ALERT    -- Node 1: Node 47 Disconnected
2010-06-01 04:37:22 [MgmtSrvr] ALERT    -- Node 1: Node 48 Disconnected
2010-06-01 04:37:22 [MgmtSrvr] ALERT    -- Node 40: Node 1 Disconnected
2010-06-01 04:37:22 [MgmtSrvr] ALERT    -- Node 40: Node 3 Disconnected
2010-06-01 04:37:22 [MgmtSrvr] ALERT    -- Node 3: Forced node shutdown completed. Initiated by signal 6. Caused by error 6000: 'Error OS signal received(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2010-06-01 04:37:22 [MgmtSrvr] ALERT    -- Node 40: Node 2 Disconnected
2010-06-01 04:37:22 [MgmtSrvr] ALERT    -- Node 2: Forced node shutdown completed. Initiated by signal 6. Caused by error 6000: 'Error OS signal received(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2010-06-01 04:37:22 [MgmtSrvr] ALERT    -- Node 1: Forced node shutdown completed. Initiated by signal 6. Caused by error 6000: 'Error OS signal received(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2010-06-01 04:37:22 [MgmtSrvr] ALERT    -- Node 4: Forced node shutdown completed. Initiated by signal 6. Caused by error 6000: 'Error OS signal received(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
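
For anyone triaging a similar cluster log, the sequence above (missed heartbeats on one node, followed by forced shutdowns of every data node) can be pulled out mechanically. The following is only a minimal sketch, assuming log lines have exactly the shape pasted above; the function name `summarize` and the regular expressions are illustrative and are not part of any MySQL Cluster tool.

```python
import re

# Assumed line shape, taken from the log pasted above:
# "<date> <time> [MgmtSrvr] <LEVEL> -- Node <reporter>: <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[MgmtSrvr\] "
    r"(?P<level>\w+)\s+-- Node (?P<reporter>\d+): (?P<msg>.*)$"
)
MISSED_RE = re.compile(r"^Node (?P<node>\d+) missed heartbeat (?P<count>\d+)$")
SHUTDOWN_RE = re.compile(
    r"^Forced node shutdown completed\. Initiated by signal (?P<sig>\d+)"
)


def summarize(lines):
    """Return (missed, shutdowns) extracted from cluster-log lines.

    missed    -- dict: node id -> highest missed-heartbeat count reported
    shutdowns -- list of (timestamp, node id, signal number) tuples
    """
    missed = {}
    shutdowns = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip lines that do not match the assumed shape
        msg = m["msg"]
        hb = MISSED_RE.match(msg)
        if hb:
            node = int(hb["node"])
            missed[node] = max(missed.get(node, 0), int(hb["count"]))
            continue
        sd = SHUTDOWN_RE.match(msg)
        if sd:
            # The node that shut down is the one named before the colon.
            shutdowns.append((m["ts"], int(m["reporter"]), int(sd["sig"])))
    return missed, shutdowns
```

Feeding the log above through `summarize` would show node 4 as the only node with missed heartbeats, while all four data nodes report a forced shutdown by signal 6 within the same second.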

How to repeat:
I do not know what causes this. The cluster runs fine for weeks, then the shutdown happens at random.
[1 Jun 2010 9:21] Jonas Oreland
We need logs, e.g. collected using ndb_error_reporter.
[8 Apr 2011 18:48] Yves Trudeau
ndb_error_report for a similar issue.

Attachment: ndb_error_report_20110408110653.tar.bz2.gz (application/x-gzip, text), 395.15 KiB.

[8 Apr 2011 18:50] Yves Trudeau
Hi,
  I just found a crashed cluster with a similar error. I uploaded the ndb_error_report file; don't laugh at the combined bz2 and gz compression, it was needed to get the file under 500 KB. The setup that crashed is a bit special: it runs on EC2 with NoOfReplicas = 3. It was not especially loaded when it crashed.