Description:
The cluster has one management node, four data nodes in two node groups, and two SQL nodes. It works well, except that occasionally the management node reports that one data node missed two heartbeats, and then the entire cluster shuts down:
2010-06-01 03:29:53 [MgmtSrvr] INFO -- Node 1: Local checkpoint 542 completed
2010-06-01 04:25:11 [MgmtSrvr] INFO -- Node 1: Local checkpoint 543 started. Keep GCI = 2593229 oldest restorable GCI = 2593726
2010-06-01 04:37:19 [MgmtSrvr] WARNING -- Node 1: Node 4 missed heartbeat 2
2010-06-01 04:37:20 [MgmtSrvr] WARNING -- Node 1: Node 4 missed heartbeat 3
2010-06-01 04:37:21 [MgmtSrvr] ALERT -- Node 40: Node 4 Disconnected
2010-06-01 04:37:22 [MgmtSrvr] INFO -- Node 1: Communication to Node 42 closed
2010-06-01 04:37:22 [MgmtSrvr] INFO -- Node 1: Communication to Node 43 closed
2010-06-01 04:37:22 [MgmtSrvr] INFO -- Node 1: Communication to Node 47 closed
2010-06-01 04:37:22 [MgmtSrvr] INFO -- Node 1: Communication to Node 48 closed
2010-06-01 04:37:22 [MgmtSrvr] INFO -- Node 2: Communication to Node 42 closed
2010-06-01 04:37:22 [MgmtSrvr] INFO -- Node 2: Communication to Node 43 closed
2010-06-01 04:37:22 [MgmtSrvr] INFO -- Node 2: Communication to Node 47 closed
2010-06-01 04:37:22 [MgmtSrvr] INFO -- Node 2: Communication to Node 48 closed
2010-06-01 04:37:22 [MgmtSrvr] INFO -- Node 3: Communication to Node 42 closed
2010-06-01 04:37:22 [MgmtSrvr] INFO -- Node 3: Communication to Node 43 closed
2010-06-01 04:37:22 [MgmtSrvr] INFO -- Node 3: Communication to Node 47 closed
2010-06-01 04:37:22 [MgmtSrvr] INFO -- Node 3: Communication to Node 48 closed
2010-06-01 04:37:22 [MgmtSrvr] ALERT -- Node 40: Node 4 Disconnected
2010-06-01 04:37:22 [MgmtSrvr] ALERT -- Node 3: Node 42 Disconnected
2010-06-01 04:37:22 [MgmtSrvr] ALERT -- Node 3: Node 43 Disconnected
2010-06-01 04:37:22 [MgmtSrvr] ALERT -- Node 3: Node 47 Disconnected
2010-06-01 04:37:22 [MgmtSrvr] ALERT -- Node 3: Node 48 Disconnected
2010-06-01 04:37:22 [MgmtSrvr] ALERT -- Node 1: Node 42 Disconnected
2010-06-01 04:37:22 [MgmtSrvr] ALERT -- Node 1: Node 43 Disconnected
2010-06-01 04:37:22 [MgmtSrvr] ALERT -- Node 1: Node 47 Disconnected
2010-06-01 04:37:22 [MgmtSrvr] ALERT -- Node 1: Node 48 Disconnected
2010-06-01 04:37:22 [MgmtSrvr] ALERT -- Node 40: Node 1 Disconnected
2010-06-01 04:37:22 [MgmtSrvr] ALERT -- Node 40: Node 3 Disconnected
2010-06-01 04:37:22 [MgmtSrvr] ALERT -- Node 3: Forced node shutdown completed. Initiated by signal 6. Caused by error 6000: 'Error OS signal received(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2010-06-01 04:37:22 [MgmtSrvr] ALERT -- Node 40: Node 2 Disconnected
2010-06-01 04:37:22 [MgmtSrvr] ALERT -- Node 2: Forced node shutdown completed. Initiated by signal 6. Caused by error 6000: 'Error OS signal received(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2010-06-01 04:37:22 [MgmtSrvr] ALERT -- Node 1: Forced node shutdown completed. Initiated by signal 6. Caused by error 6000: 'Error OS signal received(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2010-06-01 04:37:22 [MgmtSrvr] ALERT -- Node 4: Forced node shutdown completed. Initiated by signal 6. Caused by error 6000: 'Error OS signal received(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
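To compare incidents across several weeks of logs, the missed-heartbeat and forced-shutdown entries can be pulled out with a short script. The following is only a minimal sketch: the function name `summarize` and the regular expressions are assumptions derived from the line layout in the excerpt above, not from any documented log format.

```python
import re
from collections import defaultdict

# Assumed layout, taken from the excerpt above:
# "<date> <time> [MgmtSrvr] <LEVEL> -- Node <id>: <message>"
LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[MgmtSrvr\] "
    r"(?P<level>\w+) -- Node (?P<reporter>\d+): (?P<msg>.*)$"
)
MISSED = re.compile(r"^Node (?P<node>\d+) missed heartbeat (?P<count>\d+)$")
SHUTDOWN = re.compile(
    r"^Forced node shutdown completed\..*Caused by error (?P<code>\d+)"
)


def summarize(lines):
    """Return (missed, shutdowns) extracted from cluster log lines.

    missed:    {node_id: [(timestamp, heartbeat_counter), ...]}
    shutdowns: [(timestamp, node_id, error_code), ...]
    """
    missed = defaultdict(list)
    shutdowns = []
    for raw in lines:
        m = LINE.match(raw.strip())
        if not m:
            continue  # skip lines that do not follow the assumed layout
        msg = m.group("msg")
        hb = MISSED.match(msg)
        if hb:
            missed[int(hb.group("node"))].append(
                (m.group("ts"), int(hb.group("count")))
            )
            continue
        sd = SHUTDOWN.match(msg)
        if sd:
            shutdowns.append(
                (m.group("ts"), int(m.group("reporter")), int(sd.group("code")))
            )
    return dict(missed), shutdowns
```

It could be called as `summarize(open(path))`, where `path` is the management node's cluster log file (a placeholder here, since the report does not give the file location).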
How to repeat:
I do not know what triggers this. The cluster runs fine for weeks, and then the shutdown occurs, seemingly at random.