Description:
A node, achieving a heartbeat failure (most likely due to heavy disk IO - rsync of BACKUP). Was forced to shutdown, but also looks to of crashed and created trace files;
Following from ndb_mgmd.log
2006-05-11 04:06:14 [MgmSrvr] INFO -- Node 18: Local checkpoint 10602 started. Keep GCI = 1532073 oldest restorable GCI = 1532226
2006-05-11 04:12:44 [MgmSrvr] WARNING -- Node 17: Node 18 missed heartbeat 2
2006-05-11 04:12:46 [MgmSrvr] WARNING -- Node 17: Node 18 missed heartbeat 3
2006-05-11 04:12:47 [MgmSrvr] INFO -- Node 1: Node 18 Connected
2006-05-11 04:12:47 [MgmSrvr] WARNING -- Node 17: Node 18 missed heartbeat 4
2006-05-11 04:12:47 [MgmSrvr] ALERT -- Node 17: Node 18 declared dead due to missed heartbeat
2006-05-11 04:12:47 [MgmSrvr] INFO -- Node 17: Communication to Node 18 closed
2006-05-11 04:12:47 [MgmSrvr] INFO -- Node 11: Communication to Node 18 closed
2006-05-11 04:12:47 [MgmSrvr] INFO -- Node 12: Communication to Node 18 closed
2006-05-11 04:12:47 [MgmSrvr] INFO -- Node 13: Communication to Node 18 closed
2006-05-11 04:12:47 [MgmSrvr] INFO -- Node 14: Communication to Node 18 closed
2006-05-11 04:12:47 [MgmSrvr] INFO -- Node 15: Communication to Node 18 closed
2006-05-11 04:12:47 [MgmSrvr] INFO -- Node 16: Communication to Node 18 closed
2006-05-11 04:12:47 [MgmSrvr] ALERT -- Node 17: Arbitration check won - node group majority
2006-05-11 04:12:47 [MgmSrvr] INFO -- Node 17: President restarts arbitration thread [state=6]
2006-05-11 04:12:47 [MgmSrvr] INFO -- Node 17: GCP Take over started
2006-05-11 04:12:47 [MgmSrvr] ALERT -- Node 14: Node 18 Disconnected
2006-05-11 04:12:47 [MgmSrvr] ALERT -- Node 16: Node 18 Disconnected
2006-05-11 04:12:47 [MgmSrvr] ALERT -- Node 12: Node 18 Disconnected
2006-05-11 04:12:47 [MgmSrvr] INFO -- Node 17: LCP Take over started
2006-05-11 04:12:47 [MgmSrvr] INFO -- Node 17: GCP Take over completed
2006-05-11 04:12:47 [MgmSrvr] INFO -- Node 17: LCP Take over completed (state = 5)
2006-05-11 04:12:47 [MgmSrvr] ALERT -- Node 17: Node 18 Disconnected
2006-05-11 04:12:47 [MgmSrvr] ALERT -- Node 11: Node 18 Disconnected
2006-05-11 04:12:47 [MgmSrvr] ALERT -- Node 15: Node 18 Disconnected
2006-05-11 04:12:47 [MgmSrvr] ALERT -- Node 13: Node 18 Disconnected
2006-05-11 04:13:15 [MgmSrvr] INFO -- Node 17: Communication to Node 18 opened
2006-05-11 04:13:15 [MgmSrvr] INFO -- Node 12: Communication to Node 18 opened
2006-05-11 04:13:16 [MgmSrvr] INFO -- Node 13: Communication to Node 18 opened
2006-05-11 04:13:16 [MgmSrvr] INFO -- Node 11: Communication to Node 18 opened
2006-05-11 04:13:16 [MgmSrvr] INFO -- Node 14: Communication to Node 18 opened
2006-05-11 04:13:16 [MgmSrvr] INFO -- Node 16: Communication to Node 18 opened
2006-05-11 04:13:16 [MgmSrvr] INFO -- Node 15: Communication to Node 18 opened
2006-05-11 04:13:16 [MgmSrvr] INFO -- Node 17: Local checkpoint 10603 started. Keep GCI = 1532271 oldest restorable GCI = 1532379
2006-05-11 04:13:28 [MgmSrvr] ALERT -- Node 18: Forced node shutdown completed. Initiated by signal 0. Caused by error 2303: 'System error, node killed during node restart by other node(Internal error, programming error or missing error message, please report a bug). Temporary error, restart n
2006-05-11 04:18:41 [MgmSrvr] INFO -- Node 17: Local checkpoint 10604 started. Keep GCI = 1532397 oldest restorable GCI = 1532553
2006-05-11 04:24:03 [MgmSrvr] INFO -- Node 17: Local checkpoint 10605 started. Keep GCI = 1532570 oldest restorable GCI = 153272
Time: Thursday 11 May 2006 - 04:13:26
Status: Temporary error, restart node
Message: System error, node killed during node restart by other node
(Internal error, programming error or missing error message, please
report a bug)
Error: 2303
Error data: Node 18 killed this node because GCP stop was detected
Error object: NDBCNTR (Line: 197) 0x0000000e
Program: ndbd
Pid: 32168
Trace: /data/mysql-cluster/ndbd/ndb_18_trace.log.4
Version: Version 5.0.21
***EOM***
Missing heartbeats caused the node to fail (or be arbitrarily shutdown) which is fine, Just thought I would provide the data to potentially improve handling.
How to repeat:
The rsync of numerous backups of 1.3gb in size each accross a gigabit ethernet (same interface as cluster traffic) Which inturn caused the heartbeat misfires.