Description:
We have a 6-node 3-replica cluster.
When I temporarily remove the network cable from node 6 (for about 4 seconds), node 4 ends up being declared dead by node 6.
ArbitrationTimeout is set to 6000 ms in config.ini:
ArbitrationTimeout: 6000
Node 4 error log:
Time: Tuesday 22 August 2006 - 10:28:53
Status: Unknown
Message: No message slogan found (please report a bug if you get this error code) (Unknown)
Error: 0
Error data: We(4) have been declared dead by 6 reason: Hearbeat failure(4)
Error object: QMGR (Line: 2840) 0x0000000e
Program: /opt/mysqlcluster/libexec/ndbd
Pid: 2539
Trace: /var/db/6-nodes/log/ndb_4_trace.log.10
Version: Version 5.0.24
Cluster log:
Aug 22 10:27:13 nl2-db4 NDB[16300]: [MgmSrvr] Node 1: Local checkpoint 7244 started. Keep GCI = 91335 oldest restorable GCI = 91346
Aug 22 10:28:50 nl2-db4 NDB[16300]: [MgmSrvr] Node 1: Node 6 missed heartbeat 2
Aug 22 10:28:52 nl2-db4 NDB[16300]: [MgmSrvr] Node 7: Node 6 Connected
Aug 22 10:28:53 nl2-db4 NDB[16300]: [MgmSrvr] Node 1: Communication to Node 4 closed
Aug 22 10:28:53 nl2-db4 NDB[16300]: [MgmSrvr] Node 3: Communication to Node 4 closed
Aug 22 10:28:53 nl2-db4 NDB[16300]: [MgmSrvr] Node 2: Communication to Node 4 closed
Aug 22 10:28:53 nl2-db4 NDB[16300]: [MgmSrvr] Node 5: Communication to Node 4 closed
Aug 22 10:28:53 nl2-db4 NDB[16300]: [MgmSrvr] Node 1: Node 4 Disconnected
Aug 22 10:28:53 nl2-db4 NDB[16300]: [MgmSrvr] Node 1: Communication to Node 4 closed
Aug 22 10:28:53 nl2-db4 NDB[16300]: [MgmSrvr] Node 3: Node 4 Disconnected
Aug 22 10:28:53 nl2-db4 NDB[16300]: [MgmSrvr] Node 3: Communication to Node 4 closed
Aug 22 10:28:53 nl2-db4 NDB[16300]: [MgmSrvr] Node 5: Node 4 Disconnected
Aug 22 10:28:53 nl2-db4 NDB[16300]: [MgmSrvr] Node 5: Communication to Node 4 closed
Aug 22 10:28:53 nl2-db4 NDB[16300]: [MgmSrvr] Node 2: Node 4 Disconnected
Aug 22 10:28:53 nl2-db4 NDB[16300]: [MgmSrvr] Node 2: Communication to Node 4 closed
Aug 22 10:28:53 nl2-db4 NDB[16300]: [MgmSrvr] Node 7: Node 4 Connected
Aug 22 10:28:53 nl2-db4 NDB[16300]: [MgmSrvr] Node 4: Forced node shutdown completed. Initiated by signal 6.
Aug 22 10:28:54 nl2-db4 NDB[16300]: [MgmSrvr] Node 1: Arbitration check won - node group majority
Aug 22 10:28:54 nl2-db4 NDB[16300]: [MgmSrvr] Node 1: President restarts arbitration thread [state=6]
Aug 22 10:28:54 nl2-db4 NDB[16300]: [MgmSrvr] Node 1: DICT: lock bs: 0 ops: 0 poll: 0 cnt: 0 queue:
Aug 22 10:28:57 nl2-db4 NDB[16300]: [MgmSrvr] Node 1: Communication to Node 4 opened
Aug 22 10:28:57 nl2-db4 NDB[16300]: [MgmSrvr] Node 3: Communication to Node 4 opened
Aug 22 10:28:57 nl2-db4 NDB[16300]: [MgmSrvr] Node 2: Communication to Node 4 opened
Aug 22 10:28:58 nl2-db4 NDB[16300]: [MgmSrvr] Node 5: Communication to Node 4 opened
Aug 22 10:29:09 nl2-db4 NDB[16300]: [MgmSrvr] Node 7: Node 6 Connected
Aug 22 10:29:17 nl2-db4 NDB[16300]: [MgmSrvr] Node 1: Local checkpoint 7245 started. Keep GCI = 91350 oldest restorable GCI = 91361
Aug 22 10:31:30 nl2-db4 NDB[16300]: [MgmSrvr] Node 1: Local checkpoint 7246 started. Keep GCI = 91362 oldest restorable GCI = 91373
Aug 22 10:33:34 nl2-db4 NDB[16300]: [MgmSrvr] Node 1: Local checkpoint 7247 started. Keep GCI = 91375 oldest restorable GCI = 91386
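To make it easier to see what each node reported about nodes 4 and 6, the cluster log can be filtered per node. The following is only a rough sketch in Python, not part of NDB or its tools: `parse_cluster_log` and `events_about` are hypothetical helper names, and the regular expression assumes the syslog-style line format shown above.

```python
import re

# Assumed line format (taken from the cluster log above):
#   Aug 22 10:28:50 nl2-db4 NDB[16300]: [MgmSrvr] Node 1: Node 6 missed heartbeat 2
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) \S+ NDB\[\d+\]: "
    r"\[MgmSrvr\] Node (?P<reporter>\d+): (?P<message>.*)$"
)
# First node id mentioned inside the message, e.g. "Node 4 Disconnected".
SUBJECT_RE = re.compile(r"\b[Nn]ode (\d+)\b")


def parse_cluster_log(lines):
    """Return (timestamp, reporter, subject, message) tuples.

    reporter is the node that logged the event; subject is the node the
    message talks about, or None when the message names no node id.
    """
    events = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # not a cluster log entry in the assumed format
        subject = SUBJECT_RE.search(match.group("message"))
        events.append((
            match.group("stamp"),
            int(match.group("reporter")),
            int(subject.group(1)) if subject else None,
            match.group("message"),
        ))
    return events


def events_about(events, node_id):
    """Keep events that were reported by node_id or that mention it."""
    return [e for e in events if e[1] == node_id or e[2] == node_id]


# A few lines copied from the cluster log above.
SAMPLE = [
    "Aug 22 10:28:50 nl2-db4 NDB[16300]: [MgmSrvr] Node 1: Node 6 missed heartbeat 2",
    "Aug 22 10:28:52 nl2-db4 NDB[16300]: [MgmSrvr] Node 7: Node 6 Connected",
    "Aug 22 10:28:53 nl2-db4 NDB[16300]: [MgmSrvr] Node 1: Node 4 Disconnected",
    "Aug 22 10:28:53 nl2-db4 NDB[16300]: [MgmSrvr] Node 4: Forced node shutdown completed. Initiated by signal 6.",
]

if __name__ == "__main__":
    for stamp, reporter, subject, message in events_about(parse_cluster_log(SAMPLE), 4):
        print(f"{stamp}  reported by node {reporter}: {message}")
```

Running the same filter with node id 6 shows that the only events logged about node 6 are one missed heartbeat and two reconnects, while node 4 is the one that gets disconnected and shut down.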
How to repeat:
Set ArbitrationTimeout to a value larger than the planned outage (6000 ms here), then temporarily remove the network cable from one data node (node 6 in this case) for about 4 seconds.
Suggested fix:
If a node has to be killed, kill node 6 instead; it is the one with the network problem.