Bug #21778 Network glitch causes incorrect node to shutdown
Submitted: 22 Aug 2006 9:38
Modified: 23 Aug 2006 10:09
Reporter: Lars Torstensson
Status: Not a Bug
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S2 (Serious)
Version: MySQL cluster 5.0.24
OS: Linux (Redhat Linux 2.6.9-34.ELsmp #1 )
Assigned to: Jonas Oreland
CPU Architecture: Any

[22 Aug 2006 9:38] Lars Torstensson
Description:
We have a 6-node 3-replica cluster.

When I temporarily remove the network cable from node 6 (for about 4 seconds), node 4 ends up being declared dead by node 6.

The ArbitrationTimeout is set to 6000ms in config.ini
ArbitrationTimeout: 6000

Node 4 Error log
Time: Tuesday 22 August 2006 - 10:28:53
Status: Unknown
Message: No message slogan found (please report a bug if you get this error code) (Unknown)
Error: 0
Error data: We(4) have been declared dead by 6 reason: Hearbeat failure(4)
Error object: QMGR (Line: 2840) 0x0000000e
Program: /opt/mysqlcluster/libexec/ndbd
Pid: 2539
Trace: /var/db/6-nodes/log/ndb_4_trace.log.10
Version: Version 5.0.24

Cluster log
Aug 22 10:27:13 nl2-db4 NDB[16300]: [MgmSrvr] Node 1: Local checkpoint 7244 started. Keep GCI = 91335 oldest restorable GCI = 91346
Aug 22 10:28:50 nl2-db4 NDB[16300]: [MgmSrvr] Node 1: Node 6 missed heartbeat 2
Aug 22 10:28:52 nl2-db4 NDB[16300]: [MgmSrvr] Node 7: Node 6 Connected
Aug 22 10:28:53 nl2-db4 NDB[16300]: [MgmSrvr] Node 1: Communication to Node 4 closed
Aug 22 10:28:53 nl2-db4 NDB[16300]: [MgmSrvr] Node 3: Communication to Node 4 closed
Aug 22 10:28:53 nl2-db4 NDB[16300]: [MgmSrvr] Node 2: Communication to Node 4 closed
Aug 22 10:28:53 nl2-db4 NDB[16300]: [MgmSrvr] Node 5: Communication to Node 4 closed
Aug 22 10:28:53 nl2-db4 NDB[16300]: [MgmSrvr] Node 1: Node 4 Disconnected
Aug 22 10:28:53 nl2-db4 NDB[16300]: [MgmSrvr] Node 1: Communication to Node 4 closed
Aug 22 10:28:53 nl2-db4 NDB[16300]: [MgmSrvr] Node 3: Node 4 Disconnected
Aug 22 10:28:53 nl2-db4 NDB[16300]: [MgmSrvr] Node 3: Communication to Node 4 closed
Aug 22 10:28:53 nl2-db4 NDB[16300]: [MgmSrvr] Node 5: Node 4 Disconnected
Aug 22 10:28:53 nl2-db4 NDB[16300]: [MgmSrvr] Node 5: Communication to Node 4 closed
Aug 22 10:28:53 nl2-db4 NDB[16300]: [MgmSrvr] Node 2: Node 4 Disconnected
Aug 22 10:28:53 nl2-db4 NDB[16300]: [MgmSrvr] Node 2: Communication to Node 4 closed
Aug 22 10:28:53 nl2-db4 NDB[16300]: [MgmSrvr] Node 7: Node 4 Connected
Aug 22 10:28:53 nl2-db4 NDB[16300]: [MgmSrvr] Node 4: Forced node shutdown completed. Initiated by signal 6.
Aug 22 10:28:54 nl2-db4 NDB[16300]: [MgmSrvr] Node 1: Arbitration check won - node group majority
Aug 22 10:28:54 nl2-db4 NDB[16300]: [MgmSrvr] Node 1: President restarts arbitration thread [state=6]
Aug 22 10:28:54 nl2-db4 NDB[16300]: [MgmSrvr] Node 1: DICT: lock bs: 0 ops: 0 poll: 0 cnt: 0 queue: 
Aug 22 10:28:57 nl2-db4 NDB[16300]: [MgmSrvr] Node 1: Communication to Node 4 opened
Aug 22 10:28:57 nl2-db4 NDB[16300]: [MgmSrvr] Node 3: Communication to Node 4 opened
Aug 22 10:28:57 nl2-db4 NDB[16300]: [MgmSrvr] Node 2: Communication to Node 4 opened
Aug 22 10:28:58 nl2-db4 NDB[16300]: [MgmSrvr] Node 5: Communication to Node 4 opened
Aug 22 10:29:09 nl2-db4 NDB[16300]: [MgmSrvr] Node 7: Node 6 Connected
Aug 22 10:29:17 nl2-db4 NDB[16300]: [MgmSrvr] Node 1: Local checkpoint 7245 started. Keep GCI = 91350 oldest restorable GCI = 91361
Aug 22 10:31:30 nl2-db4 NDB[16300]: [MgmSrvr] Node 1: Local checkpoint 7246 started. Keep GCI = 91362 oldest restorable GCI = 91373
Aug 22 10:33:34 nl2-db4 NDB[16300]: [MgmSrvr] Node 1: Local checkpoint 7247 started. Keep GCI = 91375 oldest restorable GCI = 91386

How to repeat:
Temporarily remove the network cable (for about 4 seconds) with ArbitrationTimeout set to a value larger than the duration of the glitch.

Suggested fix:
If you have to kill something, kill node 6 instead; it's the one with the network problem.
[22 Aug 2006 10:36] Lars Torstensson
I have uploaded bug-data-21778.tar.gz to the FTP.
[23 Aug 2006 2:02] Jonas Oreland
Hm...

I think the only way to have the other node killed
is to decrease HeartbeatIntervalDbDb.

Is that a good-enough solution?
[23 Aug 2006 8:05] Lars Torstensson
Yes it is.

/Lars
[23 Aug 2006 10:09] Jonas Oreland
Then I'll close this as not a bug.
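
For anyone weighing the HeartbeatIntervalDbDb suggestion in this thread, here is a rough sketch of how the interval bounds failure-detection time. It assumes, as an illustration that is not taken from this report, that a data node is declared dead after three consecutive missed heartbeat intervals, so detection lands somewhere between three and four intervals after the last heartbeat was received. The real limit may differ between versions, so it is left as a parameter. The function names are hypothetical, and the sketch does not model which node ends up being shut down.

```python
from typing import Tuple


def detection_window_ms(heartbeat_interval_ms: int, missed_limit: int = 3) -> Tuple[int, int]:
    """Return (earliest, latest) silence in ms after which a peer may be declared dead.

    ASSUMPTION: a peer is declared dead after `missed_limit` consecutive
    missed heartbeat intervals. Because the silence can begin anywhere
    inside an interval, detection falls between `missed_limit` and
    `missed_limit + 1` intervals.
    """
    if heartbeat_interval_ms <= 0 or missed_limit <= 0:
        raise ValueError("heartbeat_interval_ms and missed_limit must be positive")
    earliest = missed_limit * heartbeat_interval_ms
    latest = (missed_limit + 1) * heartbeat_interval_ms
    return earliest, latest


def glitch_outcome(glitch_ms: int, heartbeat_interval_ms: int, missed_limit: int = 3) -> str:
    """Classify a network glitch of `glitch_ms` against the detection window."""
    earliest, latest = detection_window_ms(heartbeat_interval_ms, missed_limit)
    if glitch_ms < earliest:
        return "tolerated"          # shorter than any possible detection time
    if glitch_ms >= latest:
        return "detected"           # longer than the worst-case detection time
    return "timing-dependent"       # depends on where in the interval it began


if __name__ == "__main__":
    # Compare a ~4 s glitch (as in this report) against a few candidate intervals.
    for interval in (1500, 1000, 500):
        window = detection_window_ms(interval)
        print(interval, window, glitch_outcome(4000, interval))
```

Under these assumptions, lowering the interval shrinks the window in which a silent node goes unnoticed, at the cost of less tolerance for short network glitches.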