MySQL Bugs: #21778: Network glitch causes incorrect node to shutdown

Bug #21778	Network glitch causes incorrect node to shutdown
Submitted:	22 Aug 2006 9:38	Modified:	23 Aug 2006 10:09
Reporter:	Lars Torstensson
Status:	Not a Bug
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S2 (Serious)
Version:	MySQL cluster 5.0.24	OS:	Linux (Redhat Linux 2.6.9-34.ELsmp #1 )
Assigned to:	Jonas Oreland	CPU Architecture:	Any

Description:
We have a 6-node 3-replica cluster.

When I temporarily remove (~4 sec) the network cable from node 6, node 4 ends up being declared dead by node 6.

The ArbitrationTimeout is set to 6000ms in config.ini
ArbitrationTimeout: 6000

Node 4 Error log
Time: Tuesday 22 August 2006 - 10:28:53
Status: Unknown
Message: No message slogan found (please report a bug if you get this error code) (Unknown)
Error: 0
Error data: We(4) have been declared dead by 6 reason: Hearbeat failure(4)
Error object: QMGR (Line: 2840) 0x0000000e
Program: /opt/mysqlcluster/libexec/ndbd
Pid: 2539
Trace: /var/db/6-nodes/log/ndb_4_trace.log.10
Version: Version 5.0.24

Cluster log
Aug 22 10:27:13 nl2-db4 NDB[16300]: [MgmSrvr] Node 1: Local checkpoint 7244 started. Keep GCI = 91335 oldest restorable GCI = 91346
Aug 22 10:28:50 nl2-db4 NDB[16300]: [MgmSrvr] Node 1: Node 6 missed heartbeat 2
Aug 22 10:28:52 nl2-db4 NDB[16300]: [MgmSrvr] Node 7: Node 6 Connected
Aug 22 10:28:53 nl2-db4 NDB[16300]: [MgmSrvr] Node 1: Communication to Node 4 closed
Aug 22 10:28:53 nl2-db4 NDB[16300]: [MgmSrvr] Node 3: Communication to Node 4 closed
Aug 22 10:28:53 nl2-db4 NDB[16300]: [MgmSrvr] Node 2: Communication to Node 4 closed
Aug 22 10:28:53 nl2-db4 NDB[16300]: [MgmSrvr] Node 5: Communication to Node 4 closed
Aug 22 10:28:53 nl2-db4 NDB[16300]: [MgmSrvr] Node 1: Node 4 Disconnected
Aug 22 10:28:53 nl2-db4 NDB[16300]: [MgmSrvr] Node 1: Communication to Node 4 closed
Aug 22 10:28:53 nl2-db4 NDB[16300]: [MgmSrvr] Node 3: Node 4 Disconnected
Aug 22 10:28:53 nl2-db4 NDB[16300]: [MgmSrvr] Node 3: Communication to Node 4 closed
Aug 22 10:28:53 nl2-db4 NDB[16300]: [MgmSrvr] Node 5: Node 4 Disconnected
Aug 22 10:28:53 nl2-db4 NDB[16300]: [MgmSrvr] Node 5: Communication to Node 4 closed
Aug 22 10:28:53 nl2-db4 NDB[16300]: [MgmSrvr] Node 2: Node 4 Disconnected
Aug 22 10:28:53 nl2-db4 NDB[16300]: [MgmSrvr] Node 2: Communication to Node 4 closed
Aug 22 10:28:53 nl2-db4 NDB[16300]: [MgmSrvr] Node 7: Node 4 Connected
Aug 22 10:28:53 nl2-db4 NDB[16300]: [MgmSrvr] Node 4: Forced node shutdown completed. Initiated by signal 6.
Aug 22 10:28:54 nl2-db4 NDB[16300]: [MgmSrvr] Node 1: Arbitration check won - node group majority
Aug 22 10:28:54 nl2-db4 NDB[16300]: [MgmSrvr] Node 1: President restarts arbitration thread [state=6]
Aug 22 10:28:54 nl2-db4 NDB[16300]: [MgmSrvr] Node 1: DICT: lock bs: 0 ops: 0 poll: 0 cnt: 0 queue: 
Aug 22 10:28:57 nl2-db4 NDB[16300]: [MgmSrvr] Node 1: Communication to Node 4 opened
Aug 22 10:28:57 nl2-db4 NDB[16300]: [MgmSrvr] Node 3: Communication to Node 4 opened
Aug 22 10:28:57 nl2-db4 NDB[16300]: [MgmSrvr] Node 2: Communication to Node 4 opened
Aug 22 10:28:58 nl2-db4 NDB[16300]: [MgmSrvr] Node 5: Communication to Node 4 opened
Aug 22 10:29:09 nl2-db4 NDB[16300]: [MgmSrvr] Node 7: Node 6 Connected
Aug 22 10:29:17 nl2-db4 NDB[16300]: [MgmSrvr] Node 1: Local checkpoint 7245 started. Keep GCI = 91350 oldest restorable GCI = 91361
Aug 22 10:31:30 nl2-db4 NDB[16300]: [MgmSrvr] Node 1: Local checkpoint 7246 started. Keep GCI = 91362 oldest restorable GCI = 91373
Aug 22 10:33:34 nl2-db4 NDB[16300]: [MgmSrvr] Node 1: Local checkpoint 7247 started. Keep GCI = 91375 oldest restorable GCI = 91386
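
To make the sequence above easier to follow, the cluster log can be grouped by the node each message is about rather than the node reporting it. The following is a minimal sketch only; the regular expressions are assumptions based on the syslog-style lines shown in this report, and the function name `events_by_subject` is hypothetical, not part of any MySQL Cluster tool.

```python
import re
from collections import defaultdict

# Assumed line shape, taken from the excerpt in this report:
# Aug 22 10:28:53 nl2-db4 NDB[16300]: [MgmSrvr] Node 1: Node 4 Disconnected
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+NDB\[\d+\]:\s+"
    r"\[MgmSrvr\]\s+Node (?P<reporter>\d+):\s+(?P<msg>.*)$"
)
# First "Node <id>" mentioned inside the message text, if any.
SUBJECT_RE = re.compile(r"\bNode (\d+)\b")


def events_by_subject(lines):
    """Group messages by the node they mention (hypothetical helper).

    Returns {subject_node_id: [(timestamp, reporting_node_id, message), ...]}.
    Lines that do not match the assumed format, or whose message names
    no node, are skipped.
    """
    grouped = defaultdict(list)
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        subject = SUBJECT_RE.search(m.group("msg"))
        if subject:
            grouped[int(subject.group(1))].append(
                (m.group("ts"), int(m.group("reporter")), m.group("msg"))
            )
    return dict(grouped)


if __name__ == "__main__":
    sample = [
        "Aug 22 10:28:50 nl2-db4 NDB[16300]: [MgmSrvr] Node 1: Node 6 missed heartbeat 2",
        "Aug 22 10:28:53 nl2-db4 NDB[16300]: [MgmSrvr] Node 1: Communication to Node 4 closed",
        "Aug 22 10:28:53 nl2-db4 NDB[16300]: [MgmSrvr] Node 1: Node 4 Disconnected",
    ]
    for node, events in sorted(events_by_subject(sample).items()):
        print(node, events)
```

Grouped this way, the excerpt shows one "missed heartbeat" report about node 6, followed three seconds later by every surviving data node closing communication to node 4.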

How to repeat:
Temporarily remove (~4 sec) the network cable from a data node, with ArbitrationTimeout set to a value larger than the duration of the glitch.

Suggested fix:
If you have to kill something, kill node 6 instead; it's the one with the network problem.

I have uploaded bug-data-21778.tar.gz to the FTP.

Hm...

I think the only way to have the other node killed
  is to decrease HeartbeatIntervalDbDb.

Is that a good enough solution?

Yes it is.

/Lars

Then I'll close this as not a bug.
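
For anyone sizing HeartbeatIntervalDbDb against a glitch like the one reported here, the arithmetic can be sketched as below. This is a rough illustration under stated assumptions, not a description of the QMGR implementation: it assumes a data node is declared dead after at most about four heartbeat intervals and that the default interval is 1500 ms, which is how I recall the 5.0 reference manual describing it. Check both figures against the manual for your version. The function names are hypothetical.

```python
# ASSUMPTION: a node is declared dead after at most this many heartbeat
# intervals. Verify against the reference manual for your NDB version.
ASSUMED_INTERVALS_UNTIL_DEAD = 4


def detection_window_ms(heartbeat_interval_ms,
                        intervals_until_dead=ASSUMED_INTERVALS_UNTIL_DEAD):
    """Upper bound (ms) on the time to detect a failed node via heartbeats."""
    return heartbeat_interval_ms * intervals_until_dead


def max_interval_for_glitch(glitch_ms,
                            intervals_until_dead=ASSUMED_INTERVALS_UNTIL_DEAD):
    """Largest whole-millisecond interval whose detection window is
    strictly shorter than the glitch."""
    return (glitch_ms - 1) // intervals_until_dead


if __name__ == "__main__":
    # Assumed default interval of 1500 ms: window is longer than a ~4 s glitch.
    print(detection_window_ms(1500))      # 6000
    # Interval needed for the window to fit inside a 4000 ms glitch.
    print(max_interval_for_glitch(4000))  # 999
```

Under these assumptions, a ~4 sec cable pull is shorter than the detection window, so the rest of the cluster may not have declared node 6 dead by the time its network returns. An interval below roughly 1000 ms would close that window, at the cost of being less tolerant of short network delays.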