Bug #24739 Node crash after NIC watchdog timeout
Submitted: 30 Nov 2006 23:12 Modified: 1 Dec 2006 11:21
Reporter: Adam Dixon (Candidate Quality Contributor) Email Updates:
Status: Not a Bug Impact on me:
None 
Category:MySQL Cluster: Cluster (NDB) storage engine Severity:S3 (Non-critical)
Version:5.0.21 OS:Linux (CentOS4)
Assigned to: Jonas Oreland CPU Architecture:Any
Tags: cluster, timeout, watchdog

[30 Nov 2006 23:12] Adam Dixon
Description:
Just thought I would post this for someone to oversee, as the operation of the cluster was fine, however the ndbd still crashed with trace (instead of handled).

For some reason (I have not been able to determine) the NIC on one of our nodes was timed out by watchdog; DMESG reported this;
NETDEV WATCHDOG: eth1: transmit timed out

Restarting the interface did not fix networking, however rebooting did. Just FYI.

Thats fine, the cluster heartbeat failed this node and life went on. Ill provide the logs and attach trace if someone wants to investigate this further.

How to repeat:
Either hardware failure of NIC, or some sort of software driver bug. (The cluster was at the lowest load period 7am)
[30 Nov 2006 23:13] Adam Dixon
--ndb-error.log--
Time: Friday 1 December 2006 - 07:37:22
Status: Unknown
Message: No message slogan found (please report a bug if you get this error code) (Unknown)
Error: 0
Error data: We(14) have been declared dead by 18 reason: Hearbeat failure(4)
Error object: QMGR (Line: 2840) 0x0000000e
Program: ndbd
Pid: 3647
Trace: /data/mysql-cluster/ndbd/ndb_14_trace.log.4
Version: Version 5.0.21
***EOM***

--ndbd-out.log--

2006-12-01 07:37:22 [ndbd] INFO     -- Error handler shutting down system
2006-12-01 07:37:23 [ndbd] INFO     -- Error handler shutdown completed - exiting
2006-12-01 07:37:23 [ndbd] ALERT    -- Node 14: Forced node shutdown completed. Initiated by signal 0.
2006-12-01 07:43:41 [ndbd] WARNING  -- Unable to report shutdown reason to 10.10.250.1:1186: Could not connect to socket : Unable to connect with connect string: nodeid=0,10.10.250.1:1186

--ndb_mgm.log--

2006-12-01 07:36:50 [MgmSrvr] INFO     -- Node 13: Local checkpoint 66999 started. Keep GCI = 10562746 oldest restorable GCI = 10562862
2006-12-01 07:37:19 [MgmSrvr] WARNING  -- Node 16: Node 14 missed heartbeat 2
2006-12-01 07:37:20 [MgmSrvr] WARNING  -- Node 16: Node 14 missed heartbeat 3
2006-12-01 07:37:21 [MgmSrvr] INFO     -- Node 1: Node 14 Connected
2006-12-01 07:37:22 [MgmSrvr] WARNING  -- Node 16: Node 14 missed heartbeat 4
2006-12-01 07:37:22 [MgmSrvr] ALERT    -- Node 16: Node 14 declared dead due to missed heartbeat
2006-12-01 07:37:22 [MgmSrvr] INFO     -- Node 13: Communication to Node 14 closed
2006-12-01 07:37:22 [MgmSrvr] INFO     -- Node 16: Communication to Node 14 closed
2006-12-01 07:37:22 [MgmSrvr] INFO     -- Node 18: Communication to Node 14 closed
2006-12-01 07:37:22 [MgmSrvr] INFO     -- Node 17: Communication to Node 14 closed
2006-12-01 07:37:22 [MgmSrvr] INFO     -- Node 11: Communication to Node 14 closed
2006-12-01 07:37:22 [MgmSrvr] INFO     -- Node 12: Communication to Node 14 closed
2006-12-01 07:37:22 [MgmSrvr] ALERT    -- Node 18: Node 14 Disconnected
2006-12-01 07:37:22 [MgmSrvr] INFO     -- Node 18: Communication to Node 14 closed
2006-12-01 07:37:22 [MgmSrvr] INFO     -- Node 15: Communication to Node 14 closed
2006-12-01 07:37:22 [MgmSrvr] ALERT    -- Node 13: Arbitration check won - node group majority
2006-12-01 07:37:22 [MgmSrvr] INFO     -- Node 13: President restarts arbitration thread [state=6]
2006-12-01 07:37:22 [MgmSrvr] ALERT    -- Node 13: Node 14 Disconnected
2006-12-01 07:37:22 [MgmSrvr] ALERT    -- Node 12: Node 14 Disconnected
2006-12-01 07:37:22 [MgmSrvr] ALERT    -- Node 11: Node 14 Disconnected
2006-12-01 07:37:22 [MgmSrvr] ALERT    -- Node 16: Node 14 Disconnected
2006-12-01 07:37:22 [MgmSrvr] ALERT    -- Node 15: Node 14 Disconnected
2006-12-01 07:37:22 [MgmSrvr] ALERT    -- Node 17: Node 14 Disconnected
2006-12-01 07:37:22 [MgmSrvr] ALERT    -- Node 11: Node 2 Disconnected
2006-12-01 07:37:22 [MgmSrvr] INFO     -- Node 11: Communication to Node 2 closed
2006-12-01 07:37:22 [MgmSrvr] ALERT    -- Node 12: Node 2 Disconnected
2006-12-01 07:37:22 [MgmSrvr] INFO     -- Node 12: Communication to Node 2 closed
2006-12-01 07:37:22 [MgmSrvr] ALERT    -- Node 13: Node 2 Disconnected
2006-12-01 07:37:22 [MgmSrvr] INFO     -- Node 13: Communication to Node 2 closed
2006-12-01 07:37:22 [MgmSrvr] ALERT    -- Node 15: Node 2 Disconnected
2006-12-01 07:37:22 [MgmSrvr] INFO     -- Node 15: Communication to Node 2 closed
2006-12-01 07:37:22 [MgmSrvr] ALERT    -- Node 16: Node 2 Disconnected
2006-12-01 07:37:22 [MgmSrvr] INFO     -- Node 16: Communication to Node 2 closed
2006-12-01 07:37:22 [MgmSrvr] ALERT    -- Node 17: Node 2 Disconnected
2006-12-01 07:37:22 [MgmSrvr] INFO     -- Node 17: Communication to Node 2 closed
2006-12-01 07:37:22 [MgmSrvr] ALERT    -- Node 18: Node 2 Disconnected
2006-12-01 07:37:22 [MgmSrvr] INFO     -- Node 18: Communication to Node 2 closed
2006-12-01 07:37:22 [MgmSrvr] INFO     -- Mgmt server state: nodeid 2 freed, m_reserved_nodes 000000000000000a.
2006-12-01 07:37:23 [MgmSrvr] INFO     -- Mgmt server state: nodeid 2 reserved for ip 10.10.250.1, m_reserved_nodes 000000000000000e.
2006-12-01 07:37:23 [MgmSrvr] INFO     -- Node 2: mysqld --server-id=-889029611
2006-12-01 07:37:26 [MgmSrvr] INFO     -- Node 17: Communication to Node 2 opened
2006-12-01 07:37:26 [MgmSrvr] INFO     -- Node 18: Communication to Node 2 opened
2006-12-01 07:37:26 [MgmSrvr] INFO     -- Node 13: Communication to Node 2 opened
2006-12-01 07:37:26 [MgmSrvr] INFO     -- Node 11: Communication to Node 2 opened
2006-12-01 07:37:26 [MgmSrvr] INFO     -- Node 12: Communication to Node 2 opened
2006-12-01 07:37:26 [MgmSrvr] INFO     -- Node 15: Communication to Node 2 opened
2006-12-01 07:37:26 [MgmSrvr] INFO     -- Node 16: Communication to Node 2 opened
2006-12-01 07:38:20 [MgmSrvr] WARNING  -- Node 12: Failure handling of node 14 has not completed in 1 min. - state = 3
2006-12-01 07:38:20 [MgmSrvr] WARNING  -- Node 15: Failure handling of node 14 has not completed in 1 min. - state = 3
2006-12-01 07:38:21 [MgmSrvr] WARNING  -- Node 16: Failure handling of node 14 has not completed in 1 min. - state = 3
2006-12-01 07:38:21 [MgmSrvr] WARNING  -- Node 17: Failure handling of node 14 has not completed in 1 min. - state = 3
2006-12-01 07:38:21 [MgmSrvr] WARNING  -- Node 18: Failure handling of node 14 has not completed in 1 min. - state = 3
2006-12-01 07:38:21 [MgmSrvr] WARNING  -- Node 13: Failure handling of node 14 has not completed in 1 min. - state = 3
2006-12-01 07:38:21 [MgmSrvr] WARNING  -- Node 11: Failure handling of node 14 has not completed in 1 min. - state = 3
2006-12-01 07:38:28 [MgmSrvr] INFO     -- Node 16: Communication to Node 14 opened
2006-12-01 07:38:28 [MgmSrvr] INFO     -- Node 17: Communication to Node 14 opened
2006-12-01 07:38:28 [MgmSrvr] INFO     -- Node 18: Communication to Node 14 opened
2006-12-01 07:38:28 [MgmSrvr] INFO     -- Node 13: Communication to Node 14 opened
2006-12-01 07:38:28 [MgmSrvr] INFO     -- Node 11: Communication to Node 14 opened
2006-12-01 07:38:28 [MgmSrvr] INFO     -- Node 12: Communication to Node 14 opened
2006-12-01 07:38:29 [MgmSrvr] INFO     -- Node 15: Communication to Node 14 opened
2006-12-01 07:38:59 [MgmSrvr] INFO     -- Node 17: Node 2 Connected
2006-12-01 07:38:59 [MgmSrvr] INFO     -- Node 18: Node 2 Connected
2006-12-01 07:38:59 [MgmSrvr] INFO     -- Node 15: Node 2 Connected
2006-12-01 07:38:59 [MgmSrvr] INFO     -- Node 15: Node 2: API version 5.0.21
2006-12-01 07:38:59 [MgmSrvr] INFO     -- Node 16: Node 2 Connected
2006-12-01 07:38:59 [MgmSrvr] INFO     -- Node 16: Node 2: API version 5.0.21
2006-12-01 07:38:59 [MgmSrvr] INFO     -- Node 17: Node 2: API version 5.0.21
2006-12-01 07:38:59 [MgmSrvr] INFO     -- Node 18: Node 2: API version 5.0.21
2006-12-01 07:38:59 [MgmSrvr] INFO     -- Node 12: Node 2 Connected
2006-12-01 07:38:59 [MgmSrvr] INFO     -- Node 11: Node 2 Connected
2006-12-01 07:38:59 [MgmSrvr] INFO     -- Node 11: Node 2: API version 5.0.21
2006-12-01 07:38:59 [MgmSrvr] INFO     -- Node 13: Node 2 Connected
2006-12-01 07:38:59 [MgmSrvr] INFO     -- Node 13: Node 2: API version 5.0.21
2006-12-01 07:38:59 [MgmSrvr] INFO     -- Node 12: Node 2: API version 5.0.21
2006-12-01 07:48:00 [MgmSrvr] INFO     -- Node 13: Local checkpoint 67000 started. Keep GCI = 10562886 oldest restorable GCI = 10562896
2006-12-01 07:52:37 [MgmSrvr] INFO     -- Node 13: Local checkpoint 67001 started. Keep GCI = 10563271 oldest restorable GCI = 10563387
[30 Nov 2006 23:14] Adam Dixon
Category - Cluster
[1 Dec 2006 11:21] Jonas Oreland
hi

i read the trace files and everything look ok.

the error+trace will be generated in all cases expect graceful shutdown, so this is ok

/jonas