Description:
Hi,
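We are seeing frequent failures of the MySQL Cluster data node daemon (ndbd): it keeps shutting down on its own after running for a day or two. The data node error log and the management server cluster log from the latest failure follow.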
====================
Time: Saturday 10 September 2011 - 14:35:37
Status: Temporary error, restart node
Message: WatchDog terminate, internal error or massive overload on the machine running this node (Internal error, programming error or missing error message, please report a bug)
Error: 6050
Error data: Job Handling
Error object: WatchDog.cpp
Program: ndbd
Pid: 25466
Version: mysql-5.1.56 ndb-7.1.13
Trace: /home/mysql/data/data1/ndb_11_trace.log.4
***EOM***
====================
2011-09-10 13:39:51 [MgmtSrvr] WARNING -- Node 11: Node 12 missed heartbeat 2
2011-09-10 13:39:52 [MgmtSrvr] WARNING -- Node 11: Node 12 missed heartbeat 3
2011-09-10 13:39:52 [MgmtSrvr] ALERT -- Node 1: Node 12 Disconnected
2011-09-10 13:39:54 [MgmtSrvr] WARNING -- Node 11: Node 12 missed heartbeat 4
2011-09-10 13:39:54 [MgmtSrvr] ALERT -- Node 11: Node 12 declared dead due to missed heartbeat
2011-09-10 13:39:54 [MgmtSrvr] INFO -- Node 11: Communication to Node 12 closed
2011-09-10 13:39:54 [MgmtSrvr] ALERT -- Node 11: Network partitioning - arbitration required
2011-09-10 13:39:54 [MgmtSrvr] INFO -- Node 11: President restarts arbitration thread [state=7]
2011-09-10 13:39:54 [MgmtSrvr] ALERT -- Node 11: Arbitration won - positive reply from node 1
2011-09-10 13:39:54 [MgmtSrvr] INFO -- Node 11: GCP Take over started
2011-09-10 13:39:54 [MgmtSrvr] INFO -- Node 11: Node 11 taking over as DICT master
2011-09-10 13:39:54 [MgmtSrvr] INFO -- Node 11: GCP Monitor: Computed max GCP_SAVE lag to 131 seconds
2011-09-10 13:39:54 [MgmtSrvr] INFO -- Node 11: GCP Monitor: Computed max GCP_COMMIT lag to 13 seconds
2011-09-10 13:39:54 [MgmtSrvr] INFO -- Node 11: GCP Take over completed
2011-09-10 13:39:54 [MgmtSrvr] INFO -- Node 11: kk: 2579228/1185 0 1
2011-09-10 13:39:54 [MgmtSrvr] INFO -- Node 11: LCP Take over started
2011-09-10 13:39:54 [MgmtSrvr] INFO -- Node 11: ParticipatingDIH = 0000000000000000
2011-09-10 13:39:54 [MgmtSrvr] INFO -- Node 11: ParticipatingLQH = 0000000000000000
2011-09-10 13:39:54 [MgmtSrvr] INFO -- Node 11: m_LCP_COMPLETE_REP_Counter_DIH = [SignalCounter: m_count=0 0000000000000000]
2011-09-10 13:39:54 [MgmtSrvr] INFO -- Node 11: m_LCP_COMPLETE_REP_Counter_LQH = [SignalCounter: m_count=0 0000000000000000]
2011-09-10 13:39:54 [MgmtSrvr] INFO -- Node 11: m_LAST_LCP_FRAG_ORD = [SignalCounter: m_count=0 0000000000000000]
2011-09-10 13:39:54 [MgmtSrvr] INFO -- Node 11: m_LCP_COMPLETE_REP_From_Master_Received = 1
2011-09-10 13:39:54 [MgmtSrvr] INFO -- Node 11: LCP Take over completed (state = 4)
2011-09-10 13:39:54 [MgmtSrvr] INFO -- Node 11: ParticipatingDIH = 0000000000000000
2011-09-10 13:39:54 [MgmtSrvr] INFO -- Node 11: ParticipatingLQH = 0000000000000000
2011-09-10 13:39:54 [MgmtSrvr] INFO -- Node 11: m_LCP_COMPLETE_REP_Counter_DIH = [SignalCounter: m_count=0 0000000000000000]
2011-09-10 13:39:54 [MgmtSrvr] INFO -- Node 11: m_LCP_COMPLETE_REP_Counter_LQH = [SignalCounter: m_count=0 0000000000000000]
2011-09-10 13:39:54 [MgmtSrvr] INFO -- Node 11: m_LAST_LCP_FRAG_ORD = [SignalCounter: m_count=0 0000000000000000]
2011-09-10 13:39:54 [MgmtSrvr] INFO -- Node 11: m_LCP_COMPLETE_REP_From_Master_Received = 1
2011-09-10 13:39:54 [MgmtSrvr] ALERT -- Node 11: Node 12 Disconnected
2011-09-10 13:39:54 [MgmtSrvr] INFO -- Node 11: Started arbitrator node 1 [ticket=637a00023622c997]
2011-09-10 13:39:57 [MgmtSrvr] INFO -- Node 11: Communication to Node 12 opened
2011-09-10 13:41:20 [MgmtSrvr] ALERT -- Node 12: Forced node shutdown completed. Occured during startphase 0. Initiated by signal 11.
2011-09-10 13:41:21 [MgmtSrvr] INFO -- Mgmt server state: nodeid 12 freed, m_reserved_nodes 1, 11, 111 and 112.
2011-09-10 14:33:04 [MgmtSrvr] WARNING -- Node 11: GCP Monitor: GCP_SAVE lag 60 seconds (max lag: 131s)
2011-09-10 14:34:08 [MgmtSrvr] WARNING -- Node 11: GCP Monitor: GCP_SAVE lag 120 seconds (max lag: 131s)
2011-09-10 14:34:26 [MgmtSrvr] ALERT -- Node 1: Node 11 Disconnected
2011-09-10 14:35:38 [MgmtSrvr] ALERT -- Node 11: Forced node shutdown completed. Occured during startphase 0. Initiated by signal 11.
2011-09-10 14:35:39 [MgmtSrvr] INFO -- Mgmt server state: nodeid 11 freed, m_reserved_nodes 1, 111 and 112.
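
For anyone triaging a cluster log like the one above, the following is a minimal sketch (not part of any MySQL tooling) that pulls the GCP_SAVE lag warnings and forced-shutdown events out of the log lines so the timeline leading up to a node failure is easier to see. The function names and regular expressions are assumptions written against the line format shown in this report; other NDB versions may format these messages differently.

```python
import re
from datetime import datetime

# Assumed line layout, taken from the cluster log in this report:
#   2011-09-10 14:33:04 [MgmtSrvr] WARNING -- Node 11: ...
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"\[MgmtSrvr\] (?P<level>[A-Z]+) -- (?P<msg>.*)$"
)
LAG_RE = re.compile(
    r"Node (\d+): GCP Monitor: GCP_SAVE lag (\d+) seconds \(max lag: (\d+)s\)"
)
SHUTDOWN_RE = re.compile(r"Node (\d+): Forced node shutdown completed")


def parse_cluster_log(lines):
    """Return (timestamp, level, message) tuples for lines matching LINE_RE."""
    events = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # skip anything that is not a management server log line
        ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S")
        events.append((ts, m.group("level"), m.group("msg")))
    return events


def gcp_save_lags(events):
    """Return (timestamp, node_id, lag_seconds, max_lag_seconds) per warning."""
    result = []
    for ts, _level, msg in events:
        m = LAG_RE.search(msg)
        if m:
            result.append((ts, int(m.group(1)), int(m.group(2)), int(m.group(3))))
    return result


def forced_shutdowns(events):
    """Return (timestamp, node_id) for each forced node shutdown message."""
    result = []
    for ts, _level, msg in events:
        m = SHUTDOWN_RE.search(msg)
        if m:
            result.append((ts, int(m.group(1))))
    return result


# Sample input: four lines copied verbatim from the cluster log in this report.
SAMPLE = [
    "2011-09-10 14:33:04 [MgmtSrvr] WARNING -- Node 11: GCP Monitor: GCP_SAVE lag 60 seconds (max lag: 131s)",
    "2011-09-10 14:34:08 [MgmtSrvr] WARNING -- Node 11: GCP Monitor: GCP_SAVE lag 120 seconds (max lag: 131s)",
    "2011-09-10 14:34:26 [MgmtSrvr] ALERT -- Node 1: Node 11 Disconnected",
    "2011-09-10 14:35:38 [MgmtSrvr] ALERT -- Node 11: Forced node shutdown completed. Occured during startphase 0. Initiated by signal 11.",
]

if __name__ == "__main__":
    events = parse_cluster_log(SAMPLE)
    for ts, node, lag, max_lag in gcp_save_lags(events):
        print(f"{ts} node {node}: GCP_SAVE lag {lag}s of max {max_lag}s")
    for ts, node in forced_shutdowns(events):
        print(f"{ts} node {node}: forced shutdown")
```

Run against the full cluster log file (for example by passing `open(path)` to `parse_cluster_log`), this lists each lag warning alongside the shutdown that follows it, which may help correlate the failures with load on the machine.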
How to repeat:
None
Suggested fix:
None