Description:
Machine: 4 x DualCore Xeon 7140 CPUs (IBM x3850)
RAM: 32GB
Disk: 300GB (RAID 10)
The cluster comprises a single NDB node, running on the same machine as the management daemon and the mysqld process.
NDB crashes sporadically with the following error, even when the system is idle:
Client Side:
-------------
java.sql.SQLException: Got temporary error 4010 'Node failure caused abort of transaction' from NDBCLUSTER
Database Side:
--------------
Time: Sunday 23 October 2011 - 20:49:11
Status: Temporary error, restart node
Message: System error, node killed during node restart by other node (Internal error, programming error or missing error message, please report a bug)
Error: 2303
Error data: Node 2 killed this node because GCP stop was detected
Error object: NDBCNTR (Line: 276) 0x00000002
Program: ndbmtd
The system does not recover on its own and requires a manual restart of the nodes.
NDB Output Log:
---------------
c_nodeStartMaster.blockGcp: 0 4294967040
m_gcp_save.m_counter: 44 m_gcp_save.m_max_lag: 1210
m_micro_gcp.m_counter: 41 m_micro_gcp.m_max_lag: 41
m_gcp_save.m_state: 0
m_gcp_save.m_master.m_state: 0
m_micro_gcp.m_state: 2
m_micro_gcp.m_master.m_state: 2
c_COPY_GCIREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_COPY_TABREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_CREATE_FRAGREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_DIH_SWITCH_REPLICA_REQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_EMPTY_LCP_REQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_GCP_COMMIT_Counter = [SignalCounter: m_count=1 0000000000000004]
c_GCP_PREPARE_Counter = [SignalCounter: m_count=0 0000000000000000]
c_GCP_SAVEREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_SUB_GCP_COMPLETE_REP_Counter = [SignalCounter: m_count=0 0000000000000000]
c_INCL_NODEREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_MASTER_GCPREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_MASTER_LCPREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_START_INFOREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_START_RECREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_STOP_ME_REQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_TC_CLOPSIZEREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_TCGETOPSIZEREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
m_copyReason: 0 m_waiting: 0 0
c_copyGCISlave: sender{Data, Ref} 2 f60002 reason: 0 nextWord: 0
Detected GCP stop(2)...sending kill to [SignalCounter: m_count=1 0000000000000004]
c_nodeStartMaster.blockGcp: 0 4294967040
m_gcp_save.m_counter: 0 m_gcp_save.m_max_lag: 1210
m_micro_gcp.m_counter: 0 m_micro_gcp.m_max_lag: 41
m_gcp_save.m_state: 0
m_gcp_save.m_master.m_state: 0
m_micro_gcp.m_state: 2
m_micro_gcp.m_master.m_state: 2
c_COPY_GCIREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_COPY_TABREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_CREATE_FRAGREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_DIH_SWITCH_REPLICA_REQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_EMPTY_LCP_REQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_GCP_COMMIT_Counter = [SignalCounter: m_count=1 0000000000000004]
c_GCP_PREPARE_Counter = [SignalCounter: m_count=0 0000000000000000]
c_GCP_SAVEREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_SUB_GCP_COMPLETE_REP_Counter = [SignalCounter: m_count=0 0000000000000000]
c_INCL_NODEREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_MASTER_GCPREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_MASTER_LCPREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_START_INFOREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_START_RECREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_STOP_ME_REQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_TC_CLOPSIZEREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_TCGETOPSIZEREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
m_copyReason: 0 m_waiting: 0 0
c_copyGCISlave: sender{Data, Ref} 2 f60002 reason: 0 nextWord: 0
file[0] status: 2 type: 1 reqStatus: 0 file1: 2 1 0
c_nodeStartMaster.blockGcp: 0 4294967040
m_gcp_save.m_counter: 0 m_gcp_save.m_max_lag: 1210
m_micro_gcp.m_counter: 0 m_micro_gcp.m_max_lag: 41
m_gcp_save.m_state: 0
m_gcp_save.m_master.m_state: 0
m_micro_gcp.m_state: 2
m_micro_gcp.m_master.m_state: 2
c_COPY_GCIREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_COPY_TABREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_CREATE_FRAGREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_DIH_SWITCH_REPLICA_REQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_EMPTY_LCP_REQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_GCP_COMMIT_Counter = [SignalCounter: m_count=1 0000000000000004]
c_GCP_PREPARE_Counter = [SignalCounter: m_count=0 0000000000000000]
c_GCP_SAVEREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_SUB_GCP_COMPLETE_REP_Counter = [SignalCounter: m_count=0 0000000000000000]
c_INCL_NODEREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_MASTER_GCPREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_MASTER_LCPREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_START_INFOREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_START_RECREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_STOP_ME_REQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_TC_CLOPSIZEREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_TCGETOPSIZEREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
m_copyReason: 0 m_waiting: 0 0
c_copyGCISlave: sender{Data, Ref} 2 f60002 reason: 0 nextWord: 0
2011-10-23 20:49:11 [ndbd] INFO -- Node 2 killed this node because GCP stop was detected
2011-10-23 20:49:11 [ndbd] INFO -- NDBCNTR (Line: 276) 0x00000002
2011-10-23 20:49:11 [ndbd] INFO -- Error handler shutting down system
2011-10-23 20:49:11 [ndbd] INFO -- Error handler shutdown completed - exiting
2011-10-23 20:49:18 [ndbd] ALERT -- Node 2: Forced node shutdown completed. Caused by error 2303: 'System error, node killed during node restart by other node(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
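For readers triaging the dump above, here is a minimal Python sketch that extracts the two GCP monitor counters and decodes the SignalCounter masks. It rests on two assumptions that are not confirmed anywhere in this report: that a monitor is considered stalled once its m_counter reaches its m_max_lag, and that the hex field in a SignalCounter is a node bitmask in which bit n stands for node n. The function names are hypothetical and the input is an excerpt copied from the first dump.

```python
import re

# Excerpt copied from the first state dump in the log above.
DUMP = """\
m_gcp_save.m_counter: 44 m_gcp_save.m_max_lag: 1210
m_micro_gcp.m_counter: 41 m_micro_gcp.m_max_lag: 41
c_GCP_COMMIT_Counter = [SignalCounter: m_count=1 0000000000000004]
c_GCP_PREPARE_Counter = [SignalCounter: m_count=0 0000000000000000]
"""

MONITOR_RE = re.compile(r"(m_\w+)\.m_counter: (\d+) \1\.m_max_lag: (\d+)")
COUNTER_RE = re.compile(
    r"(c_\w+) = \[SignalCounter: m_count=(\d+) ([0-9a-fA-F]+)\]"
)


def monitor_lags(text):
    """Return {monitor_name: (counter, max_lag)} for each GCP monitor line."""
    return {
        m.group(1): (int(m.group(2)), int(m.group(3)))
        for m in MONITOR_RE.finditer(text)
    }


def stalled_monitors(text):
    """List monitors whose counter has reached max_lag (assumed stop rule)."""
    return [
        name
        for name, (counter, max_lag) in monitor_lags(text).items()
        if max_lag > 0 and counter >= max_lag
    ]


def waiting_nodes(text):
    """Map each non-empty SignalCounter to node ids (assumes bit n = node n)."""
    result = {}
    for name, _count, mask in COUNTER_RE.findall(text):
        bits = int(mask, 16)
        nodes = [n for n in range(bits.bit_length()) if (bits >> n) & 1]
        if nodes:
            result[name] = nodes
    return result


if __name__ == "__main__":
    print("stalled monitors:", stalled_monitors(DUMP))
    print("outstanding signals:", waiting_nodes(DUMP))
```

Under those assumptions, the first dump would point at the micro GCP monitor (counter 41 against a max lag of 41) with a GCP_COMMIT still outstanding from node 2, which is consistent with the "Node 2 killed this node" message but does not by itself explain why the commit stalled.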
How to repeat:
There is no clear procedure, as the crash occurs at random.
It may be reproducible with our specific database and system.
Trace logs are available.