MySQL Bugs: #62899: NDB crashes with GCP Stop with no apparent reason, even when idle

Bug #62899	NDB crashes with GCP Stop with no apparent reason, even when idle
Submitted:	23 Oct 2011 20:10	Modified:	9 Jul 2015 12:07
Reporter:	Avi Bluestein	Email Updates:
Status:	No Feedback	Impact on me:	None
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S1 (Critical)
Version:	mysql-5.1.56 ndb-7.1.15a	OS:	Linux (CentOS 6)
Assigned to:	MySQL Verification Team	CPU Architecture:	Any
Tags:	crash, GCP, ndb, NDBCNTR

Description:
Machine: 4 x DualCore Xeon 7140 CPUs (IBM x3850)
RAM: 32GB 
Disk: 300GB (RAID 10)

The cluster comprises a single NDB node, running locally along with the management daemon and the MYSQLD process.

NDB crashes with the following error sporadically, even when idle:

Client Side:
-------------
java.sql.SQLException: Got temporary error 4010 'Node failure caused abort of transaction' from NDBCLUSTER

Database Side:
--------------
Time: Sunday 23 October 2011 - 20:49:11
Status: Temporary error, restart node
Message: System error, node killed during node restart by other node (Internal error, programming error or missing error message, please report a bug)
Error: 2303
Error data: Node 2 killed this node because GCP stop was detected
Error object: NDBCNTR (Line: 276) 0x00000002
Program: ndbmtd

The system does not recover and requires manual restart of the nodes.

NDB Output Log:
---------------
c_nodeStartMaster.blockGcp: 0 4294967040
m_gcp_save.m_counter: 44 m_gcp_save.m_max_lag: 1210
m_micro_gcp.m_counter: 41 m_micro_gcp.m_max_lag: 41
m_gcp_save.m_state: 0
m_gcp_save.m_master.m_state: 0
m_micro_gcp.m_state: 2
m_micro_gcp.m_master.m_state: 2
c_COPY_GCIREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_COPY_TABREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_CREATE_FRAGREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_DIH_SWITCH_REPLICA_REQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_EMPTY_LCP_REQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_GCP_COMMIT_Counter = [SignalCounter: m_count=1 0000000000000004]
c_GCP_PREPARE_Counter = [SignalCounter: m_count=0 0000000000000000]
c_GCP_SAVEREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_SUB_GCP_COMPLETE_REP_Counter = [SignalCounter: m_count=0 0000000000000000]
c_INCL_NODEREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_MASTER_GCPREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_MASTER_LCPREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_START_INFOREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_START_RECREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_STOP_ME_REQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_TC_CLOPSIZEREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_TCGETOPSIZEREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
m_copyReason: 0 m_waiting: 0 0
c_copyGCISlave: sender{Data, Ref} 2 f60002 reason: 0 nextWord: 0
Detected GCP stop(2)...sending kill to [SignalCounter: m_count=1 0000000000000004]
c_nodeStartMaster.blockGcp: 0 4294967040
m_gcp_save.m_counter: 0 m_gcp_save.m_max_lag: 1210
m_micro_gcp.m_counter: 0 m_micro_gcp.m_max_lag: 41
m_gcp_save.m_state: 0
m_gcp_save.m_master.m_state: 0
m_micro_gcp.m_state: 2
m_micro_gcp.m_master.m_state: 2
c_COPY_GCIREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_COPY_TABREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_CREATE_FRAGREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_DIH_SWITCH_REPLICA_REQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_EMPTY_LCP_REQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_GCP_COMMIT_Counter = [SignalCounter: m_count=1 0000000000000004]
c_GCP_PREPARE_Counter = [SignalCounter: m_count=0 0000000000000000]
c_GCP_SAVEREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_SUB_GCP_COMPLETE_REP_Counter = [SignalCounter: m_count=0 0000000000000000]
c_INCL_NODEREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_MASTER_GCPREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_MASTER_LCPREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_START_INFOREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_START_RECREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_STOP_ME_REQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_TC_CLOPSIZEREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_TCGETOPSIZEREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
m_copyReason: 0 m_waiting: 0 0
c_copyGCISlave: sender{Data, Ref} 2 f60002 reason: 0 nextWord: 0
file[0] status: 2 type: 1 reqStatus: 0 file1: 2 1 0
c_nodeStartMaster.blockGcp: 0 4294967040
m_gcp_save.m_counter: 0 m_gcp_save.m_max_lag: 1210
m_micro_gcp.m_counter: 0 m_micro_gcp.m_max_lag: 41
m_gcp_save.m_state: 0
m_gcp_save.m_master.m_state: 0
m_micro_gcp.m_state: 2
m_micro_gcp.m_master.m_state: 2
c_COPY_GCIREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_COPY_TABREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_CREATE_FRAGREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_DIH_SWITCH_REPLICA_REQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_EMPTY_LCP_REQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_GCP_COMMIT_Counter = [SignalCounter: m_count=1 0000000000000004]
c_GCP_PREPARE_Counter = [SignalCounter: m_count=0 0000000000000000]
c_GCP_SAVEREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_SUB_GCP_COMPLETE_REP_Counter = [SignalCounter: m_count=0 0000000000000000]
c_INCL_NODEREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_MASTER_GCPREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_MASTER_LCPREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_START_INFOREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_START_RECREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_STOP_ME_REQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_TC_CLOPSIZEREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_TCGETOPSIZEREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
m_copyReason: 0 m_waiting: 0 0
c_copyGCISlave: sender{Data, Ref} 2 f60002 reason: 0 nextWord: 0
2011-10-23 20:49:11 [ndbd] INFO     -- Node 2 killed this node because GCP stop was detected
2011-10-23 20:49:11 [ndbd] INFO     -- NDBCNTR (Line: 276) 0x00000002
2011-10-23 20:49:11 [ndbd] INFO     -- Error handler shutting down system
2011-10-23 20:49:11 [ndbd] INFO     -- Error handler shutdown completed - exiting
2011-10-23 20:49:18 [ndbd] ALERT    -- Node 2: Forced node shutdown completed. Caused by error 2303: 'System error, node killed during node restart by other node(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

How to repeat:
No clear procedure as this occurs randomly.
This may be possible to repeat with our specific database and system.

Trace logs are available.

Hi,

Thanks for the report. 

I order to analyze this crash properly we need a whole set of logs, not just the logs of the crashed node. ndb_error_reporter tool should be used to collect the logs.

What I can tell even before analyzing the logs, we fixed number of LCP and GCP bugs since 7.2.15a so I suggest you plan the upgrade to latest 7.2 at your earliest convenience. 

kind regards
Bogdan Kecman

No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".