Bug #55473 Cluster Crash with "Forced node shutdown completed. Caused by error 2303"
Submitted: 22 Jul 2010 8:04    Modified: 3 May 2012 6:32
Reporter: Christian Ehmig
Status: Duplicate
Category: MySQL Cluster: Cluster (NDB) storage engine    Severity: S1 (Critical)
Version: 5.1.44-ndb-7.1.4b    OS: Linux (Ubuntu Kernel 2.6.32)
Assigned to:    CPU Architecture: Any
Tags: crash, error 2302, ndbdmtd

[22 Jul 2010 8:04] Christian Ehmig
Description:
While benchmarking a new cluster setup (4 multi-threaded data nodes, 4 API nodes) with massive inserts/updates, we encountered a sudden crash of the whole cluster. We had inserted and updated around 500,000,000 rows before the crash occurred.

How to repeat:
Currently we cannot provide steps to repeat the crash, but we will answer any questions that may help track down the bug. All necessary trace and crash files are attached.
[22 Jul 2010 8:04] Christian Ehmig
set operating system
[22 Jul 2010 8:15] Christian Ehmig
ndb_mgm and api node config files

Attachment: bug-data-55473-config.tgz (application/x-compressed, text), 3.94 KiB.

[22 Jul 2010 8:15] Christian Ehmig
ndb log files (data node 1)

Attachment: bug-data-55473-ndb-logs.tgz (application/x-compressed, text), 281.38 KiB.

[22 Jul 2010 8:58] Christian Ehmig
excerpt from the ndb_out log (complete log attached):

sendbufferpool waiting for lock, contentions: 665000 spins: 26060765
sendbufferpool waiting for lock, contentions: 670000 spins: 26280750
sendbufferpool waiting for lock, contentions: 675000 spins: 26498671
sendbufferpool waiting for lock, contentions: 680000 spins: 26685353
sendbufferpool waiting for lock, contentions: 685000 spins: 26828572
send lock node 3 waiting for lock, contentions: 115000 spins: 320219680
sendbufferpool waiting for lock, contentions: 690000 spins: 26984735
sendbufferpool waiting for lock, contentions: 695000 spins: 27174093
sendbufferpool waiting for lock, contentions: 700000 spins: 27307804
sendbufferpool waiting for lock, contentions: 705000 spins: 27514391
c_nodeStartMaster.blockGcp: 0 4294967040
m_gcp_save.m_counter: 1210 m_gcp_save.m_max_lag: 1210
m_micro_gcp.m_counter: 1 m_micro_gcp.m_max_lag: 101
m_gcp_save.m_state: 1
m_gcp_save.m_master.m_state: 1
m_micro_gcp.m_state: 0
m_micro_gcp.m_master.m_state: 0
c_COPY_GCIREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_COPY_TABREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_CREATE_FRAGREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_DIH_SWITCH_REPLICA_REQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_EMPTY_LCP_REQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_GCP_COMMIT_Counter = [SignalCounter: m_count=0 0000000000000000]
c_GCP_PREPARE_Counter = [SignalCounter: m_count=0 0000000000000000]
c_GCP_SAVEREQ_Counter = [SignalCounter: m_count=2 000000000000000c]
c_INCL_NODEREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_MASTER_GCPREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_MASTER_LCPREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_START_INFOREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_START_RECREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_STOP_ME_REQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_TC_CLOPSIZEREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_TCGETOPSIZEREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
m_copyReason: 0 m_waiting: 0 0
c_copyGCISlave: sender{Data, Ref} 2 f60002 reason: 0 nextWord: 0
Detected GCP stop(1)...sending kill to [SignalCounter: m_count=2 000000000000000c]
LP 0 blockInstance: 2 partNo: 1 state: 0 WW_Gci: 1 gcprec: -256 flq: -256 currfile: 31 tailFileNo: 21 logTailMbyte: 78
LP 0 blockInstance: 1 partNo: 0 state: 0 WW_Gci: 1 gcprec: -256 flq: -256 currfile: 32 tailFileNo: 21 logTailMbyte: 118  file 0(0)  FileChangeState: 0  logFileStatus: 20  currentMbyte: 255  currentFilepage 8191
[10 Aug 2010 13:28] Hartmut Holzgraefe
From looking at the ndb out log, it seems that you have committed a massive transaction. This can cause problems because committing takes time, and for large transactions touching many rows the commit can take longer than the cluster is willing to wait before starting the next global checkpoint (so as not to lose too much durability) ...

Can you confirm that your transactions are rather large (e.g. > 10,000 rows)?
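
For reference, the timeouts involved here are the global checkpoint parameters in the [ndbd default] section of the management server's config.ini. The sketch below is only illustrative: the values shown are the usual documented defaults for this version line, not the settings from the attached configuration, and raising them only buys headroom rather than fixing commits that genuinely take too long.

# Illustrative config.ini excerpt (typical defaults, not the reporter's
# attached settings):
[ndbd default]
# How often a global checkpoint writes the redo log to disk (ms).
TimeBetweenGlobalCheckpoints = 2000
# How often an epoch (micro-GCP) is committed cluster-wide (ms).
TimeBetweenEpochs = 100
# If a node fails to complete its part of a global checkpoint within
# this time it is shut down -- the "Detected GCP stop" seen in the
# attached ndb_out log (ms).
TimeBetweenEpochsTimeout = 4000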
[24 Aug 2010 7:58] Christian Ehmig
I can confirm that we use rather large transactions, but only around 500 rows per transaction at most (bulk inserts). Besides that, we use many threads for writing (around 150 on average).

This bug only occurs with the pre-built cluster binaries - I have recently built mysql-cluster from source for our target platform and everything seems to work fine so far, so the issue might be related to the binaries.
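
To make the reported write pattern concrete, a bulk-insert transaction of the kind described (roughly 500 rows per statement, issued from around 150 concurrent writer threads) could look like the hypothetical sketch below; the table, columns, and values are invented for illustration and are not the reporter's actual schema.

-- Hypothetical NDB table and writer transaction; names and values are
-- invented for illustration only.
CREATE TABLE bench_events (
    id         BIGINT UNSIGNED NOT NULL,
    payload    VARBINARY(255)  NOT NULL,
    updated_at TIMESTAMP       NOT NULL,
    PRIMARY KEY (id)
) ENGINE=NDBCLUSTER;

-- Each of the ~150 writer threads repeats a transaction like this:
BEGIN;
INSERT INTO bench_events (id, payload, updated_at) VALUES
    (1, 'row-1', NOW()),
    (2, 'row-2', NOW()),
    -- ... roughly 500 rows per transaction ...
    (500, 'row-500', NOW())
ON DUPLICATE KEY UPDATE
    payload    = VALUES(payload),
    updated_at = VALUES(updated_at);
COMMIT;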
[3 May 2012 6:32] Gustaf Thorslund
Duplicate of bug #37227 and bug #58383