MySQL Bugs: #37227: Node 3 killed this node because GCP stop was detected

Bug #37227	Node 3 killed this node because GCP stop was detected
Submitted:	5 Jun 2008 15:08	Modified:	7 Jun 2011 13:54
Reporter:	Mark Malakanov	Email Updates:
Status:	Duplicate	Impact on me:	None
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S2 (Serious)
Version:	mysql-5.1-telco-6.2	OS:	Linux (F8, 2.6.24.7-92.fc8 SMP x86_64)
Assigned to:		CPU Architecture:	Any
Tags:	GCP, mysql-5.1.23 ndb-6.2.15 -> ndb-6.4.0

Description:
One or another node fails after some period of production work. Usually after 12-24 hours of working.

Disk tables are used.

Following records appear in logs

ndb_3_error.log

Time: Thursday 5 June 2008 - 06:07:30
Status: Temporary error, restart node
Message: System error, node killed during node restart by other node (Internal e
rror, programming error or missing error message, please report a bug)
Error: 2303
Error data: Node 3 killed this node because GCP stop was detected
Error object: NDBCNTR (Line: 235) 0x0000000a
Program: /usr/sbin/ndbd
Pid: 2140
Trace: /usr/local/mysql/data/ndb_3_trace.log.3
Version: mysql-5.1.23 ndb-6.2.15
***EOM***

ndb_3_out.log

2008-06-05 06:07:31 [ndbd] INFO     -- Node 3 killed this node because GCP stop
was detected
2008-06-05 06:07:31 [ndbd] INFO     -- NDBCNTR (Line: 235) 0x0000000a
2008-06-05 06:07:31 [ndbd] INFO     -- Error handler shutting down system
2008-06-05 06:07:32 [ndbd] INFO     -- Error handler shutdown completed - exitin
g
2008-06-05 06:07:32 [ndbd] ALERT    -- Node 3: Forced node shutdown completed. C
aused by error 2303: 'System error, node killed during node restart by other nod
e(Internal error, programming error or missing error message, please report a bu
g). Temporary error, restart node'.

How to repeat:
it is hard to predict

ndb config

Attachment: my-ndb.cfg (text/plain), 4.34 KiB.

logs

Attachment: nb_3_logs.zip (application/x-zip-compressed, text), 357.08 KiB.

I meet the same question. hey, how to fixed it?

Hi,

We have found this at a customer site aswell, and suspect that it's ntp
causing problem.

Therefor we changed to using CLOCK_MONOTONIC, this patch is in 6.3.16
but we have made no new 6.2.x-release since.

Could any/both of you investigate if this could be a cause?

/Jonas

No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".

same problem on customer site:

server1: datanode1 + datanode3
server2: datanode2 + datanode4
server3: mgm node + api node
server4: mgm node + api node
server5-x: api node

restart server2

and datanode1 dies with:
Error: 2303
Error data: Node 1 killed this node because GCP stop was detected
Error object: NDBCNTR (Line: 235) 0x00000002

If you kill -9 all ndbd processes on server2 problem does not occur but shutdown -rf now get's it every time.

ALL servers have time synchronised (ntpd)

Different error, but similar effect..

Attachment: ndb_error_report_20081116135307.tar.bz2 (application/octet-stream, text), 49.33 KiB.

Similar effect, but different error, using MySQL Cluster 6.3.17:

Doing a `shutdown -rf now` on the machine running Data Node 3 and 5, they are losing heartbeats:

2008-11-16 13:27:11 [MgmSrvr] WARNING  -- Node 3: Node 1 missed heartbeat 2
2008-11-16 13:27:11 [MgmSrvr] WARNING  -- Node 5: Node 1 missed heartbeat 2
2008-11-16 13:27:12 [MgmSrvr] WARNING  -- Node 4: Node 3 missed heartbeat 2
2008-11-16 13:27:13 [MgmSrvr] WARNING  -- Node 4: Node 3 missed heartbeat 3

Shortly after:

2008-11-16 13:27:22 [MgmSrvr] ALERT    -- Node 2: Forced node shutdown completed. Caused by error 2315: 'Node declared dead. See error log for details(Arbitration error). Temporary error, restart node'.
2008-11-16 13:27:48 [MgmSrvr] ALERT    -- Node 4: Forced node shutdown completed. Caused by error 2305: 'Node lost connection to other nodes and can not form a unpartitioned cluster, please investigate if there are error(s) on other node(s)(Arbitration error). Temporary error, restart node'.

Now, I dod the same thing just minutes laters (parts of log):

2008-11-16 13:42:58 [MgmSrvr] INFO     -- Node 3: Node shutdown completed. Initiated by signal 15.
2008-11-16 13:42:59 [MgmSrvr] INFO     -- Node 2: Possible bug in Dbdih::execBLOCK_COMMIT_ORD c_blockCommit = 1 c_blockCommitNo = 2 sig->failNo =
2008-11-16 13:42:59 [MgmSrvr] INFO     -- Node 4: Possible bug in Dbdih::execBLOCK_COMMIT_ORD c_blockCommit = 1 c_blockCommitNo = 2 sig->failNo =
2008-11-16 13:42:59 [MgmSrvr] INFO     -- Node 5: Node shutdown completed. Initiated by signal 15.

So, one time it has heartbeats failures, and it takes the other nodes with it. Later on signal 15 was recognized and others nodes didn't fail.

Logs attached.

Hi,

I am now hitting this pretty often in ACRT using NDBAtomics.

Current byte-offset of file-pointer is: 568                       

Time: Thursday 18 December 2008 - 03:10:55
Status: Temporary error, restart node
Message: System error, node killed during node restart by other node (Internal error, programming error or missing error message, please report a bug)
Error: 2303
Error data: Node 3 killed this node because GCP stop was detected
Error object: NDBCNTR (Line: 255) 0x00000006
Program: /data0/cr_autotest/libexec/ndbd
Pid: 25606
Trace: ./ndb_3_trace.log.1
Version: mysql-5.1.30 ndb-6.4.0-alpha
***EOM***

TPCC-DD

result-2-dn-dd-mysql-5.1-telco-6.4/2008-12-24_01:41/result.1/ndbd.2/ndb_3_error.log

Time: Wednesday 24 December 2008 - 04:05:04
Status: Temporary error, restart node
Message: System error, node killed during node restart by other node (Internal error, programming error or missing error message, please report a bug)
Error: 2303
Error data: Node 3 killed this node because GCP stop was detected
Error object: NDBCNTR (Line: 255) 0x00000006
Program: /data0/cr_autotest/libexec/ndbd
Pid: 14075
Trace: ./ndb_3_trace.log.1
Version: mysql-5.1.30 ndb-6.4.1-beta
***EOM***

TPCC-DD

result-2-dn-perf-mysql-5.1-telco-6.4/2008-12-24_02:12/result.6/ndbd.2/ndb_3_error.log

Time: Wednesday 24 December 2008 - 06:48:34
Status: Temporary error, restart node
Message: System error, node killed during node restart by other node (Internal error, programming error or missing error message, please report a bug)
Error: 2303
Error data: Node 3 killed this node because GCP stop was detected
Error object: NDBCNTR (Line: 255) 0x00000006
Program: /data0/cr_autotest/libexec/ndbd
Pid: 16250
Trace: ./ndb_3_trace.log.1
Version: mysql-5.1.30 ndb-6.4.1-beta

File: result-2-dn-perf-mysql-5.1-telco-6.4/2008-12-24_02:12/result.6/ndb_api.1/log.out

+ command_exec '/data0/cr_autotest/bin/mysql  -h ndb18 -u root --port 15001 --protocol=tcp -e "SELECT LOGFILE_GROUP_NAME, LOGFILE_GROUP_NUMBER, EXTRA FROM INFORMATION_SCHEMA.FILES"'
+ '[' -n '' ']'
+ eval '/data0/cr_autotest/bin/mysql  -h ndb18 -u root --port 15001 --protocol=tcp -e "SELECT LOGFILE_GROUP_NAME, LOGFILE_GROUP_NUMBER, EXTRA FROM INFORMATION_SCHEMA.FILES"'
++ /data0/cr_autotest/bin/mysql -h ndb18 -u root --port 15001 --protocol=tcp -e 'SELECT LOGFILE_GROUP_NAME, LOGFILE_GROUP_NUMBER, EXTRA FROM INFORMATION_SCHEMA.FILES'
+ rc=0
+ '[' 0 -ne 0 ']'
+ command_exec '/data0/cr_autotest/bin/mysql  -h ndb18 -u root --port 15001 --protocol=tcp -e "CREATE LOGFILE GROUP lg1 ADD UNDOFILE '\''undofile.dat'\'' initial_size 512M ENGINE=NDB"'
+ '[' -n '' ']'
+ eval '/data0/cr_autotest/bin/mysql  -h ndb18 -u root --port 15001 --protocol=tcp -e "CREATE LOGFILE GROUP lg1 ADD UNDOFILE '\''undofile.dat'\'' initial_size 512M ENGINE=NDB"'
++ /data0/cr_autotest/bin/mysql -h ndb18 -u root --port 15001 --protocol=tcp -e 'CREATE LOGFILE GROUP lg1 ADD UNDOFILE '\''undofile.dat'\'' initial_size 512M ENGINE=NDB'
+ rc=0
+ '[' 0 -ne 0 ']'
+ command_exec '/data0/cr_autotest/bin/mysql  -h ndb18 -u root --port 15001 --protocol=tcp -e "CREATE TABLESPACE ts1 ADD DATAFILE '\''datafile.dat'\'' USE LOGFILE GROUP lg1 initial_size 3G ENGINE=NDB"'
+ '[' -n '' ']'
+ eval '/data0/cr_autotest/bin/mysql  -h ndb18 -u root --port 15001 --protocol=tcp -e "CREATE TABLESPACE ts1 ADD DATAFILE '\''datafile.dat'\'' USE LOGFILE GROUP lg1 initial_size 3G ENGINE=NDB"'
++ /data0/cr_autotest/bin/mysql -h ndb18 -u root --port 15001 --protocol=tcp -e 'CREATE TABLESPACE ts1 ADD DATAFILE '\''datafile.dat'\'' USE LOGFILE GROUP lg1 initial_size 3G ENGINE=NDB'
+ rc=0
+ '[' 0 -ne 0 ']'
+ '[' 1 -gt 1 ']'
+ '[' X '!=' X ']'
+ echo 'Creation of DISK DATA Complete'
+ exit 0
ERRORERROR 1297 (HY000) at line 1: Got temporary error 4025 'Node failure caused abort of transaction' from NDBCLUSTER
 1297 (HY000) at line 1: Got temporary error 4025 'Node failure caused abort of transaction' from NDBCLUSTER

ndbatomics-dd

 result-2-dn-dd-mysql-5.1-telco-6.4/2008-12-24_16:49/result.1/ndbd.1/ndb_2_error.log

2008-12-24 19:44:27 [MgmSrvr] ALERT    -- Node 3: Node 2 Disconnected
2008-12-24 19:44:27 [MgmSrvr] INFO     -- Node 3: Communication to Node 2 closed
2008-12-24 19:44:27 [MgmSrvr] ALERT    -- Node 3: Network partitioning - arbitration required
2008-12-24 19:44:27 [MgmSrvr] INFO     -- Node 3: President restarts arbitration thread [state=7]
2008-12-24 19:44:27 [MgmSrvr] ALERT    -- Node 1: Node 2 Disconnected
2008-12-24 19:44:27 [MgmSrvr] ALERT    -- Node 3: Arbitration won - positive reply from node 1
2008-12-24 19:44:27 [MgmSrvr] INFO     -- Node 3: GCP Take over started
2008-12-24 19:44:27 [MgmSrvr] INFO     -- Node 3: Node 3 taking over as DICT master
2008-12-24 19:44:27 [MgmSrvr] INFO     -- Node 3: GCP Take over completed
2008-12-24 19:44:27 [MgmSrvr] INFO     -- Node 3: kk: 976/9 2 0
2008-12-24 19:44:27 [MgmSrvr] ALERT    -- Node 2: Forced node shutdown completed. Caused by error 2303: 'System error, node killed during node restart by other node(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2008-12-24 19:44:27 [MgmSrvr] ALERT    -- Node 1: Node 2 Disconnected
2008-12-24 19:44:27 [MgmSrvr] INFO     -- Node 3: Started arbitrator node 1 [ticket=52ea00024f0c15d4]
2008-12-24 19:44:28 [MgmSrvr] ALERT    -- Node 3: Node 5 Disconnected
2008-12-24 19:44:28 [MgmSrvr] INFO     -- Node 3: Communication to Node 5 closed
2008-12-24 19:44:28 [MgmSrvr] INFO     -- Mgmt server state: nodeid 5 freed, m_reserved_nodes 0000000000000000000000000000000000000000000000000000000000000012.

Time: Wednesday 24 December 2008 - 19:44:27
Status: Temporary error, restart node
Message: System error, node killed during node restart by other node (Internal error, programming error or missing error message, please report a bug)
Error: 2303
Error data: Node 2 killed this node because GCP stop was detected
Error object: NDBCNTR (Line: 255) 0x00000006
Program: /data0/cr_autotest/libexec/ndbd
Pid: 24187
Trace: ./ndb_2_trace.log.1
Version: mysql-5.1.30 ndb-6.4.1-beta
***EOM***

Current byte-offset of file-pointer is: 568                       

Time: Wednesday 7 January 2009 - 11:42:34
Status: Temporary error, restart node
Message: System error, node killed during node restart by other node (Internal error, programming error or missing error message, please report a bug)
Error: 2303
Error data: Node 3 killed this node because GCP stop was detected
Error object: NDBCNTR (Line: 255) 0x00000006
Program: /data0/cr_autotest/libexec/ndbmtd
Pid: 22106
Trace: ./ndb_3_trace.log.1 ./ndb_3_trace.log.1_t1 ./ndb_3_trace.log.1_t2 ./ndb_3_trace.log

ndb_3_trace.log.1

DBTC    004430 
DBTC    000190 000222 004850 004850 004850 004850 004850 004850 
        004850 004850 004850 004850 004850 004850 004850 004850 
        004850 004853 
DBTC    004430 
DBTC    000190 000222 004850 004850 004850 004850 004850 004850 
        004850 004850 004873 
DBTC    004430 
DBTC    004430 
QMGR    003059 003197 
DBTC    004430 
DBTC    004430 016016 013076 
NDBCNTR 000210 000255 

--------------- Signal ----------------
r.bn: 246 "DBDIH", r.proc: 3, r.sigId: 114006583 gsn: 465 "DUMP_STATE_ORD" prio: 0
s.bn: 246 "DBDIH", s.proc: 2, s.sigId: 101194834 length: 1 trace: 0 #sec: 0 fragInf: 0
 H'00001b6e
--------------- Signal ----------------
r.bn: 252 "QMGR", r.proc: 3, r.sigId: 114006582 gsn: 3 "API_REGREQ" prio: 0
s.bn: 4002 "API", s.proc: 1, s.sigId: 0 length: 3 trace: 0 #sec: 0 fragInf: 0
 H'0fa20001 H'00060401 H'0005011e
--------------- Signal ----------------
r.bn: 245 "DBTC", r.proc: 3, r.sigId: 114006581 gsn: 164 "CONTINUEB" prio: 1
s.bn: 245 "DBTC", s.proc: 3, s.sigId: 114006580 length: 3 trace: 1 #sec: 0 fragInf: 0
 H'00000006 H'0000003b H'0003db5b
--------------- Signal ----------------
r.bn: 245 "DBTC", r.proc: 3, r.sigId: 114006580 gsn: 164 "CONTINUEB" prio: 1
s.bn: 245 "DBTC", s.proc: 3, s.sigId: 114006579 length: 3 trace: 1 #sec: 0 fragInf: 0
 H'00000006 H'0000003b H'0003da7f
--------------- Signal ----------------
r.bn: 245 "DBTC", r.proc: 3, r.sigId: 114006579 gsn: 164 "CONTINUEB" prio: 1
s.bn: 245 "DBTC", s.proc: 3, s.sigId: 114006577 length: 3 trace: 1 #sec: 0 fragInf: 0
 H'00000006 H'0000003b H'0003dcc5
--------------- Signal ----------------
r.bn: 245 "DBTC", r.proc: 3, r.sigId: 114006578 gsn: 160 "COMPLETECONF" prio: 1
s.bn: 247/1 "DBLQH", s.proc: 3, s.sigId: 336796444 length: 4 trace: 1 #sec: 0 fragInf: 0
 H'0003e002 H'00000003 H'0001f751 H'00300500
--------------- Signal ----------------

ndb_3_trace.log.1_t1

LGMAN   001440 001457 001461 001467 
TSMAN   001937 000454 
PGMAN   001600 000153 000153 
LGMAN   001797 

--------------- Signal ----------------
r.bn: 247 "DBLQH", r.proc: 3, r.sigId: 883179 gsn: 761 "STOP_FOR_CRASH" prio: 0
s.bn: 0 "SYS", s.proc: 0, s.sigId: 0 length: 1 trace: 0 #sec: 0 fragInf: 0
 H'00000000
--------------- Signal ----------------
r.bn: 260 "LGMAN", r.proc: 3, r.sigId: 883178 gsn: 270 "FSWRITECONF" prio: 1
s.bn: 253 "NDBFS", s.proc: 3, s.sigId: 114006521 length: 1 trace: 0 #sec: 0 fragInf: 0
 UserPointer: -2145861632
--------------- Signal ----------------
r.bn: 247 "DBLQH", r.proc: 3, r.sigId: 883177 gsn: 409 "TIME_SIGNAL" prio: 1
s.bn: 252 "QMGR", s.proc: 3, s.sigId: 114006520 length: 1 trace: 0 #sec: 0 fragInf: 0
 H'00000004
--------------- Signal ----------------
r.bn: 260 "LGMAN", r.proc: 3, r.sigId: 883176 gsn: 164 "CONTINUEB" prio: 1
s.bn: 260 "LGMAN", s.proc: 3, s.sigId: 883175 length: 3 trace: 0 #sec: 0 fragInf: 0
 H'00000002 H'1890c000 H'00000000
--------------- Signal ----------------
r.bn: 260 "LGMAN", r.proc: 3, r.sigId: 883175 gsn: 164 "CONTINUEB" prio: 0
s.bn: 260 "LGMAN", s.proc: 3, s.sigId: 883173 length: 3 trace: 0 #sec: 0 fragInf: 0
 H'00000002 H'1890c000 H'00000001

ndb_3_trace.log.1_t2

DBLQH   002801 
DBTUP   010029 
DBLQH   002801 
DBTUP   010029 
DBLQH   002801 
DBTUP   010029 

--------------- Signal ----------------
r.bn: 247/1 "DBLQH", r.proc: 3, r.sigId: 336796598 gsn: 761 "STOP_FOR_CRASH" prio: 0
s.bn: 0 "SYS", s.proc: 0, s.sigId: 0 length: 1 trace: 0 #sec: 0 fragInf: 0
 H'00000000
--------------- Signal ----------------
r.bn: 248/1 "DBACC", r.proc: 3, r.sigId: 336796597 gsn: 253 "EXPANDCHECK2" prio: 1
s.bn: 248/1 "DBACC", s.proc: 3, s.sigId: 336796596 length: 3 trace: 1 #sec: 0 fragInf: 0
 H'00000016 H'00001ae2 H'00003fff
--------------- Signal ----------------
r.bn: 248/1 "DBACC", r.proc: 3, r.sigId: 336796596 gsn: 253 "EXPANDCHECK2" prio: 1
s.bn: 248/1 "DBACC", s.proc: 3, s.sigId: 336796595 length: 3 trace: 1 #sec: 0 fragInf: 0
 H'00000016 H'00001ae1 H'00003fff
--------------- Signal ----------------
r.bn: 248/1 "DBACC", r.proc: 3, r.sigId: 336796595 gsn: 253 "EXPANDCHECK2" prio: 1
s.bn: 248/1 "DBACC", s.proc: 3, s.sigId: 336796594 length: 3 trace: 1 #sec: 0 fragInf: 0
 H'00000016 H'00001ae0 H'00003fff
--------------- Signal ----------------
r.bn: 248/1 "DBACC", r.proc: 3, r.sigId: 336796594 gsn: 253 "EXPANDCHECK2" prio: 1
s.bn: 248/1 "DBACC", s.proc: 3, s.sigId: 336796593 length: 3 trace: 1 #sec: 0 fragInf: 0
 H'00000016 H'00001adf H'00003fff
--------------- Signal ----------------
r.bn: 248/1 "DBACC", r.proc: 3, r.sigId: 336796593 gsn: 253 "EXPANDCHECK2" prio: 1
s.bn: 248/1 "DBACC", s.proc: 3, s.sigId: 336796592 length: 3 trace: 1 #sec: 0 fragInf: 0
 H'00000016 H'00001ade H'00003fff
--------------- Signal ----------------
r.bn: 248/1 "DBACC", r.proc: 3, r.sigId: 336796592 gsn: 253 "EXPANDCHECK2" prio: 1
s.bn: 248/1 "DBACC", s.proc: 3, s.sigId: 336796527 length: 3 trace: 1 #sec: 0 fragInf: 0
 H'00000016 H'00001add H'00003fff
--------------- Signal ----------------
r.bn: 247/3 "DBLQH", r.proc: 3, r.sigId: 336796591 gsn: 152 "COMMIT" prio: 1
s.bn: 247/1 "DBLQH", s.proc: 2, s.sigId: 242794995 length: 5 trace: 1 #sec: 0 fragInf: 0
 H'0000011d H'00000a01 H'0001f780 H'00300500 H'00000001
--------------- Signal ----------------
r.bn: 247/3 "DBLQH", r.proc: 3, r.sigId: 336796590 gsn: 152 "COMMIT" prio: 1
s.bn: 247/1 "DBLQH", s.proc: 2, s.sigId: 242794994 length: 5 trace: 1 #sec: 0 fragInf: 0
 H'0000015e H'00000a01 H'0001f780 H'00300500 H'00000001

ndb_3_trace.log.1_t3

DBTUP   065233 065247 
DBACC   000081 000119 
DBLQH   002801 
DBTUP   010029 
DBLQH   002801 
DBTUP   010029 

--------------- Signal ----------------
r.bn: 247/2 "DBLQH", r.proc: 3, r.sigId: 286619747 gsn: 761 "STOP_FOR_CRASH" prio: 0
s.bn: 0 "SYS", s.proc: 0, s.sigId: 0 length: 1 trace: 0 #sec: 0 fragInf: 0
 H'00000000
--------------- Signal ----------------
r.bn: 248/2 "DBACC", r.proc: 3, r.sigId: 286619746 gsn: 164 "CONTINUEB" prio: 0
s.bn: 248/2 "DBACC", s.proc: 3, s.sigId: 285196946 length: 2 trace: 0 #sec: 0 fragInf: 0
 H'00000008 H'00000014
--------------- Signal ----------------
r.bn: 249/2 "DBTUP", r.proc: 3, r.sigId: 286619745 gsn: 164 "CONTINUEB" prio: 0
s.bn: 249/2 "DBTUP", s.proc: 3, s.sigId: 285196945 length: 2 trace: 0 #sec: 0 fragInf: 0
 H'00000008 H'00000014
--------------- Signal ----------------
r.bn: 248/2 "DBACC", r.proc: 3, r.sigId: 286619744 gsn: 253 "EXPANDCHECK2" prio: 1
s.bn: 248/2 "DBACC", s.proc: 3, s.sigId: 286619743 length: 3 trace: 1 #sec: 0 fragInf: 0
 H'00000016 H'00001ab7 H'00003fff
--------------- Signal ----------------
r.bn: 248/2 "DBACC", r.proc: 3, r.sigId: 286619743 gsn: 253 "EXPANDCHECK2" prio: 1
s.bn: 248/2 "DBACC", s.proc: 3, s.sigId: 286619742 length: 3 trace: 1 #sec: 0 fragInf: 0
 H'00000016 H'00001ab6 H'00003fff
--------------- Signal ----------------
r.bn: 248/2 "DBACC", r.proc: 3, r.sigId: 286619742 gsn: 253 "EXPANDCHECK2" prio: 1
s.bn: 248/2 "DBACC", s.proc: 3, s.sigId: 286619741 length: 3 trace: 1 #sec: 0 fragInf: 0
 H'00000016 H'00001ab5 H'00003fff
--------------- Signal ----------------
r.bn: 248/2 "DBACC", r.proc: 3, r.sigId: 286619741 gsn: 253 "EXPANDCHECK2" prio: 1
s.bn: 248/2 "DBACC", s.proc: 3, s.sigId: 286619740 length: 3 trace: 1 #sec: 0 fragInf: 0
 H'00000016 H'00001ab4 H'00003fff
--------------- Signal ----------------
r.bn: 248/2 "DBACC", r.proc: 3, r.sigId: 286619740 gsn: 253 "EXPANDCHECK2" prio: 1
s.bn: 248/2 "DBACC", s.proc: 3, s.sigId: 286619739 length: 3 trace: 1 #sec: 0 fragInf: 0
 H'00000016 H'00001ab3 H'00003fff
--------------- Signal ----------------

ndb_3_trace.log.1_t4

CMVMI   000238 
CMVMI   000238 
CMVMI   000238 
CMVMI   000238 
CMVMI   000238 
CMVMI   000238 
CMVMI   000238 

--------------- Signal ----------------
r.bn: 254 "CMVMI", r.proc: 3, r.sigId: 17526 gsn: 761 "STOP_FOR_CRASH" prio: 0
s.bn: 0 "SYS", s.proc: 0, s.sigId: 0 length: 1 trace: 0 #sec: 0 fragInf: 0
 H'00000000
--------------- Signal ----------------
r.bn: 254 "CMVMI", r.proc: 3, r.sigId: 17525 gsn: 247 "EVENT_REP" prio: 0
s.bn: 0 "SYS", s.proc: 3, s.sigId: 0 length: 3 trace: 0 #sec: 0 fragInf: 0
 H'00000029 H'00000002 H'000015b4
--------------- Signal ----------------
r.bn: 254 "CMVMI", r.proc: 3, r.sigId: 17524 gsn: 247 "EVENT_REP" prio: 1
s.bn: 0 "SYS", s.proc: 0, s.sigId: 0 length: 6 trace: 0 #sec: 0 fragInf: 0
 H'00000046 H'00000003 H'0000005a H'0000048f H'001e8427 H'00f42345
--------------- Signal ----------------
r.bn: 254 "CMVMI", r.proc: 3, r.sigId: 17523 gsn: 247 "EVENT_REP" prio: 1
s.bn: 0 "SYS", s.proc: 0, s.sigId: 0 length: 6 trace: 0 #sec: 0 fragInf: 0
 H'00000046 H'00000002 H'00000010 H'00000083 H'001e8471 H'00f42581
--------------- Signal ----------------

Current byte-offset of file-pointer is: 568                       

Time: Wednesday 14 January 2009 - 11:11:46
Status: Temporary error, restart node
Message: System error, node killed during node restart by other node (Internal error, programming error or missing error message, please report a bug)
Error: 2303
Error data: Node 3 killed this node because GCP stop was detected
Error object: NDBCNTR (Line: 255) 0x00000006
Program: /data0/cr_autotest/libexec/ndbd
Pid: 29614
Trace: ./ndb_3_trace.log.1
Version: mysql-5.1.30 ndb-6.4.1-beta
***EOM***

Hitting GCP STOP is easier in 6.4 if you use less then 8 data file for Disk Data. This is due to the new readers thread parameter "ThreadPool". (default = 8). If you are hitting "GCP STOP" in 6.4, you should adjust the "ThreadPool" to be = to the # of disk data files you have.

Tomas asked me to look into this for discussion with Jonas and documenting known issue(s) - I've put bug in Documenting status (at least for the time being) so it'll show up in my work queue and I don't forget.

Added info about GCP stop issues to docs, will appear at http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster-ndbd-definition.html#mysql-cluster-nd... and http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster-ndbd-definition.html#mysql-cluster-nd...

Also documented ThreadPool parameter (which wasn't previously documented)

Commit to docs is here: http://lists.mysql.com/commits/65603

Docs work for now being finished, I've returned this bug to Verified status for further handling by NDB devs/Support/QA.

please also see http://bugs.mysql.com/bug.php?id=41292

see also http://bugs.mysql.com/bug.php?id=43069

http://bugs.mysql.com/bug.php?id=43882
http://bugs.mysql.com/bug.php?id=39498

HI,

I got the same error in the production today. I am not using any disk tables, all my data is in memeory, but even then the mysql-cluster crashed.

Please find below the part of error logs,

2010-09-23 19:46:51 [MgmtSrvr] INFO     -- Node 2: Local checkpoint 12005 started. Keep GCI = 313401 oldest restorable GCI = 313734
2010-09-23 19:59:55 [MgmtSrvr] INFO     -- Node 2: Local checkpoint 12005 completed
2010-09-23 19:59:56 [MgmtSrvr] INFO     -- Node 2: Local checkpoint 12006 started. Keep GCI = 313783 oldest restorable GCI = 314002
2010-09-23 20:00:02 [MgmtSrvr] ALERT    -- Node 2: Node 55 Disconnected
2010-09-23 20:00:02 [MgmtSrvr] INFO     -- Node 2: Communication to Node 55 closed
2010-09-23 20:00:03 [MgmtSrvr] INFO     -- Mgmt server state: nodeid 55 freed, m_reserved_nodes 1, 53, 54, 56, 57 and 58.
2010-09-23 20:00:04 [MgmtSrvr] ALERT    -- Node 2: Node 58 Disconnected
2010-09-23 20:00:04 [MgmtSrvr] INFO     -- Node 2: Communication to Node 58 closed
2010-09-23 20:00:04 [MgmtSrvr] ALERT    -- Node 2: Node 57 Disconnected
2010-09-23 20:00:04 [MgmtSrvr] INFO     -- Node 2: Communication to Node 57 closed
2010-09-23 20:00:04 [MgmtSrvr] INFO     -- Mgmt server state: nodeid 58 freed, m_reserved_nodes 1, 53, 54, 56 and 57.
2010-09-23 20:00:04 [MgmtSrvr] INFO     -- Mgmt server state: nodeid 57 freed, m_reserved_nodes 1, 53, 54 and 56.
2010-09-23 20:00:06 [MgmtSrvr] INFO     -- Node 2: Communication to Node 55 opened
2010-09-23 20:00:07 [MgmtSrvr] INFO     -- Node 2: Communication to Node 58 opened
2010-09-23 20:00:07 [MgmtSrvr] ALERT    -- Node 2: Node 56 Disconnected
2010-09-23 20:00:07 [MgmtSrvr] INFO     -- Node 2: Communication to Node 56 closed
2010-09-23 20:00:08 [MgmtSrvr] INFO     -- Node 2: Communication to Node 57 opened
2010-09-23 20:00:08 [MgmtSrvr] INFO     -- Mgmt server state: nodeid 56 freed, m_reserved_nodes 1, 53 and 54.
2010-09-23 20:00:11 [MgmtSrvr] INFO     -- Node 2: Communication to Node 56 opened
2010-09-23 20:13:00 [MgmtSrvr] INFO     -- Node 2: Local checkpoint 12006 completed
2010-09-23 20:13:02 [MgmtSrvr] INFO     -- Node 2: Local checkpoint 12007 started. Keep GCI = 314165 oldest restorable GCI = 314166
2010-09-23 20:26:06 [MgmtSrvr] INFO     -- Node 2: Local checkpoint 12007 completed
2010-09-23 20:26:07 [MgmtSrvr] INFO     -- Node 2: Local checkpoint 12008 started. Keep GCI = 314550 oldest restorable GCI = 314171
2010-09-23 20:30:40 [MgmtSrvr] ALERT    -- Node 2: Node 54 Disconnected
2010-09-23 20:30:40 [MgmtSrvr] ALERT    -- Node 2: Node 53 Disconnected
2010-09-23 20:30:46 [MgmtSrvr] ALERT    -- Node 1: Node 2 Disconnected
2010-09-23 20:30:51 [MgmtSrvr] ALERT    -- Node 1: Node 2 Disconnected
2010-09-23 20:31:09 [MgmtSrvr] ALERT    -- Node 2: Forced node shutdown completed. Caused by error 2303: 'System error, node killed during node restart by other node(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2010-09-23 21:32:09 [MgmtSrvr] INFO     -- Mgmt server state: nodeid 54 freed, m_reserved_nodes 1 and 53.

Since my data was in memory, its all lost, before i could take the dump of it.

What could be the possible reason for this or is it related to this bug only.

Thanks,
Manu

Please try using version 7.1.10 or later, there have been added some new features relating to GCP Stop detection that should avoid these in the future.

http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster-news-5-1-51-ndb-7-1-10.html

Important Change: The following changes have been made with regard to the TimeBetweenEpochsTimeout data node configuration parameter:

* The maximum possible value for this parameter has been increased from 32000 milliseconds to 256000 milliseconds.
* Setting this parameter to zero now has the effect of disabling GCP stops caused by save timeouts, commit timeouts, or both.
* The current value of this parameter and a warning are written to the cluster log whenever a GCP save takes longer than 1 minute or a GCP commit takes longer than 10 seconds.

# Made it possible to exercise more direct control over handling of timeouts occurring when trying to flush redo logs to disk using two new data node configuration parameters RedoOverCommitCounter and RedoOverCommitLimit, as well as the new API node configuration parameter DefaultOperationRedoProblemAction, all added in this release. Now, when such timeouts occur more than a specified number of times for the flushing of a given redo log, any transactions that were to be written are instead aborted, and the operations contained in those transactions can be either re-tried or themselves aborted.

For more information, see Redo log over-commit handling.

http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster-ndbd-definition.html#mysql-cluster-re...

Duplicate of Bug #58383