Bug #19179 Stress: Loading 3 databases at ome time cause cluster to crash
Submitted: 18 Apr 2006 21:38 Modified: 21 Apr 2006 4:46
Reporter: Jonathan Miller Email Updates:
Status: Duplicate Impact on me:
None 
Category:MySQL Cluster: Cluster (NDB) storage engine Severity:S1 (Critical)
Version:5.1.10 OS:Linux (Linux 32 Bit OS)
Assigned to: CPU Architecture:Any

[18 Apr 2006 21:38] Jonathan Miller
Description:
Hi! I was loading TPC-B and TPC-C and atae all at the same time. Received the following from the programs that were loading:

DBD::mysql::st execute failed: Got temporary error 4010 'Node failure caused abort of transaction' from NDBCLUSTER at load_tpcb.pl line 580.
insert into account Error: Got temporary error 4010 'Node failure caused abort of transaction' from NDBCLUSTER at load_tpcb.pl line 580.

 Error in ndbtest.cpp, line: 1661, code: 4010, msg: Node failure caused abort of transaction.

ERROR 1296 (HY000) at line 1: Got error 4116 'Operation was not defined correctly, probably missing a key' from NDBCLUSTER
ERROR: rc=1
SCRIPT INTERRUPTED

Error Logs:

Time: Tuesday 18 April 2006 - 23:21:56
Status: Temporary error, restart node
Message: System error, node killed during node restart by other node (Internal error, programming error or missing error message, please report a bug)
Error: 2303
Error data: Node 2 killed this node because GCP stop was detected
Error object: NDBCNTR (Line: 229) 0x0000000a
Program: /home/ndbdev/jmiller/builds/libexec/ndbd
Pid: 21845
Trace: /space/run/ndb_2_trace.log.1
Version: Version 5.1.10 (beta)

--------------- Signal ----------------
r.bn: 251 "NDBCNTR", r.proc: 2, r.sigId: 14652818 gsn: 395 "SYSTEM_ERROR" prio: 0
s.bn: 246 "DBDIH", s.proc: 2, s.sigId: 14652816 length: 4 trace: 0 #sec: 0 fragInf: 0
errorRef: H'00f60002
errorCode: 3
data1: H'00000002
data2: H'00000000
--------------- Signal ----------------
r.bn: 260 "LGMAN", r.proc: 2, r.sigId: 14652817 gsn: 164 "CONTINUEB" prio: 0
s.bn: 260 "LGMAN", s.proc: 2, s.sigId: 14652815 length: 3 trace: 0 #sec: 0 fragInf: 0
 H'00000002 H'00004000 H'00000001
--------------- Signal ----------------
r.bn: 246 "DBDIH", r.proc: 2, r.sigId: 14652816 gsn: 164 "CONTINUEB" prio: 0
s.bn: 246 "DBDIH", s.proc: 2, s.sigId: 14652814 length: 6 trace: 0 #sec: 0 fragInf: 0
 Check GCP Stop
coldGcpStatus   = 2
cgcpStatus      = 2
coldGcpId       = 1409
cnewgcp         = 1409
cgcpSameCounter = 1199
--------------- Signal ----------------
r.bn: 249 "DBTUP", r.proc: 2, r.sigId: 14652813 gsn: 332 "NEXT_SCANREQ" prio: 1
s.bn: 247 "DBLQH", s.proc: 2, s.sigId: 14652812 length: 3 trace: 0 #sec: 0 fragInf: 0
 H'00000001 H'ffffffff H'00000002
--------------- Signal ----------------

Time: Tuesday 18 April 2006 - 23:23:19
Status: Temporary error, restart node
Message: System error, node killed during node restart by other node (Internal error, programming error or missing error message, please report a bug)
Error: 2303
Error data: Nodefailure while out of event buffer
Error object: SUMA (Line: 4686) 0x0000000a
Program: /home/ndbdev/jmiller/builds/libexec/ndbd
Pid: 22159
Trace: /space/run/ndb_3_trace.log.1
Version: Version 5.1.10 (beta)

--------------- Signal ----------------
r.bn: 257 "SUMA", r.proc: 3, r.sigId: 19602614 gsn: 26 "NODE_FAILREP" prio: 1
s.bn: 251 "NDBCNTR", s.proc: 3, s.sigId: 19602608 length: 5 trace: 8 #sec: 0 fragInf: 0
 H'00000002 H'00000003 H'00000001 H'00000004 H'00000000
--------------- Signal ----------------
r.bn: 244 "BACKUP", r.proc: 3, r.sigId: 19602613 gsn: 26 "NODE_FAILREP" prio: 1
s.bn: 251 "NDBCNTR", s.proc: 3, s.sigId: 19602608 length: 5 trace: 8 #sec: 0 fragInf: 0
 H'00000002 H'00000003 H'00000001 H'00000004 H'00000000
--------------- Signal ----------------
r.bn: 250 "DBDICT", r.proc: 3, r.sigId: 19602612 gsn: 26 "NODE_FAILREP" prio: 1
s.bn: 251 "NDBCNTR", s.proc: 3, s.sigId: 19602608 length: 5 trace: 8 #sec: 0 fragInf: 0
 H'00000002 H'00000003 H'00000001 H'00000004 H'00000000
--------------- Signal ----------------
r.bn: 246 "DBDIH", r.proc: 3, r.sigId: 19602611 gsn: 26 "NODE_FAILREP" prio: 1
s.bn: 251 "NDBCNTR", s.proc: 3, s.sigId: 19602608 length: 5 trace: 8 #sec: 0 fragInf: 0
 H'00000002 H'00000003 H'00000001 H'00000004 H'00000000
--------------- Signal ----------------
r.bn: 247 "DBLQH", r.proc: 3, r.sigId: 19602610 gsn: 26 "NODE_FAILREP" prio: 1
s.bn: 251 "NDBCNTR", s.proc: 3, s.sigId: 19602608 length: 5 trace: 8 #sec: 0 fragInf: 0
 H'00000002 H'00000003 H'00000001 H'00000004 H'00000000
--------------- Signal ----------------
r.bn: 245 "DBTC", r.proc: 3, r.sigId: 19602609 gsn: 26 "NODE_FAILREP" prio: 1
s.bn: 251 "NDBCNTR", s.proc: 3, s.sigId: 19602608 length: 5 trace: 8 #sec: 0 fragInf: 0
 H'00000002 H'00000003 H'00000001 H'00000004 H'00000000
--------------- Signal ----------------
r.bn: 251 "NDBCNTR", r.proc: 3, r.sigId: 19602608 gsn: 26 "NODE_FAILREP" prio: 1
s.bn: 252 "QMGR", s.proc: 3, s.sigId: 19602604 length: 5 trace: 8 #sec: 0 fragInf: 0
 H'00000002 H'00000003 H'00000001 H'00000004 H'00000000
--------------- Signal ----------------
r.bn: 254 "CMVMI", r.proc: 3, r.sigId: 19602607 gsn: 247 "EVENT_REP" prio: 1
s.bn: 252 "QMGR", s.proc: 3, s.sigId: 19602603 length: 7 trace: 8 #sec: 0 fragInf: 0
 H'0000001d H'0006000b H'00000001 H'55550001 H'aeac3719 H'00000000 H'00000000
--------------- Signal ----------------
r.bn: 254 "CMVMI", r.proc: 3, r.sigId: 19602606 gsn: 247 "EVENT_REP" prio: 1
s.bn: 252 "QMGR", s.proc: 3, s.sigId: 19602603 length: 7 trace: 8 #sec: 0 fragInf: 0
 H'0000001e H'0006002b H'00000001 H'55550001 H'aeac3719 H'00000000 H'00000000
--------------- Signal ----------------
r.bn: 252 "QMGR", r.proc: 3, r.sigId: 19602605 gsn: 153 "COMMIT_FAILCONF" prio: 0
s.bn: 252 "QMGR", s.proc: 3, s.sigId: 19602604 length: 1 trace: 8 #sec: 0 fragInf: 0
 H'00000003
--------------- Signal ----------------
r.bn: 252 "QMGR", r.proc: 3, r.sigId: 19602604 gsn: 154 "COMMIT_FAILREQ" prio: 0
s.bn: 252 "QMGR", s.proc: 3, s.sigId: 19602603 length: 2 trace: 8 #sec: 0 fragInf: 0
 H'00fc0003 H'00000002
--------------- Signal ----------------
r.bn: 252 "QMGR", r.proc: 3, r.sigId: 19602603 gsn: 343 "PREP_FAILCONF" prio: 0
s.bn: 252 "QMGR", s.proc: 4, s.sigId: 8740999 length: 2 trace: 8 #sec: 0 fragInf: 0
 H'00000004 H'00000002
--------------- Signal ----------------

How to repeat:
Setup:

Cluster Config:
Host:07 - MySQLD & ndb_mgmd
Host:08 - MySQLD
Host:09 - MySQLD
Host:13 - NDBD & NDBD
Host:14 - NDBD & NDBD

run:
$> perl load_tpcb.pl -e ndbmix -ho {host name (Host:07)}
$> ./mysql_load_db.sh -d dbt2 -f /space/var -e NDB -h {host name(Host:08)} -u root -m ORIG

$> time ./ndbtest \
  -mysqlUser=root \
  -mysqldSocket=/tmp/mysql3306.sock \
  -database=atae \
  -ndbConnectString='host=host:port \
  -valueFile=dbcolumns2segs.txt \
  -#traceValueFile \
  -#insertStartRecordNum=0 -insertNumRecords=50000000 \
  -insertStartRecordNum=0 -insertNumRecords=5000000 \
  -maxOperationsPerTxn=10000
[18 Apr 2006 21:41] Jonathan Miller
(0%) max=0B apply_gci=647 latest_gci=1405
2006-04-18 23:24:21 [MgmSrvr] WARNING  -- Node 4: Transporter to node 7 reported error 0x16
2006-04-18 23:24:21 [MgmSrvr] WARNING  -- Node 5: Transporter to node 6 reported error 0x16
2006-04-18 23:24:21 [MgmSrvr] WARNING  -- Node 5: Transporter to node 6 reported error 0x16 - Repeated 2 times
2006-04-18 23:24:22 [MgmSrvr] WARNING  -- Node 4: Transporter to node 7 reported error 0x16
2006-04-18 23:24:24 [MgmSrvr] WARNING  -- Node 4: Transporter to node 7 reported error 0x16 - Repeated 22 times
2006-04-18 23:24:25 [MgmSrvr] INFO     -- Node 8: Event buffer status: used=42KB(100%) alloc=42KB(0%) max=0B apply_gci=647 latest_gci=1406
2006-04-18 23:24:25 [MgmSrvr] WARNING  -- Node 4: Transporter to node 7 reported error 0x16
2006-04-18 23:24:28 [MgmSrvr] WARNING  -- Node 4: Transporter to node 7 reported error 0x16 - Repeated 18 times
2006-04-18 23:24:31 [MgmSrvr] WARNING  -- Node 4: Transporter to node 7 reported error 0x16 - Repeated 19 times
2006-04-18 23:24:33 [MgmSrvr] WARNING  -- Node 4: Transporter to node 7 reported error 0x16 - Repeated 13 times
2006-04-18 23:24:33 [MgmSrvr] INFO     -- Node 2: Local checkpoint 31 started. Keep GCI = 1358 oldest restorable GCI = 1360
2006-04-18 23:24:33 [MgmSrvr] INFO     -- Node 8: Event buffer status: used=42KB(100%) alloc=42KB(0%) max=0B apply_gci=647 latest_gci=1407
2006-04-18 23:24:33 [MgmSrvr] WARNING  -- Node 4: Transporter to node 7 reported error 0x16
2006-04-18 23:24:34 [MgmSrvr] WARNING  -- Node 4: Transporter to node 7 reported error 0x16 - Repeated 11 times
2006-04-18 23:24:35 [MgmSrvr] INFO     -- Node 7: Event buffer status: used=33MB(100%) alloc=33MB(0%) max=0B apply_gci=1402 latest_gci=1403
2006-04-18 23:24:35 [MgmSrvr] WARNING  -- Node 4: Transporter to node 7 reported error 0x16
2006-04-18 23:24:35 [MgmSrvr] WARNING  -- Node 4: Transporter to node 7 reported error 0x16 - Repeated 5 times
2006-04-18 23:24:35 [MgmSrvr] INFO     -- Node 6: Event buffer status: used=45MB(7%) alloc=45MB(0%) max=0B apply_gci=1401 latest_gci=1404
2006-04-18 23:24:36 [MgmSrvr] WARNING  -- Node 4: Transporter to node 7 reported error 0x16
2006-04-18 23:24:36 [MgmSrvr] INFO     -- Node 6: Event buffer status: used=45MB(8%) alloc=45MB.....
.
.
.
.
[18 Apr 2006 21:42] Jonathan Miller
2006-04-18 23:26:40 [MgmSrvr] WARNING  -- Node 3: Node 7 missed heartbeat 4
2006-04-18 23:26:40 [MgmSrvr] ALERT    -- Node 3: Node 7 declared dead due to missed heartbeat
2006-04-18 23:26:40 [MgmSrvr] INFO     -- Node 3: Communication to Node 7 closed
2006-04-18 23:26:40 [MgmSrvr] ALERT    -- Node 3: Node 7 Disconnected
2006-04-18 23:26:41 [MgmSrvr] INFO     -- Node 1: Node 2 Connected
2006-04-18 23:26:41 [MgmSrvr] ALERT    -- Node 3: Node 2 Disconnected
2006-04-18 23:26:41 [MgmSrvr] INFO     -- Node 3: Communication to Node 2 closed
2006-04-18 23:26:41 [MgmSrvr] ALERT    -- Node 4: Node 2 Disconnected
2006-04-18 23:26:41 [MgmSrvr] INFO     -- Node 4: Communication to Node 2 closed
2006-04-18 23:26:41 [MgmSrvr] ALERT    -- Node 5: Node 2 Disconnected
2006-04-18 23:26:41 [MgmSrvr] INFO     -- Node 5: Communication to Node 2 closed
2006-04-18 23:26:42 [MgmSrvr] ALERT    -- Node 2: Forced node shutdown completed. Initiated by signal 0. Caused by error 2303: 'System error, node killed during node restart by other node(Internal error, programming error or missing error message, please report a bug). Temporary error, restart no
2006-04-18 23:26:42 [MgmSrvr] INFO     -- Node 1: Node 3 Connected
2006-04-18 23:26:42 [MgmSrvr] ALERT    -- Node 4: Node 3 Disconnected
2006-04-18 23:26:42 [MgmSrvr] INFO     -- Node 4: Possible bug in Dbdih::execBLOCK_COMMIT_ORD c_blockCommit = 1 c_blockCommitNo = 2 sig->failNo =
2006-04-18 23:26:42 [MgmSrvr] INFO     -- Node 4: Communication to Node 2 closed
2006-04-18 23:26:42 [MgmSrvr] INFO     -- Node 4: Communication to Node 3 closed
2006-04-18 23:26:42 [MgmSrvr] ALERT    -- Node 5: Node 3 Disconnected
2006-04-18 23:26:42 [MgmSrvr] INFO     -- Node 5: Communication to Node 3 closed
2006-04-18 23:26:42 [MgmSrvr] INFO     -- Node 5: Possible bug in Dbdih::execBLOCK_COMMIT_ORD c_blockCommit = 1 c_blockCommitNo = 2 sig->failNo =
2006-04-18 23:26:42 [MgmSrvr] INFO     -- Node 5: Communication to Node 2 closed
2006-04-18 23:26:42 [MgmSrvr] INFO     -- Node 5: Communication to Node 3 closed
2006-04-18 23:26:42 [MgmSrvr] ALERT    -- Node 3: Forced node shutdown completed. Initiated by signal 0. Caused by error 2303: 'System error, node killed during node restart by other node(Internal error, programming error or missing error message, please report a bug). Temporary error, restart no
2006-04-18 23:26:43 [MgmSrvr] INFO     -- Node 1: Node 4 Connected
2006-04-18 23:26:43 [MgmSrvr] ALERT    -- Node 4: Forced node shutdown completed. Initiated by signal 0. Caused by error 2305: 'Arbitrator shutdown, please investigate error(s) on other node(s)(Arbitration error). Temporary error, restart node'.
2006-04-18 23:26:44 [MgmSrvr] INFO     -- Node 1: Node 5 Connected
2006-04-18 23:26:44 [MgmSrvr] INFO     -- Mgmt server state: nodeid 9 freed, m_reserved_nodes 00000000000001c2.
2006-04-18 23:26:44 [MgmSrvr] ALERT    -- Node 5: Forced node shutdown completed. Initiated by signal 0. Caused by error 2305: 'Arbitrator shutdown, please investigate error(s) on other node(s)(Arbitration error). Temporary error, restart node'.
[18 Apr 2006 21:46] Jonathan Miller
Trying to recover:
Node 4: Forced node shutdown completed. Occured during startphase 4. Initiated by signal 0. Caused by error 2815: 'File not found(Ndbd file system inconsistency error, please report a bug). Ndbd file system error, restart node initial'.
Node 2: Forced node shutdown completed. Occured during startphase 4. Initiated by signal 0. Caused by error 2308: 'Another node failed during system restart, please investigate error(s) on other node(s)(Restart error). Temporary error, restart node'.
Node 3: Forced node shutdown completed. Occured during startphase 4. Initiated by signal 0. Caused by error 2308: 'Another node failed during system restart, please investigate error(s) on other node(s)(Restart error). Temporary error, restart node'.
Node 5: Forced node shutdown completed. Occured during startphase 4. Initiated by signal 0. Caused by error 2308: 'Another node failed during system restart, please investigate error(s) on other node(s)(Restart error). Temporary error, restart node'.
[18 Apr 2006 21:47] Jonathan Miller
Time: Tuesday 18 April 2006 - 23:40:57
Status: Ndbd file system error, restart node initial
Message: File not found (Ndbd file system inconsistency error, please report a bug)
Error: 2815
Error data: DBLQH: File system close failed. OS errno: 4294967295
Error object: DBLQH (Line: 1821) 0x0000000a
Program: /home/ndbdev/jmiller/builds/libexec/ndbd
Pid: 22413
Trace: /space/run/ndb_4_trace.log.2
Version: Version 5.1.10 (beta)
***EOM***
"ndb_4_error.log" 27L, 1067C                                                   1,1           Top
[18 Apr 2006 21:48] Jonathan Miller
--------------- Signal ----------------
r.bn: 247 "DBLQH", r.proc: 4, r.sigId: 1434885 gsn: 270 "FSWRITECONF" prio: 1
s.bn: 253 "NDBFS", s.proc: 4, s.sigId: 1434884 length: 1 trace: 0 #sec: 0 fragInf: 0
 UserPointer: 2
--------------- Signal ----------------
r.bn: 253 "NDBFS", r.proc: 4, r.sigId: 1434884 gsn: 164 "CONTINUEB" prio: 1
s.bn: 253 "NDBFS", s.proc: 4, s.sigId: 1434880 length: 1 trace: 0 #sec: 0 fragInf: 0
 Scanning the memory channel again with no delay
--------------- Signal ----------------
r.bn: 253 "NDBFS", r.proc: 4, r.sigId: 1434883 gsn: 257 "FSCLOSEREQ" prio: 0
s.bn: 247 "DBLQH", s.proc: 4, s.sigId: 1434882 length: 5 trace: 0 #sec: 0 fragInf: 0
 UserPointer: 155
 FilePointer: 672
 UserReference: H'00f70004
 Flags: H'00000000, Don't remove file
 H'000002a0 H'00f70004 H'0000009b H'00000000 H'00003ab0
--------------- Signal ----------------
[19 Apr 2006 6:53] Jonas Oreland
Can you get it repeatable on your hw ?

If no, then this is a duplicate of #13744 which has been open forever and we never manage
         to reproduce it.

If yes, then it's very interesting...
[21 Apr 2006 4:46] Jonas Oreland
Duplicate of bug#13744