| Bug #19179 | Stress: Loading 3 databases at ome time cause cluster to crash | ||
|---|---|---|---|
| Submitted: | 18 Apr 2006 21:38 | Modified: | 21 Apr 2006 4:46 |
| Reporter: | Jonathan Miller | Email Updates: | |
| Status: | Duplicate | Impact on me: | |
| Category: | MySQL Cluster: Cluster (NDB) storage engine | Severity: | S1 (Critical) |
| Version: | 5.1.10 | OS: | Linux (Linux 32 Bit OS) |
| Assigned to: | CPU Architecture: | Any | |
[18 Apr 2006 21:41]
Jonathan Miller
(0%) max=0B apply_gci=647 latest_gci=1405 2006-04-18 23:24:21 [MgmSrvr] WARNING -- Node 4: Transporter to node 7 reported error 0x16 2006-04-18 23:24:21 [MgmSrvr] WARNING -- Node 5: Transporter to node 6 reported error 0x16 2006-04-18 23:24:21 [MgmSrvr] WARNING -- Node 5: Transporter to node 6 reported error 0x16 - Repeated 2 times 2006-04-18 23:24:22 [MgmSrvr] WARNING -- Node 4: Transporter to node 7 reported error 0x16 2006-04-18 23:24:24 [MgmSrvr] WARNING -- Node 4: Transporter to node 7 reported error 0x16 - Repeated 22 times 2006-04-18 23:24:25 [MgmSrvr] INFO -- Node 8: Event buffer status: used=42KB(100%) alloc=42KB(0%) max=0B apply_gci=647 latest_gci=1406 2006-04-18 23:24:25 [MgmSrvr] WARNING -- Node 4: Transporter to node 7 reported error 0x16 2006-04-18 23:24:28 [MgmSrvr] WARNING -- Node 4: Transporter to node 7 reported error 0x16 - Repeated 18 times 2006-04-18 23:24:31 [MgmSrvr] WARNING -- Node 4: Transporter to node 7 reported error 0x16 - Repeated 19 times 2006-04-18 23:24:33 [MgmSrvr] WARNING -- Node 4: Transporter to node 7 reported error 0x16 - Repeated 13 times 2006-04-18 23:24:33 [MgmSrvr] INFO -- Node 2: Local checkpoint 31 started. Keep GCI = 1358 oldest restorable GCI = 1360 2006-04-18 23:24:33 [MgmSrvr] INFO -- Node 8: Event buffer status: used=42KB(100%) alloc=42KB(0%) max=0B apply_gci=647 latest_gci=1407 2006-04-18 23:24:33 [MgmSrvr] WARNING -- Node 4: Transporter to node 7 reported error 0x16 2006-04-18 23:24:34 [MgmSrvr] WARNING -- Node 4: Transporter to node 7 reported error 0x16 - Repeated 11 times 2006-04-18 23:24:35 [MgmSrvr] INFO -- Node 7: Event buffer status: used=33MB(100%) alloc=33MB(0%) max=0B apply_gci=1402 latest_gci=1403 2006-04-18 23:24:35 [MgmSrvr] WARNING -- Node 4: Transporter to node 7 reported error 0x16 2006-04-18 23:24:35 [MgmSrvr] WARNING -- Node 4: Transporter to node 7 reported error 0x16 - Repeated 5 times 2006-04-18 23:24:35 [MgmSrvr] INFO -- Node 6: Event buffer status: used=45MB(7%) alloc=45MB(0%) max=0B apply_gci=1401 latest_gci=1404 2006-04-18 23:24:36 [MgmSrvr] WARNING -- Node 4: Transporter to node 7 reported error 0x16 2006-04-18 23:24:36 [MgmSrvr] INFO -- Node 6: Event buffer status: used=45MB(8%) alloc=45MB..... . . . .
[18 Apr 2006 21:42]
Jonathan Miller
2006-04-18 23:26:40 [MgmSrvr] WARNING -- Node 3: Node 7 missed heartbeat 4 2006-04-18 23:26:40 [MgmSrvr] ALERT -- Node 3: Node 7 declared dead due to missed heartbeat 2006-04-18 23:26:40 [MgmSrvr] INFO -- Node 3: Communication to Node 7 closed 2006-04-18 23:26:40 [MgmSrvr] ALERT -- Node 3: Node 7 Disconnected 2006-04-18 23:26:41 [MgmSrvr] INFO -- Node 1: Node 2 Connected 2006-04-18 23:26:41 [MgmSrvr] ALERT -- Node 3: Node 2 Disconnected 2006-04-18 23:26:41 [MgmSrvr] INFO -- Node 3: Communication to Node 2 closed 2006-04-18 23:26:41 [MgmSrvr] ALERT -- Node 4: Node 2 Disconnected 2006-04-18 23:26:41 [MgmSrvr] INFO -- Node 4: Communication to Node 2 closed 2006-04-18 23:26:41 [MgmSrvr] ALERT -- Node 5: Node 2 Disconnected 2006-04-18 23:26:41 [MgmSrvr] INFO -- Node 5: Communication to Node 2 closed 2006-04-18 23:26:42 [MgmSrvr] ALERT -- Node 2: Forced node shutdown completed. Initiated by signal 0. Caused by error 2303: 'System error, node killed during node restart by other node(Internal error, programming error or missing error message, please report a bug). Temporary error, restart no 2006-04-18 23:26:42 [MgmSrvr] INFO -- Node 1: Node 3 Connected 2006-04-18 23:26:42 [MgmSrvr] ALERT -- Node 4: Node 3 Disconnected 2006-04-18 23:26:42 [MgmSrvr] INFO -- Node 4: Possible bug in Dbdih::execBLOCK_COMMIT_ORD c_blockCommit = 1 c_blockCommitNo = 2 sig->failNo = 2006-04-18 23:26:42 [MgmSrvr] INFO -- Node 4: Communication to Node 2 closed 2006-04-18 23:26:42 [MgmSrvr] INFO -- Node 4: Communication to Node 3 closed 2006-04-18 23:26:42 [MgmSrvr] ALERT -- Node 5: Node 3 Disconnected 2006-04-18 23:26:42 [MgmSrvr] INFO -- Node 5: Communication to Node 3 closed 2006-04-18 23:26:42 [MgmSrvr] INFO -- Node 5: Possible bug in Dbdih::execBLOCK_COMMIT_ORD c_blockCommit = 1 c_blockCommitNo = 2 sig->failNo = 2006-04-18 23:26:42 [MgmSrvr] INFO -- Node 5: Communication to Node 2 closed 2006-04-18 23:26:42 [MgmSrvr] INFO -- Node 5: Communication to Node 3 closed 2006-04-18 23:26:42 [MgmSrvr] ALERT -- Node 3: Forced node shutdown completed. Initiated by signal 0. Caused by error 2303: 'System error, node killed during node restart by other node(Internal error, programming error or missing error message, please report a bug). Temporary error, restart no 2006-04-18 23:26:43 [MgmSrvr] INFO -- Node 1: Node 4 Connected 2006-04-18 23:26:43 [MgmSrvr] ALERT -- Node 4: Forced node shutdown completed. Initiated by signal 0. Caused by error 2305: 'Arbitrator shutdown, please investigate error(s) on other node(s)(Arbitration error). Temporary error, restart node'. 2006-04-18 23:26:44 [MgmSrvr] INFO -- Node 1: Node 5 Connected 2006-04-18 23:26:44 [MgmSrvr] INFO -- Mgmt server state: nodeid 9 freed, m_reserved_nodes 00000000000001c2. 2006-04-18 23:26:44 [MgmSrvr] ALERT -- Node 5: Forced node shutdown completed. Initiated by signal 0. Caused by error 2305: 'Arbitrator shutdown, please investigate error(s) on other node(s)(Arbitration error). Temporary error, restart node'.
[18 Apr 2006 21:46]
Jonathan Miller
Trying to recover: Node 4: Forced node shutdown completed. Occured during startphase 4. Initiated by signal 0. Caused by error 2815: 'File not found(Ndbd file system inconsistency error, please report a bug). Ndbd file system error, restart node initial'. Node 2: Forced node shutdown completed. Occured during startphase 4. Initiated by signal 0. Caused by error 2308: 'Another node failed during system restart, please investigate error(s) on other node(s)(Restart error). Temporary error, restart node'. Node 3: Forced node shutdown completed. Occured during startphase 4. Initiated by signal 0. Caused by error 2308: 'Another node failed during system restart, please investigate error(s) on other node(s)(Restart error). Temporary error, restart node'. Node 5: Forced node shutdown completed. Occured during startphase 4. Initiated by signal 0. Caused by error 2308: 'Another node failed during system restart, please investigate error(s) on other node(s)(Restart error). Temporary error, restart node'.
[18 Apr 2006 21:47]
Jonathan Miller
Time: Tuesday 18 April 2006 - 23:40:57 Status: Ndbd file system error, restart node initial Message: File not found (Ndbd file system inconsistency error, please report a bug) Error: 2815 Error data: DBLQH: File system close failed. OS errno: 4294967295 Error object: DBLQH (Line: 1821) 0x0000000a Program: /home/ndbdev/jmiller/builds/libexec/ndbd Pid: 22413 Trace: /space/run/ndb_4_trace.log.2 Version: Version 5.1.10 (beta) ***EOM*** "ndb_4_error.log" 27L, 1067C 1,1 Top
[18 Apr 2006 21:48]
Jonathan Miller
--------------- Signal ---------------- r.bn: 247 "DBLQH", r.proc: 4, r.sigId: 1434885 gsn: 270 "FSWRITECONF" prio: 1 s.bn: 253 "NDBFS", s.proc: 4, s.sigId: 1434884 length: 1 trace: 0 #sec: 0 fragInf: 0 UserPointer: 2 --------------- Signal ---------------- r.bn: 253 "NDBFS", r.proc: 4, r.sigId: 1434884 gsn: 164 "CONTINUEB" prio: 1 s.bn: 253 "NDBFS", s.proc: 4, s.sigId: 1434880 length: 1 trace: 0 #sec: 0 fragInf: 0 Scanning the memory channel again with no delay --------------- Signal ---------------- r.bn: 253 "NDBFS", r.proc: 4, r.sigId: 1434883 gsn: 257 "FSCLOSEREQ" prio: 0 s.bn: 247 "DBLQH", s.proc: 4, s.sigId: 1434882 length: 5 trace: 0 #sec: 0 fragInf: 0 UserPointer: 155 FilePointer: 672 UserReference: H'00f70004 Flags: H'00000000, Don't remove file H'000002a0 H'00f70004 H'0000009b H'00000000 H'00003ab0 --------------- Signal ----------------
[19 Apr 2006 6:53]
Jonas Oreland
Can you get it repeatable on your hw ?
If no, then this is a duplicate of #13744 which has been open forever and we never manage
to reproduce it.
If yes, then it's very interesting...
[21 Apr 2006 4:46]
Jonas Oreland
Duplicate of bug#13744

Description: Hi! I was loading TPC-B and TPC-C and atae all at the same time. Received the following from the programs that were loading: DBD::mysql::st execute failed: Got temporary error 4010 'Node failure caused abort of transaction' from NDBCLUSTER at load_tpcb.pl line 580. insert into account Error: Got temporary error 4010 'Node failure caused abort of transaction' from NDBCLUSTER at load_tpcb.pl line 580. Error in ndbtest.cpp, line: 1661, code: 4010, msg: Node failure caused abort of transaction. ERROR 1296 (HY000) at line 1: Got error 4116 'Operation was not defined correctly, probably missing a key' from NDBCLUSTER ERROR: rc=1 SCRIPT INTERRUPTED Error Logs: Time: Tuesday 18 April 2006 - 23:21:56 Status: Temporary error, restart node Message: System error, node killed during node restart by other node (Internal error, programming error or missing error message, please report a bug) Error: 2303 Error data: Node 2 killed this node because GCP stop was detected Error object: NDBCNTR (Line: 229) 0x0000000a Program: /home/ndbdev/jmiller/builds/libexec/ndbd Pid: 21845 Trace: /space/run/ndb_2_trace.log.1 Version: Version 5.1.10 (beta) --------------- Signal ---------------- r.bn: 251 "NDBCNTR", r.proc: 2, r.sigId: 14652818 gsn: 395 "SYSTEM_ERROR" prio: 0 s.bn: 246 "DBDIH", s.proc: 2, s.sigId: 14652816 length: 4 trace: 0 #sec: 0 fragInf: 0 errorRef: H'00f60002 errorCode: 3 data1: H'00000002 data2: H'00000000 --------------- Signal ---------------- r.bn: 260 "LGMAN", r.proc: 2, r.sigId: 14652817 gsn: 164 "CONTINUEB" prio: 0 s.bn: 260 "LGMAN", s.proc: 2, s.sigId: 14652815 length: 3 trace: 0 #sec: 0 fragInf: 0 H'00000002 H'00004000 H'00000001 --------------- Signal ---------------- r.bn: 246 "DBDIH", r.proc: 2, r.sigId: 14652816 gsn: 164 "CONTINUEB" prio: 0 s.bn: 246 "DBDIH", s.proc: 2, s.sigId: 14652814 length: 6 trace: 0 #sec: 0 fragInf: 0 Check GCP Stop coldGcpStatus = 2 cgcpStatus = 2 coldGcpId = 1409 cnewgcp = 1409 cgcpSameCounter = 1199 --------------- Signal ---------------- r.bn: 249 "DBTUP", r.proc: 2, r.sigId: 14652813 gsn: 332 "NEXT_SCANREQ" prio: 1 s.bn: 247 "DBLQH", s.proc: 2, s.sigId: 14652812 length: 3 trace: 0 #sec: 0 fragInf: 0 H'00000001 H'ffffffff H'00000002 --------------- Signal ---------------- Time: Tuesday 18 April 2006 - 23:23:19 Status: Temporary error, restart node Message: System error, node killed during node restart by other node (Internal error, programming error or missing error message, please report a bug) Error: 2303 Error data: Nodefailure while out of event buffer Error object: SUMA (Line: 4686) 0x0000000a Program: /home/ndbdev/jmiller/builds/libexec/ndbd Pid: 22159 Trace: /space/run/ndb_3_trace.log.1 Version: Version 5.1.10 (beta) --------------- Signal ---------------- r.bn: 257 "SUMA", r.proc: 3, r.sigId: 19602614 gsn: 26 "NODE_FAILREP" prio: 1 s.bn: 251 "NDBCNTR", s.proc: 3, s.sigId: 19602608 length: 5 trace: 8 #sec: 0 fragInf: 0 H'00000002 H'00000003 H'00000001 H'00000004 H'00000000 --------------- Signal ---------------- r.bn: 244 "BACKUP", r.proc: 3, r.sigId: 19602613 gsn: 26 "NODE_FAILREP" prio: 1 s.bn: 251 "NDBCNTR", s.proc: 3, s.sigId: 19602608 length: 5 trace: 8 #sec: 0 fragInf: 0 H'00000002 H'00000003 H'00000001 H'00000004 H'00000000 --------------- Signal ---------------- r.bn: 250 "DBDICT", r.proc: 3, r.sigId: 19602612 gsn: 26 "NODE_FAILREP" prio: 1 s.bn: 251 "NDBCNTR", s.proc: 3, s.sigId: 19602608 length: 5 trace: 8 #sec: 0 fragInf: 0 H'00000002 H'00000003 H'00000001 H'00000004 H'00000000 --------------- Signal ---------------- r.bn: 246 "DBDIH", r.proc: 3, r.sigId: 19602611 gsn: 26 "NODE_FAILREP" prio: 1 s.bn: 251 "NDBCNTR", s.proc: 3, s.sigId: 19602608 length: 5 trace: 8 #sec: 0 fragInf: 0 H'00000002 H'00000003 H'00000001 H'00000004 H'00000000 --------------- Signal ---------------- r.bn: 247 "DBLQH", r.proc: 3, r.sigId: 19602610 gsn: 26 "NODE_FAILREP" prio: 1 s.bn: 251 "NDBCNTR", s.proc: 3, s.sigId: 19602608 length: 5 trace: 8 #sec: 0 fragInf: 0 H'00000002 H'00000003 H'00000001 H'00000004 H'00000000 --------------- Signal ---------------- r.bn: 245 "DBTC", r.proc: 3, r.sigId: 19602609 gsn: 26 "NODE_FAILREP" prio: 1 s.bn: 251 "NDBCNTR", s.proc: 3, s.sigId: 19602608 length: 5 trace: 8 #sec: 0 fragInf: 0 H'00000002 H'00000003 H'00000001 H'00000004 H'00000000 --------------- Signal ---------------- r.bn: 251 "NDBCNTR", r.proc: 3, r.sigId: 19602608 gsn: 26 "NODE_FAILREP" prio: 1 s.bn: 252 "QMGR", s.proc: 3, s.sigId: 19602604 length: 5 trace: 8 #sec: 0 fragInf: 0 H'00000002 H'00000003 H'00000001 H'00000004 H'00000000 --------------- Signal ---------------- r.bn: 254 "CMVMI", r.proc: 3, r.sigId: 19602607 gsn: 247 "EVENT_REP" prio: 1 s.bn: 252 "QMGR", s.proc: 3, s.sigId: 19602603 length: 7 trace: 8 #sec: 0 fragInf: 0 H'0000001d H'0006000b H'00000001 H'55550001 H'aeac3719 H'00000000 H'00000000 --------------- Signal ---------------- r.bn: 254 "CMVMI", r.proc: 3, r.sigId: 19602606 gsn: 247 "EVENT_REP" prio: 1 s.bn: 252 "QMGR", s.proc: 3, s.sigId: 19602603 length: 7 trace: 8 #sec: 0 fragInf: 0 H'0000001e H'0006002b H'00000001 H'55550001 H'aeac3719 H'00000000 H'00000000 --------------- Signal ---------------- r.bn: 252 "QMGR", r.proc: 3, r.sigId: 19602605 gsn: 153 "COMMIT_FAILCONF" prio: 0 s.bn: 252 "QMGR", s.proc: 3, s.sigId: 19602604 length: 1 trace: 8 #sec: 0 fragInf: 0 H'00000003 --------------- Signal ---------------- r.bn: 252 "QMGR", r.proc: 3, r.sigId: 19602604 gsn: 154 "COMMIT_FAILREQ" prio: 0 s.bn: 252 "QMGR", s.proc: 3, s.sigId: 19602603 length: 2 trace: 8 #sec: 0 fragInf: 0 H'00fc0003 H'00000002 --------------- Signal ---------------- r.bn: 252 "QMGR", r.proc: 3, r.sigId: 19602603 gsn: 343 "PREP_FAILCONF" prio: 0 s.bn: 252 "QMGR", s.proc: 4, s.sigId: 8740999 length: 2 trace: 8 #sec: 0 fragInf: 0 H'00000004 H'00000002 --------------- Signal ---------------- How to repeat: Setup: Cluster Config: Host:07 - MySQLD & ndb_mgmd Host:08 - MySQLD Host:09 - MySQLD Host:13 - NDBD & NDBD Host:14 - NDBD & NDBD run: $> perl load_tpcb.pl -e ndbmix -ho {host name (Host:07)} $> ./mysql_load_db.sh -d dbt2 -f /space/var -e NDB -h {host name(Host:08)} -u root -m ORIG $> time ./ndbtest \ -mysqlUser=root \ -mysqldSocket=/tmp/mysql3306.sock \ -database=atae \ -ndbConnectString='host=host:port \ -valueFile=dbcolumns2segs.txt \ -#traceValueFile \ -#insertStartRecordNum=0 -insertNumRecords=50000000 \ -insertStartRecordNum=0 -insertNumRecords=5000000 \ -maxOperationsPerTxn=10000