Bug #37227 | Node 3 killed this node because GCP stop was detected | ||
---|---|---|---|
Submitted: | 5 Jun 2008 15:08 | Modified: | 7 Jun 2011 13:54 |
Reporter: | Mark Malakanov | Email Updates: | |
Status: | Duplicate | Impact on me: | |
Category: | MySQL Cluster: Cluster (NDB) storage engine | Severity: | S2 (Serious) |
Version: | mysql-5.1-telco-6.2 | OS: | Linux (F8, 2.6.24.7-92.fc8 SMP x86_64) |
Assigned to: | CPU Architecture: | Any | |
Tags: | GCP, mysql-5.1.23 ndb-6.2.15 -> ndb-6.4.0 |
[5 Jun 2008 15:08]
Mark Malakanov
[5 Jun 2008 15:15]
Mark Malakanov
ndb config
Attachment: my-ndb.cfg (text/plain), 4.34 KiB.
[5 Jun 2008 15:16]
Mark Malakanov
logs
Attachment: nb_3_logs.zip (application/x-zip-compressed, text), 357.08 KiB.
[25 Jun 2008 8:31]
Feelora Gee
I meet the same question. hey, how to fixed it?
[30 Jun 2008 10:07]
Jonas Oreland
Hi, We have found this at a customer site aswell, and suspect that it's ntp causing problem. Therefor we changed to using CLOCK_MONOTONIC, this patch is in 6.3.16 but we have made no new 6.2.x-release since. Could any/both of you investigate if this could be a cause? /Jonas
[30 Jul 2008 23:00]
Bugs System
No feedback was provided for this bug for over a month, so it is being suspended automatically. If you are able to provide the information that was originally requested, please do so and change the status of the bug back to "Open".
[14 Nov 2008 20:21]
Bogdan Kecman
same problem on customer site: server1: datanode1 + datanode3 server2: datanode2 + datanode4 server3: mgm node + api node server4: mgm node + api node server5-x: api node restart server2 and datanode1 dies with: Error: 2303 Error data: Node 1 killed this node because GCP stop was detected Error object: NDBCNTR (Line: 235) 0x00000002 If you kill -9 all ndbd processes on server2 problem does not occur but shutdown -rf now get's it every time. ALL servers have time synchronised (ntpd)
[16 Nov 2008 12:57]
Geert Vanderkelen
Different error, but similar effect..
Attachment: ndb_error_report_20081116135307.tar.bz2 (application/octet-stream, text), 49.33 KiB.
[16 Nov 2008 13:02]
Geert Vanderkelen
Similar effect, but different error, using MySQL Cluster 6.3.17: Doing a `shutdown -rf now` on the machine running Data Node 3 and 5, they are losing heartbeats: 2008-11-16 13:27:11 [MgmSrvr] WARNING -- Node 3: Node 1 missed heartbeat 2 2008-11-16 13:27:11 [MgmSrvr] WARNING -- Node 5: Node 1 missed heartbeat 2 2008-11-16 13:27:12 [MgmSrvr] WARNING -- Node 4: Node 3 missed heartbeat 2 2008-11-16 13:27:13 [MgmSrvr] WARNING -- Node 4: Node 3 missed heartbeat 3 Shortly after: 2008-11-16 13:27:22 [MgmSrvr] ALERT -- Node 2: Forced node shutdown completed. Caused by error 2315: 'Node declared dead. See error log for details(Arbitration error). Temporary error, restart node'. 2008-11-16 13:27:48 [MgmSrvr] ALERT -- Node 4: Forced node shutdown completed. Caused by error 2305: 'Node lost connection to other nodes and can not form a unpartitioned cluster, please investigate if there are error(s) on other node(s)(Arbitration error). Temporary error, restart node'. Now, I dod the same thing just minutes laters (parts of log): 2008-11-16 13:42:58 [MgmSrvr] INFO -- Node 3: Node shutdown completed. Initiated by signal 15. 2008-11-16 13:42:59 [MgmSrvr] INFO -- Node 2: Possible bug in Dbdih::execBLOCK_COMMIT_ORD c_blockCommit = 1 c_blockCommitNo = 2 sig->failNo = 2008-11-16 13:42:59 [MgmSrvr] INFO -- Node 4: Possible bug in Dbdih::execBLOCK_COMMIT_ORD c_blockCommit = 1 c_blockCommitNo = 2 sig->failNo = 2008-11-16 13:42:59 [MgmSrvr] INFO -- Node 5: Node shutdown completed. Initiated by signal 15. So, one time it has heartbeats failures, and it takes the other nodes with it. Later on signal 15 was recognized and others nodes didn't fail. Logs attached.
[18 Dec 2008 0:35]
Jonathan Miller
Hi, I am now hitting this pretty often in ACRT using NDBAtomics. Current byte-offset of file-pointer is: 568 Time: Thursday 18 December 2008 - 03:10:55 Status: Temporary error, restart node Message: System error, node killed during node restart by other node (Internal error, programming error or missing error message, please report a bug) Error: 2303 Error data: Node 3 killed this node because GCP stop was detected Error object: NDBCNTR (Line: 255) 0x00000006 Program: /data0/cr_autotest/libexec/ndbd Pid: 25606 Trace: ./ndb_3_trace.log.1 Version: mysql-5.1.30 ndb-6.4.0-alpha ***EOM***
[24 Dec 2008 13:03]
Jonathan Miller
TPCC-DD result-2-dn-dd-mysql-5.1-telco-6.4/2008-12-24_01:41/result.1/ndbd.2/ndb_3_error.log Time: Wednesday 24 December 2008 - 04:05:04 Status: Temporary error, restart node Message: System error, node killed during node restart by other node (Internal error, programming error or missing error message, please report a bug) Error: 2303 Error data: Node 3 killed this node because GCP stop was detected Error object: NDBCNTR (Line: 255) 0x00000006 Program: /data0/cr_autotest/libexec/ndbd Pid: 14075 Trace: ./ndb_3_trace.log.1 Version: mysql-5.1.30 ndb-6.4.1-beta ***EOM***
[24 Dec 2008 13:04]
Jonathan Miller
TPCC-DD result-2-dn-perf-mysql-5.1-telco-6.4/2008-12-24_02:12/result.6/ndbd.2/ndb_3_error.log Time: Wednesday 24 December 2008 - 06:48:34 Status: Temporary error, restart node Message: System error, node killed during node restart by other node (Internal error, programming error or missing error message, please report a bug) Error: 2303 Error data: Node 3 killed this node because GCP stop was detected Error object: NDBCNTR (Line: 255) 0x00000006 Program: /data0/cr_autotest/libexec/ndbd Pid: 16250 Trace: ./ndb_3_trace.log.1 Version: mysql-5.1.30 ndb-6.4.1-beta
[24 Dec 2008 13:07]
Jonathan Miller
File: result-2-dn-perf-mysql-5.1-telco-6.4/2008-12-24_02:12/result.6/ndb_api.1/log.out + command_exec '/data0/cr_autotest/bin/mysql -h ndb18 -u root --port 15001 --protocol=tcp -e "SELECT LOGFILE_GROUP_NAME, LOGFILE_GROUP_NUMBER, EXTRA FROM INFORMATION_SCHEMA.FILES"' + '[' -n '' ']' + eval '/data0/cr_autotest/bin/mysql -h ndb18 -u root --port 15001 --protocol=tcp -e "SELECT LOGFILE_GROUP_NAME, LOGFILE_GROUP_NUMBER, EXTRA FROM INFORMATION_SCHEMA.FILES"' ++ /data0/cr_autotest/bin/mysql -h ndb18 -u root --port 15001 --protocol=tcp -e 'SELECT LOGFILE_GROUP_NAME, LOGFILE_GROUP_NUMBER, EXTRA FROM INFORMATION_SCHEMA.FILES' + rc=0 + '[' 0 -ne 0 ']' + command_exec '/data0/cr_autotest/bin/mysql -h ndb18 -u root --port 15001 --protocol=tcp -e "CREATE LOGFILE GROUP lg1 ADD UNDOFILE '\''undofile.dat'\'' initial_size 512M ENGINE=NDB"' + '[' -n '' ']' + eval '/data0/cr_autotest/bin/mysql -h ndb18 -u root --port 15001 --protocol=tcp -e "CREATE LOGFILE GROUP lg1 ADD UNDOFILE '\''undofile.dat'\'' initial_size 512M ENGINE=NDB"' ++ /data0/cr_autotest/bin/mysql -h ndb18 -u root --port 15001 --protocol=tcp -e 'CREATE LOGFILE GROUP lg1 ADD UNDOFILE '\''undofile.dat'\'' initial_size 512M ENGINE=NDB' + rc=0 + '[' 0 -ne 0 ']' + command_exec '/data0/cr_autotest/bin/mysql -h ndb18 -u root --port 15001 --protocol=tcp -e "CREATE TABLESPACE ts1 ADD DATAFILE '\''datafile.dat'\'' USE LOGFILE GROUP lg1 initial_size 3G ENGINE=NDB"' + '[' -n '' ']' + eval '/data0/cr_autotest/bin/mysql -h ndb18 -u root --port 15001 --protocol=tcp -e "CREATE TABLESPACE ts1 ADD DATAFILE '\''datafile.dat'\'' USE LOGFILE GROUP lg1 initial_size 3G ENGINE=NDB"' ++ /data0/cr_autotest/bin/mysql -h ndb18 -u root --port 15001 --protocol=tcp -e 'CREATE TABLESPACE ts1 ADD DATAFILE '\''datafile.dat'\'' USE LOGFILE GROUP lg1 initial_size 3G ENGINE=NDB' + rc=0 + '[' 0 -ne 0 ']' + '[' 1 -gt 1 ']' + '[' X '!=' X ']' + echo 'Creation of DISK DATA Complete' + exit 0 ERRORERROR 1297 (HY000) at line 1: Got temporary error 4025 'Node failure caused abort of transaction' from NDBCLUSTER 1297 (HY000) at line 1: Got temporary error 4025 'Node failure caused abort of transaction' from NDBCLUSTER
[24 Dec 2008 17:46]
Jonathan Miller
ndbatomics-dd result-2-dn-dd-mysql-5.1-telco-6.4/2008-12-24_16:49/result.1/ndbd.1/ndb_2_error.log 2008-12-24 19:44:27 [MgmSrvr] ALERT -- Node 3: Node 2 Disconnected 2008-12-24 19:44:27 [MgmSrvr] INFO -- Node 3: Communication to Node 2 closed 2008-12-24 19:44:27 [MgmSrvr] ALERT -- Node 3: Network partitioning - arbitration required 2008-12-24 19:44:27 [MgmSrvr] INFO -- Node 3: President restarts arbitration thread [state=7] 2008-12-24 19:44:27 [MgmSrvr] ALERT -- Node 1: Node 2 Disconnected 2008-12-24 19:44:27 [MgmSrvr] ALERT -- Node 3: Arbitration won - positive reply from node 1 2008-12-24 19:44:27 [MgmSrvr] INFO -- Node 3: GCP Take over started 2008-12-24 19:44:27 [MgmSrvr] INFO -- Node 3: Node 3 taking over as DICT master 2008-12-24 19:44:27 [MgmSrvr] INFO -- Node 3: GCP Take over completed 2008-12-24 19:44:27 [MgmSrvr] INFO -- Node 3: kk: 976/9 2 0 2008-12-24 19:44:27 [MgmSrvr] ALERT -- Node 2: Forced node shutdown completed. Caused by error 2303: 'System error, node killed during node restart by other node(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'. 2008-12-24 19:44:27 [MgmSrvr] ALERT -- Node 1: Node 2 Disconnected 2008-12-24 19:44:27 [MgmSrvr] INFO -- Node 3: Started arbitrator node 1 [ticket=52ea00024f0c15d4] 2008-12-24 19:44:28 [MgmSrvr] ALERT -- Node 3: Node 5 Disconnected 2008-12-24 19:44:28 [MgmSrvr] INFO -- Node 3: Communication to Node 5 closed 2008-12-24 19:44:28 [MgmSrvr] INFO -- Mgmt server state: nodeid 5 freed, m_reserved_nodes 0000000000000000000000000000000000000000000000000000000000000012. Time: Wednesday 24 December 2008 - 19:44:27 Status: Temporary error, restart node Message: System error, node killed during node restart by other node (Internal error, programming error or missing error message, please report a bug) Error: 2303 Error data: Node 2 killed this node because GCP stop was detected Error object: NDBCNTR (Line: 255) 0x00000006 Program: /data0/cr_autotest/libexec/ndbd Pid: 24187 Trace: ./ndb_2_trace.log.1 Version: mysql-5.1.30 ndb-6.4.1-beta ***EOM***
[7 Jan 2009 21:41]
Jonathan Miller
Current byte-offset of file-pointer is: 568 Time: Wednesday 7 January 2009 - 11:42:34 Status: Temporary error, restart node Message: System error, node killed during node restart by other node (Internal error, programming error or missing error message, please report a bug) Error: 2303 Error data: Node 3 killed this node because GCP stop was detected Error object: NDBCNTR (Line: 255) 0x00000006 Program: /data0/cr_autotest/libexec/ndbmtd Pid: 22106 Trace: ./ndb_3_trace.log.1 ./ndb_3_trace.log.1_t1 ./ndb_3_trace.log.1_t2 ./ndb_3_trace.log
[7 Jan 2009 21:43]
Jonathan Miller
ndb_3_trace.log.1 DBTC 004430 DBTC 000190 000222 004850 004850 004850 004850 004850 004850 004850 004850 004850 004850 004850 004850 004850 004850 004850 004853 DBTC 004430 DBTC 000190 000222 004850 004850 004850 004850 004850 004850 004850 004850 004873 DBTC 004430 DBTC 004430 QMGR 003059 003197 DBTC 004430 DBTC 004430 016016 013076 NDBCNTR 000210 000255 --------------- Signal ---------------- r.bn: 246 "DBDIH", r.proc: 3, r.sigId: 114006583 gsn: 465 "DUMP_STATE_ORD" prio: 0 s.bn: 246 "DBDIH", s.proc: 2, s.sigId: 101194834 length: 1 trace: 0 #sec: 0 fragInf: 0 H'00001b6e --------------- Signal ---------------- r.bn: 252 "QMGR", r.proc: 3, r.sigId: 114006582 gsn: 3 "API_REGREQ" prio: 0 s.bn: 4002 "API", s.proc: 1, s.sigId: 0 length: 3 trace: 0 #sec: 0 fragInf: 0 H'0fa20001 H'00060401 H'0005011e --------------- Signal ---------------- r.bn: 245 "DBTC", r.proc: 3, r.sigId: 114006581 gsn: 164 "CONTINUEB" prio: 1 s.bn: 245 "DBTC", s.proc: 3, s.sigId: 114006580 length: 3 trace: 1 #sec: 0 fragInf: 0 H'00000006 H'0000003b H'0003db5b --------------- Signal ---------------- r.bn: 245 "DBTC", r.proc: 3, r.sigId: 114006580 gsn: 164 "CONTINUEB" prio: 1 s.bn: 245 "DBTC", s.proc: 3, s.sigId: 114006579 length: 3 trace: 1 #sec: 0 fragInf: 0 H'00000006 H'0000003b H'0003da7f --------------- Signal ---------------- r.bn: 245 "DBTC", r.proc: 3, r.sigId: 114006579 gsn: 164 "CONTINUEB" prio: 1 s.bn: 245 "DBTC", s.proc: 3, s.sigId: 114006577 length: 3 trace: 1 #sec: 0 fragInf: 0 H'00000006 H'0000003b H'0003dcc5 --------------- Signal ---------------- r.bn: 245 "DBTC", r.proc: 3, r.sigId: 114006578 gsn: 160 "COMPLETECONF" prio: 1 s.bn: 247/1 "DBLQH", s.proc: 3, s.sigId: 336796444 length: 4 trace: 1 #sec: 0 fragInf: 0 H'0003e002 H'00000003 H'0001f751 H'00300500 --------------- Signal ----------------
[7 Jan 2009 21:44]
Jonathan Miller
ndb_3_trace.log.1_t1 LGMAN 001440 001457 001461 001467 TSMAN 001937 000454 PGMAN 001600 000153 000153 LGMAN 001797 --------------- Signal ---------------- r.bn: 247 "DBLQH", r.proc: 3, r.sigId: 883179 gsn: 761 "STOP_FOR_CRASH" prio: 0 s.bn: 0 "SYS", s.proc: 0, s.sigId: 0 length: 1 trace: 0 #sec: 0 fragInf: 0 H'00000000 --------------- Signal ---------------- r.bn: 260 "LGMAN", r.proc: 3, r.sigId: 883178 gsn: 270 "FSWRITECONF" prio: 1 s.bn: 253 "NDBFS", s.proc: 3, s.sigId: 114006521 length: 1 trace: 0 #sec: 0 fragInf: 0 UserPointer: -2145861632 --------------- Signal ---------------- r.bn: 247 "DBLQH", r.proc: 3, r.sigId: 883177 gsn: 409 "TIME_SIGNAL" prio: 1 s.bn: 252 "QMGR", s.proc: 3, s.sigId: 114006520 length: 1 trace: 0 #sec: 0 fragInf: 0 H'00000004 --------------- Signal ---------------- r.bn: 260 "LGMAN", r.proc: 3, r.sigId: 883176 gsn: 164 "CONTINUEB" prio: 1 s.bn: 260 "LGMAN", s.proc: 3, s.sigId: 883175 length: 3 trace: 0 #sec: 0 fragInf: 0 H'00000002 H'1890c000 H'00000000 --------------- Signal ---------------- r.bn: 260 "LGMAN", r.proc: 3, r.sigId: 883175 gsn: 164 "CONTINUEB" prio: 0 s.bn: 260 "LGMAN", s.proc: 3, s.sigId: 883173 length: 3 trace: 0 #sec: 0 fragInf: 0 H'00000002 H'1890c000 H'00000001
[7 Jan 2009 21:45]
Jonathan Miller
ndb_3_trace.log.1_t2 DBLQH 002801 DBTUP 010029 DBLQH 002801 DBTUP 010029 DBLQH 002801 DBTUP 010029 --------------- Signal ---------------- r.bn: 247/1 "DBLQH", r.proc: 3, r.sigId: 336796598 gsn: 761 "STOP_FOR_CRASH" prio: 0 s.bn: 0 "SYS", s.proc: 0, s.sigId: 0 length: 1 trace: 0 #sec: 0 fragInf: 0 H'00000000 --------------- Signal ---------------- r.bn: 248/1 "DBACC", r.proc: 3, r.sigId: 336796597 gsn: 253 "EXPANDCHECK2" prio: 1 s.bn: 248/1 "DBACC", s.proc: 3, s.sigId: 336796596 length: 3 trace: 1 #sec: 0 fragInf: 0 H'00000016 H'00001ae2 H'00003fff --------------- Signal ---------------- r.bn: 248/1 "DBACC", r.proc: 3, r.sigId: 336796596 gsn: 253 "EXPANDCHECK2" prio: 1 s.bn: 248/1 "DBACC", s.proc: 3, s.sigId: 336796595 length: 3 trace: 1 #sec: 0 fragInf: 0 H'00000016 H'00001ae1 H'00003fff --------------- Signal ---------------- r.bn: 248/1 "DBACC", r.proc: 3, r.sigId: 336796595 gsn: 253 "EXPANDCHECK2" prio: 1 s.bn: 248/1 "DBACC", s.proc: 3, s.sigId: 336796594 length: 3 trace: 1 #sec: 0 fragInf: 0 H'00000016 H'00001ae0 H'00003fff --------------- Signal ---------------- r.bn: 248/1 "DBACC", r.proc: 3, r.sigId: 336796594 gsn: 253 "EXPANDCHECK2" prio: 1 s.bn: 248/1 "DBACC", s.proc: 3, s.sigId: 336796593 length: 3 trace: 1 #sec: 0 fragInf: 0 H'00000016 H'00001adf H'00003fff --------------- Signal ---------------- r.bn: 248/1 "DBACC", r.proc: 3, r.sigId: 336796593 gsn: 253 "EXPANDCHECK2" prio: 1 s.bn: 248/1 "DBACC", s.proc: 3, s.sigId: 336796592 length: 3 trace: 1 #sec: 0 fragInf: 0 H'00000016 H'00001ade H'00003fff --------------- Signal ---------------- r.bn: 248/1 "DBACC", r.proc: 3, r.sigId: 336796592 gsn: 253 "EXPANDCHECK2" prio: 1 s.bn: 248/1 "DBACC", s.proc: 3, s.sigId: 336796527 length: 3 trace: 1 #sec: 0 fragInf: 0 H'00000016 H'00001add H'00003fff --------------- Signal ---------------- r.bn: 247/3 "DBLQH", r.proc: 3, r.sigId: 336796591 gsn: 152 "COMMIT" prio: 1 s.bn: 247/1 "DBLQH", s.proc: 2, s.sigId: 242794995 length: 5 trace: 1 #sec: 0 fragInf: 0 H'0000011d H'00000a01 H'0001f780 H'00300500 H'00000001 --------------- Signal ---------------- r.bn: 247/3 "DBLQH", r.proc: 3, r.sigId: 336796590 gsn: 152 "COMMIT" prio: 1 s.bn: 247/1 "DBLQH", s.proc: 2, s.sigId: 242794994 length: 5 trace: 1 #sec: 0 fragInf: 0 H'0000015e H'00000a01 H'0001f780 H'00300500 H'00000001
[7 Jan 2009 21:46]
Jonathan Miller
ndb_3_trace.log.1_t3 DBTUP 065233 065247 DBACC 000081 000119 DBLQH 002801 DBTUP 010029 DBLQH 002801 DBTUP 010029 --------------- Signal ---------------- r.bn: 247/2 "DBLQH", r.proc: 3, r.sigId: 286619747 gsn: 761 "STOP_FOR_CRASH" prio: 0 s.bn: 0 "SYS", s.proc: 0, s.sigId: 0 length: 1 trace: 0 #sec: 0 fragInf: 0 H'00000000 --------------- Signal ---------------- r.bn: 248/2 "DBACC", r.proc: 3, r.sigId: 286619746 gsn: 164 "CONTINUEB" prio: 0 s.bn: 248/2 "DBACC", s.proc: 3, s.sigId: 285196946 length: 2 trace: 0 #sec: 0 fragInf: 0 H'00000008 H'00000014 --------------- Signal ---------------- r.bn: 249/2 "DBTUP", r.proc: 3, r.sigId: 286619745 gsn: 164 "CONTINUEB" prio: 0 s.bn: 249/2 "DBTUP", s.proc: 3, s.sigId: 285196945 length: 2 trace: 0 #sec: 0 fragInf: 0 H'00000008 H'00000014 --------------- Signal ---------------- r.bn: 248/2 "DBACC", r.proc: 3, r.sigId: 286619744 gsn: 253 "EXPANDCHECK2" prio: 1 s.bn: 248/2 "DBACC", s.proc: 3, s.sigId: 286619743 length: 3 trace: 1 #sec: 0 fragInf: 0 H'00000016 H'00001ab7 H'00003fff --------------- Signal ---------------- r.bn: 248/2 "DBACC", r.proc: 3, r.sigId: 286619743 gsn: 253 "EXPANDCHECK2" prio: 1 s.bn: 248/2 "DBACC", s.proc: 3, s.sigId: 286619742 length: 3 trace: 1 #sec: 0 fragInf: 0 H'00000016 H'00001ab6 H'00003fff --------------- Signal ---------------- r.bn: 248/2 "DBACC", r.proc: 3, r.sigId: 286619742 gsn: 253 "EXPANDCHECK2" prio: 1 s.bn: 248/2 "DBACC", s.proc: 3, s.sigId: 286619741 length: 3 trace: 1 #sec: 0 fragInf: 0 H'00000016 H'00001ab5 H'00003fff --------------- Signal ---------------- r.bn: 248/2 "DBACC", r.proc: 3, r.sigId: 286619741 gsn: 253 "EXPANDCHECK2" prio: 1 s.bn: 248/2 "DBACC", s.proc: 3, s.sigId: 286619740 length: 3 trace: 1 #sec: 0 fragInf: 0 H'00000016 H'00001ab4 H'00003fff --------------- Signal ---------------- r.bn: 248/2 "DBACC", r.proc: 3, r.sigId: 286619740 gsn: 253 "EXPANDCHECK2" prio: 1 s.bn: 248/2 "DBACC", s.proc: 3, s.sigId: 286619739 length: 3 trace: 1 #sec: 0 fragInf: 0 H'00000016 H'00001ab3 H'00003fff --------------- Signal ----------------
[7 Jan 2009 21:47]
Jonathan Miller
ndb_3_trace.log.1_t4 CMVMI 000238 CMVMI 000238 CMVMI 000238 CMVMI 000238 CMVMI 000238 CMVMI 000238 CMVMI 000238 --------------- Signal ---------------- r.bn: 254 "CMVMI", r.proc: 3, r.sigId: 17526 gsn: 761 "STOP_FOR_CRASH" prio: 0 s.bn: 0 "SYS", s.proc: 0, s.sigId: 0 length: 1 trace: 0 #sec: 0 fragInf: 0 H'00000000 --------------- Signal ---------------- r.bn: 254 "CMVMI", r.proc: 3, r.sigId: 17525 gsn: 247 "EVENT_REP" prio: 0 s.bn: 0 "SYS", s.proc: 3, s.sigId: 0 length: 3 trace: 0 #sec: 0 fragInf: 0 H'00000029 H'00000002 H'000015b4 --------------- Signal ---------------- r.bn: 254 "CMVMI", r.proc: 3, r.sigId: 17524 gsn: 247 "EVENT_REP" prio: 1 s.bn: 0 "SYS", s.proc: 0, s.sigId: 0 length: 6 trace: 0 #sec: 0 fragInf: 0 H'00000046 H'00000003 H'0000005a H'0000048f H'001e8427 H'00f42345 --------------- Signal ---------------- r.bn: 254 "CMVMI", r.proc: 3, r.sigId: 17523 gsn: 247 "EVENT_REP" prio: 1 s.bn: 0 "SYS", s.proc: 0, s.sigId: 0 length: 6 trace: 0 #sec: 0 fragInf: 0 H'00000046 H'00000002 H'00000010 H'00000083 H'001e8471 H'00f42581 --------------- Signal ----------------
[16 Jan 2009 22:04]
Jonathan Miller
Current byte-offset of file-pointer is: 568 Time: Wednesday 14 January 2009 - 11:11:46 Status: Temporary error, restart node Message: System error, node killed during node restart by other node (Internal error, programming error or missing error message, please report a bug) Error: 2303 Error data: Node 3 killed this node because GCP stop was detected Error object: NDBCNTR (Line: 255) 0x00000006 Program: /data0/cr_autotest/libexec/ndbd Pid: 29614 Trace: ./ndb_3_trace.log.1 Version: mysql-5.1.30 ndb-6.4.1-beta ***EOM***
[2 Feb 2009 14:13]
Jonathan Miller
Hitting GCP STOP is easier in 6.4 if you use less then 8 data file for Disk Data. This is due to the new readers thread parameter "ThreadPool". (default = 8). If you are hitting "GCP STOP" in 6.4, you should adjust the "ThreadPool" to be = to the # of disk data files you have.
[2 Feb 2009 14:29]
Jon Stephens
Tomas asked me to look into this for discussion with Jonas and documenting known issue(s) - I've put bug in Documenting status (at least for the time being) so it'll show up in my work queue and I don't forget.
[9 Feb 2009 12:34]
Jon Stephens
Added info about GCP stop issues to docs, will appear at http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster-ndbd-definition.html#mysql-cluster-nd... and http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster-ndbd-definition.html#mysql-cluster-nd... Also documented ThreadPool parameter (which wasn't previously documented) Commit to docs is here: http://lists.mysql.com/commits/65603 Docs work for now being finished, I've returned this bug to Verified status for further handling by NDB devs/Support/QA.
[25 Mar 2009 11:06]
Jonathan Miller
please also see http://bugs.mysql.com/bug.php?id=41292
[25 Mar 2009 11:08]
Jonathan Miller
see also http://bugs.mysql.com/bug.php?id=43069
[16 Apr 2009 13:26]
Jonathan Miller
http://bugs.mysql.com/bug.php?id=43882 http://bugs.mysql.com/bug.php?id=39498
[24 Sep 2010 12:49]
Manu Agarwal
HI, I got the same error in the production today. I am not using any disk tables, all my data is in memeory, but even then the mysql-cluster crashed. Please find below the part of error logs, 2010-09-23 19:46:51 [MgmtSrvr] INFO -- Node 2: Local checkpoint 12005 started. Keep GCI = 313401 oldest restorable GCI = 313734 2010-09-23 19:59:55 [MgmtSrvr] INFO -- Node 2: Local checkpoint 12005 completed 2010-09-23 19:59:56 [MgmtSrvr] INFO -- Node 2: Local checkpoint 12006 started. Keep GCI = 313783 oldest restorable GCI = 314002 2010-09-23 20:00:02 [MgmtSrvr] ALERT -- Node 2: Node 55 Disconnected 2010-09-23 20:00:02 [MgmtSrvr] INFO -- Node 2: Communication to Node 55 closed 2010-09-23 20:00:03 [MgmtSrvr] INFO -- Mgmt server state: nodeid 55 freed, m_reserved_nodes 1, 53, 54, 56, 57 and 58. 2010-09-23 20:00:04 [MgmtSrvr] ALERT -- Node 2: Node 58 Disconnected 2010-09-23 20:00:04 [MgmtSrvr] INFO -- Node 2: Communication to Node 58 closed 2010-09-23 20:00:04 [MgmtSrvr] ALERT -- Node 2: Node 57 Disconnected 2010-09-23 20:00:04 [MgmtSrvr] INFO -- Node 2: Communication to Node 57 closed 2010-09-23 20:00:04 [MgmtSrvr] INFO -- Mgmt server state: nodeid 58 freed, m_reserved_nodes 1, 53, 54, 56 and 57. 2010-09-23 20:00:04 [MgmtSrvr] INFO -- Mgmt server state: nodeid 57 freed, m_reserved_nodes 1, 53, 54 and 56. 2010-09-23 20:00:06 [MgmtSrvr] INFO -- Node 2: Communication to Node 55 opened 2010-09-23 20:00:07 [MgmtSrvr] INFO -- Node 2: Communication to Node 58 opened 2010-09-23 20:00:07 [MgmtSrvr] ALERT -- Node 2: Node 56 Disconnected 2010-09-23 20:00:07 [MgmtSrvr] INFO -- Node 2: Communication to Node 56 closed 2010-09-23 20:00:08 [MgmtSrvr] INFO -- Node 2: Communication to Node 57 opened 2010-09-23 20:00:08 [MgmtSrvr] INFO -- Mgmt server state: nodeid 56 freed, m_reserved_nodes 1, 53 and 54. 2010-09-23 20:00:11 [MgmtSrvr] INFO -- Node 2: Communication to Node 56 opened 2010-09-23 20:13:00 [MgmtSrvr] INFO -- Node 2: Local checkpoint 12006 completed 2010-09-23 20:13:02 [MgmtSrvr] INFO -- Node 2: Local checkpoint 12007 started. Keep GCI = 314165 oldest restorable GCI = 314166 2010-09-23 20:26:06 [MgmtSrvr] INFO -- Node 2: Local checkpoint 12007 completed 2010-09-23 20:26:07 [MgmtSrvr] INFO -- Node 2: Local checkpoint 12008 started. Keep GCI = 314550 oldest restorable GCI = 314171 2010-09-23 20:30:40 [MgmtSrvr] ALERT -- Node 2: Node 54 Disconnected 2010-09-23 20:30:40 [MgmtSrvr] ALERT -- Node 2: Node 53 Disconnected 2010-09-23 20:30:46 [MgmtSrvr] ALERT -- Node 1: Node 2 Disconnected 2010-09-23 20:30:51 [MgmtSrvr] ALERT -- Node 1: Node 2 Disconnected 2010-09-23 20:31:09 [MgmtSrvr] ALERT -- Node 2: Forced node shutdown completed. Caused by error 2303: 'System error, node killed during node restart by other node(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'. 2010-09-23 21:32:09 [MgmtSrvr] INFO -- Mgmt server state: nodeid 54 freed, m_reserved_nodes 1 and 53. Since my data was in memory, its all lost, before i could take the dump of it. What could be the possible reason for this or is it related to this bug only. Thanks, Manu
[7 Jun 2011 13:54]
MySQL Verification Team
Please try using version 7.1.10 or later, there have been added some new features relating to GCP Stop detection that should avoid these in the future. http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster-news-5-1-51-ndb-7-1-10.html Important Change: The following changes have been made with regard to the TimeBetweenEpochsTimeout data node configuration parameter: * The maximum possible value for this parameter has been increased from 32000 milliseconds to 256000 milliseconds. * Setting this parameter to zero now has the effect of disabling GCP stops caused by save timeouts, commit timeouts, or both. * The current value of this parameter and a warning are written to the cluster log whenever a GCP save takes longer than 1 minute or a GCP commit takes longer than 10 seconds. # Made it possible to exercise more direct control over handling of timeouts occurring when trying to flush redo logs to disk using two new data node configuration parameters RedoOverCommitCounter and RedoOverCommitLimit, as well as the new API node configuration parameter DefaultOperationRedoProblemAction, all added in this release. Now, when such timeouts occur more than a specified number of times for the flushing of a given redo log, any transactions that were to be written are instead aborted, and the operations contained in those transactions can be either re-tried or themselves aborted. For more information, see Redo log over-commit handling. http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster-ndbd-definition.html#mysql-cluster-re... Duplicate of Bug #58383