Bug #26907 | random temporary errors causing ndb node failure | ||
---|---|---|---|
Submitted: | 7 Mar 2007 8:57 | Modified: | 16 Mar 2008 15:02 |
Reporter: | Paul Maunders | Email Updates: | |
Status: | No Feedback | Impact on me: | |
Category: | MySQL Cluster: Cluster (NDB) storage engine | Severity: | S1 (Critical) |
Version: | 5.1.11 | OS: | Any |
Assigned to: | CPU Architecture: | Any |
[7 Mar 2007 8:57]
Paul Maunders
[12 Mar 2007 10:21]
Hartmut Holzgraefe
Hi, to analyze this kind of errors we need the full set of cluster log files (cluster log, node error logs, node trace files, node out.log files) and the cluster configuration file. You may find the ndb_error_reporter tool usefull in cases like this: http://dev.mysql.com/doc/refman/5.0/en/mysql-cluster-utilities-ndb-error-reporter.html
[12 Mar 2007 11:06]
Paul Maunders
ok thanks, how do I install that? It doesn't seem to be part of the ndb tools that came with the rpm I used to install the cluster initially.
[12 Mar 2007 11:09]
Paul Maunders
These are all the rpms I have installed: MySQL-devel-5.1.11-0.glibc23 MySQL-ndb-tools-5.1.11-0.glibc23 MySQL-client-5.1.11-0.glibc23 MySQL-ndb-extra-5.1.11-0.glibc23 MySQL-ndb-storage-5.1.11-0.glibc23 MySQL-server-5.1.11-0.glibc23 MySQL-shared-5.1.11-0.glibc23 MySQL-ndb-management-5.1.11-0.glibc23
[12 Mar 2007 11:13]
Paul Maunders
I notice according to this bug, they were not added until 5.1.12: http://bugs.mysql.com/bug.php?id=20426 Since I am using 5.1.11, how can I manually install them?
[12 Mar 2007 11:29]
Hartmut Holzgraefe
In the absence of ndb_error_reporter just collect all ndb_*.log and nbd_*trace.log.* files manually, ndb_error_reporter is just a convenient way to automate this process.
[12 Mar 2007 11:32]
Paul Maunders
The same error has re-occured twice on a different node in our cluster: Time: Thursday 8 Mars 2007 - 11:33:53 Status: Temporary error, restart node Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug) Error: 2341 Error data: dbtup/DbtupExecQuery.cpp Error object: DBTUP (Line: 1431) 0x0000000a Program: ndbd Pid: 12812 Trace: /var/lib/mysql-cluster/ndb_14_trace.log.5 Version: Version 5.1.11 (beta) ***EOM*** Time: Saturday 10 Mars 2007 - 18:16:26 Status: Temporary error, restart node Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug) Error: 2341 Error data: dbtup/DbtupExecQuery.cpp Error object: DBTUP (Line: 1431) 0x0000000a Program: ndbd Pid: 17098 Trace: /var/lib/mysql-cluster/ndb_14_trace.log.6 Version: Version 5.1.11 (beta) ***EOM*** I will now collect all the related error logs and upload them as a tar.gz
[12 Mar 2007 14:20]
Paul Maunders
I have uploaded the logs to ftp.mysql.com The filename is bug-data-26907.tar.gz
[14 Mar 2007 15:39]
Paul Maunders
This error happened again on node 13 yesterday: Time: Tuesday 13 Mars 2007 - 11:10:54 Status: Temporary error, restart node Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug) Error: 2341 Error data: dbtup/DbtupExecQuery.cpp Error object: DBTUP (Line: 1431) 0x0000000a Program: ndbd Pid: 15370 Trace: /var/lib/mysql-cluster/ndb_13_trace.log.9 Version: Version 5.1.11 (beta) ***EOM***
[16 Jul 2007 12:49]
Paul Maunders
Hi, Has there been any progress on this bug? Regards, Paul
[30 Nov 2007 14:28]
Hartmut Holzgraefe
Failure is in DbtupExecQuery.cpp line 1431 in member function int Dbtup::handleInsertReq(Signal* signal, Ptr<Operationrec> regOperPtr, Ptr<Fragrecord> fragPtr, Tablerec* regTabPtr, KeyReqStruct *req_struct) [...] 1420 else 1421 { 1422 if ((req_struct->m_row_id.m_page_no == frag_page_id && 1423 req_struct->m_row_id.m_page_idx == regOperPtr.p->m_tuple_location.m_page_idx)) 1424 { 1425 ndbout_c("no mem insert but rowid (same)"); 1426 base->m_header_bits &= ~(Uint32)Tuple_header::FREE; 1427 } 1428 else 1429 { 1430 // no mem insert, but rowid > 1431 ndbrequire(false); 1432 } 1433 } [...] Trace leading to this was DBTUP 000611 000077 014126 014126 014126 014126 014126 014126 014126 014126 014126 014130 000757 003759 003874 003759 004049 003759 003946 003759 003909 001431 last signal was --------------- Signal ---------------- r.bn: 247 "DBLQH", r.proc: 13, r.sigId: 634672459 gsn: 98 "ACCKEYCONF" prio: 1 s.bn: 248 "DBACC", s.proc: 13, s.sigId: 634672457 length: 6 trace: 0 #sec: 0 fragInf: 0 H'0000443c H'00000002 H'00000003 H'0007e46e H'00000000 H'00000001 attaching sampe trace log
[30 Nov 2007 14:29]
Hartmut Holzgraefe
sample trace log
Attachment: ndb_13_trace.log.4.gz (application/x-gzip, text), 54.70 KiB.
[30 Nov 2007 16:35]
Hartmut Holzgraefe
The cluster code in the area that you were hitting the ndbrequire() error in has changed quite a bit since 5.1.11. The error you were experiencing should no longer happen with the refined code. Can you verify that the problem does no longer exist with the current MySQL 5.1.22?
[4 Dec 2007 14:34]
Paul Maunders
The old 5.1.11 cluster crashed completely this morning with some other random errors, and it wouldn't let me restart at all. Anyway, as a result of this I am upgrading our cluster to 5.1.22 rc1 with ndb_restore from our last backups. I will let you know if we encounter these errors again with this version.
[5 Dec 2007 9:16]
Paul Maunders
5.1.22 has been running for about 18 hours now, with all 4 nodes still up.
[29 Jan 2008 9:04]
Valeriy Kravchuk
So, can we close this bug as not repeatable on 5.1.22?
[29 Jan 2008 10:36]
Paul Maunders
The cluster had been running perfectly for almost 2 months, but on the 25th of January, node 14 lost connection to the other nodes and was shut down by the arbitrator. I'm not sure if this is related? Time: Friday 25 January 2008 - 15:54:22 Status: Temporary error, restart node Message: Node lost connection to other nodes and can not form a unpartitioned cluster, please investigate if there are error(s) on other node(s) (Arbitration error) Error: 2305 Error data: Arbitrator decided to shutdown this node Error object: QMGR (Line: 4897) 0x0000000a Program: ndbd Pid: 10635 Trace: /var/lib/mysql-cluster/ndb_14_trace.log.15 Version: Version 5.1.22 (rc) ***EOM*** 2008-01-25 15:25:17 [MgmSrvr] WARNING -- Node 13: Node 24 missed heartbeat 2 2008-01-25 15:25:17 [MgmSrvr] WARNING -- Node 12: Node 24 missed heartbeat 2 2008-01-25 15:39:45 [MgmSrvr] WARNING -- Node 11: Node 24 missed heartbeat 2 2008-01-25 15:43:21 [MgmSrvr] WARNING -- Node 12: Node 24 missed heartbeat 2 2008-01-25 15:46:58 [MgmSrvr] WARNING -- Node 13: Node 24 missed heartbeat 2 2008-01-25 15:49:57 [MgmSrvr] WARNING -- Node 11: Node 24 missed heartbeat 2 2008-01-25 15:49:57 [MgmSrvr] WARNING -- Node 13: Node 24 missed heartbeat 2 2008-01-25 15:54:01 [MgmSrvr] WARNING -- Node 14: Node 24 missed heartbeat 2 2008-01-25 15:54:01 [MgmSrvr] WARNING -- Node 11: Node 24 missed heartbeat 2 2008-01-25 15:54:01 [MgmSrvr] WARNING -- Node 13: Node 24 missed heartbeat 2 2008-01-25 15:54:02 [MgmSrvr] WARNING -- Node 12: Node 24 missed heartbeat 2 2008-01-25 15:54:02 [MgmSrvr] WARNING -- Node 14: Node 24 missed heartbeat 3 2008-01-25 15:54:03 [MgmSrvr] WARNING -- Node 11: Node 24 missed heartbeat 3 2008-01-25 15:54:03 [MgmSrvr] WARNING -- Node 13: Node 24 missed heartbeat 3 2008-01-25 15:54:03 [MgmSrvr] WARNING -- Node 12: Node 24 missed heartbeat 3 2008-01-25 15:54:04 [MgmSrvr] WARNING -- Node 11: Node 24 missed heartbeat 4 2008-01-25 15:54:04 [MgmSrvr] ALERT -- Node 11: Node 24 declared dead due to missed heartbeat 2008-01-25 15:54:04 [MgmSrvr] INFO -- Node 11: Communication to Node 24 closed 2008-01-25 15:54:04 [MgmSrvr] ALERT -- Node 13: Node 24 Disconnected 2008-01-25 15:54:04 [MgmSrvr] INFO -- Node 13: Communication to Node 24 closed 2008-01-25 15:54:04 [MgmSrvr] ALERT -- Node 12: Node 24 Disconnected 2008-01-25 15:54:04 [MgmSrvr] INFO -- Node 12: Communication to Node 24 closed 2008-01-25 15:54:04 [MgmSrvr] ALERT -- Node 12: Node 24 Disconnected 2008-01-25 15:54:04 [MgmSrvr] ALERT -- Node 13: Node 24 Disconnected 2008-01-25 15:54:04 [MgmSrvr] ALERT -- Node 11: Node 24 Disconnected 2008-01-25 15:54:07 [MgmSrvr] WARNING -- Node 11: Node 14 missed heartbeat 2 2008-01-25 15:54:08 [MgmSrvr] INFO -- Node 11: Communication to Node 24 opened 2008-01-25 15:54:08 [MgmSrvr] INFO -- Node 13: Communication to Node 24 opened 2008-01-25 15:54:08 [MgmSrvr] INFO -- Node 12: Communication to Node 24 opened 2008-01-25 15:54:09 [MgmSrvr] WARNING -- Node 11: Node 14 missed heartbeat 3 2008-01-25 15:54:10 [MgmSrvr] ALERT -- Node 1: Node 14 Disconnected 2008-01-25 15:54:10 [MgmSrvr] ALERT -- Node 1: Node 14 Disconnected 2008-01-25 15:54:10 [MgmSrvr] WARNING -- Node 11: Node 14 missed heartbeat 4 2008-01-25 15:54:10 [MgmSrvr] ALERT -- Node 11: Node 14 declared dead due to missed heartbeat 2008-01-25 15:54:10 [MgmSrvr] INFO -- Node 11: Communication to Node 14 closed 2008-01-25 15:54:10 [MgmSrvr] INFO -- Node 13: Communication to Node 14 closed 2008-01-25 15:54:10 [MgmSrvr] INFO -- Node 12: Communication to Node 14 closed
[16 Feb 2008 15:02]
Valeriy Kravchuk
Please, try to repeat with a newer version, 5.1.23, and inform about the results.
[17 Mar 2008 0:00]
Bugs System
No feedback was provided for this bug for over a month, so it is being suspended automatically. If you are able to provide the information that was originally requested, please do so and change the status of the bug back to "Open".