MySQL Bugs: #26907: random temporary errors causing ndb node failure

Bug #26907	random temporary errors causing ndb node failure
Submitted:	7 Mar 2007 8:57	Modified:	16 Mar 2008 15:02
Reporter:	Paul Maunders	Email Updates:
Status:	No Feedback	Impact on me:	None
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S1 (Critical)
Version:	5.1.11	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
One of the nodes in our 4 node cluster has been crashing randomly, and the error messages suggest we should report a bug:

Time: Tuesday 20 February 2007 - 18:29:46
Status: Temporary error, restart node
Message: System error, node killed during node restart by other node (Internal error, programming error or missing error message, please report a bug)
Error: 2303
Error data: Node 11 killed this node because it could not copy a fragment during node restart. Copy fragment er
Error object: NDBCNTR (Line: 229) 0x0000000a
Program: ndbd
Pid: 11801
Trace: /var/lib/mysql-cluster/ndb_13_trace.log.7

Version: Version 5.1.11 (
Time: Friday 23 February 2007 - 23:12:11
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: dbtup/DbtupExecQuery.cpp
Error object: DBTUP (Line: 1431) 0x0000000a
Program: ndbd
Pid: 3081
Trace: /var/lib/mysql-cluster/ndb_13_trace.log.8
Version: Version 5.1.11 (beta)
***EOM*** 

How to repeat:
Occurs randomly under normal usage.

Hi,

to analyze this kind of errors we need the full set of cluster log files
(cluster log, node error logs, node trace files, node out.log files) and
the cluster configuration file.

You may find the ndb_error_reporter tool usefull in cases like this:
http://dev.mysql.com/doc/refman/5.0/en/mysql-cluster-utilities-ndb-error-reporter.html

ok thanks, how do I install that? It doesn't seem to be part of the ndb tools that   came with the rpm I used to install the cluster initially.

These are all the rpms I have installed:

MySQL-devel-5.1.11-0.glibc23
MySQL-ndb-tools-5.1.11-0.glibc23
MySQL-client-5.1.11-0.glibc23
MySQL-ndb-extra-5.1.11-0.glibc23
MySQL-ndb-storage-5.1.11-0.glibc23
MySQL-server-5.1.11-0.glibc23
MySQL-shared-5.1.11-0.glibc23
MySQL-ndb-management-5.1.11-0.glibc23

I notice according to this bug, they were not added until 5.1.12: http://bugs.mysql.com/bug.php?id=20426

Since I am using 5.1.11, how can I manually install them?

In the absence of ndb_error_reporter just collect
all ndb_*.log and nbd_*trace.log.* files manually,
ndb_error_reporter is just a convenient way to
automate this process.

The same error has re-occured twice on a different node in our cluster:

Time: Thursday 8 Mars 2007 - 11:33:53
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: dbtup/DbtupExecQuery.cpp
Error object: DBTUP (Line: 1431) 0x0000000a
Program: ndbd
Pid: 12812
Trace: /var/lib/mysql-cluster/ndb_14_trace.log.5
Version: Version 5.1.11 (beta)
***EOM***

Time: Saturday 10 Mars 2007 - 18:16:26
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: dbtup/DbtupExecQuery.cpp
Error object: DBTUP (Line: 1431) 0x0000000a
Program: ndbd
Pid: 17098
Trace: /var/lib/mysql-cluster/ndb_14_trace.log.6
Version: Version 5.1.11 (beta)
***EOM***

I will now collect all the related error logs and upload them as a tar.gz

I have uploaded the logs to ftp.mysql.com 

The filename is bug-data-26907.tar.gz

This error happened again on node 13 yesterday:

Time: Tuesday 13 Mars 2007 - 11:10:54
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: dbtup/DbtupExecQuery.cpp
Error object: DBTUP (Line: 1431) 0x0000000a
Program: ndbd
Pid: 15370
Trace: /var/lib/mysql-cluster/ndb_13_trace.log.9
Version: Version 5.1.11 (beta)
***EOM***

Hi,

Has there been any progress on this bug?

Regards,

Paul

Failure is in DbtupExecQuery.cpp line 1431
in member function

  int Dbtup::handleInsertReq(Signal* signal,
                           Ptr<Operationrec> regOperPtr,
                           Ptr<Fragrecord> fragPtr,
                           Tablerec* regTabPtr,
                           KeyReqStruct *req_struct)

  [...]
  1420    else
  1421    {
  1422      if ((req_struct->m_row_id.m_page_no == frag_page_id &&
  1423           req_struct->m_row_id.m_page_idx == regOperPtr.p->m_tuple_location.m_page_idx))
  1424      {
  1425        ndbout_c("no mem insert but rowid (same)");
  1426        base->m_header_bits &= ~(Uint32)Tuple_header::FREE;
  1427      }
  1428      else
  1429      {
  1430        // no mem insert, but rowid
> 1431        ndbrequire(false);
  1432      }
  1433    }
  [...]

Trace leading to this was

DBTUP   000611 000077 014126 014126 014126 014126 014126 014126
        014126 014126 014126 014130 000757 003759 003874 003759
        004049 003759 003946 003759 003909 001431

last signal was

--------------- Signal ----------------
r.bn: 247 "DBLQH", r.proc: 13, r.sigId: 634672459 gsn: 98 "ACCKEYCONF" prio: 1
s.bn: 248 "DBACC", s.proc: 13, s.sigId: 634672457 length: 6 trace: 0 #sec: 0 fragInf: 0
 H'0000443c H'00000002 H'00000003 H'0007e46e H'00000000 H'00000001

attaching sampe trace log

sample trace log

Attachment: ndb_13_trace.log.4.gz (application/x-gzip, text), 54.70 KiB.

The cluster code in the area that you were hitting the ndbrequire() error in has changed quite a bit since 5.1.11. The error you were experiencing should no longer happen with the refined code.

Can you verify that the problem does no longer exist with the current MySQL 5.1.22?

The old 5.1.11 cluster crashed completely this morning with some other random errors, and it wouldn't let me restart at all.

Anyway, as a result of this I am upgrading our cluster to 5.1.22 rc1 with ndb_restore from our last backups.

I will let you know if we encounter these errors again with this version.

5.1.22 has been running for about 18 hours now, with all 4 nodes still up.

So, can we close this bug as not repeatable on 5.1.22?

The cluster had been running perfectly for almost 2 months, but on the 25th of January, node 14 lost connection to the other nodes and was shut down by the arbitrator. I'm not sure if this is related?

Time: Friday 25 January 2008 - 15:54:22
Status: Temporary error, restart node
Message: Node lost connection to other nodes and can not form a unpartitioned cluster, please investigate if there are error(s) on other node(s) (Arbitration error)
Error: 2305
Error data: Arbitrator decided to shutdown this node
Error object: QMGR (Line: 4897) 0x0000000a
Program: ndbd
Pid: 10635
Trace: /var/lib/mysql-cluster/ndb_14_trace.log.15
Version: Version 5.1.22 (rc)
***EOM***

2008-01-25 15:25:17 [MgmSrvr] WARNING  -- Node 13: Node 24 missed heartbeat 2
2008-01-25 15:25:17 [MgmSrvr] WARNING  -- Node 12: Node 24 missed heartbeat 2
2008-01-25 15:39:45 [MgmSrvr] WARNING  -- Node 11: Node 24 missed heartbeat 2
2008-01-25 15:43:21 [MgmSrvr] WARNING  -- Node 12: Node 24 missed heartbeat 2
2008-01-25 15:46:58 [MgmSrvr] WARNING  -- Node 13: Node 24 missed heartbeat 2
2008-01-25 15:49:57 [MgmSrvr] WARNING  -- Node 11: Node 24 missed heartbeat 2
2008-01-25 15:49:57 [MgmSrvr] WARNING  -- Node 13: Node 24 missed heartbeat 2
2008-01-25 15:54:01 [MgmSrvr] WARNING  -- Node 14: Node 24 missed heartbeat 2
2008-01-25 15:54:01 [MgmSrvr] WARNING  -- Node 11: Node 24 missed heartbeat 2
2008-01-25 15:54:01 [MgmSrvr] WARNING  -- Node 13: Node 24 missed heartbeat 2
2008-01-25 15:54:02 [MgmSrvr] WARNING  -- Node 12: Node 24 missed heartbeat 2
2008-01-25 15:54:02 [MgmSrvr] WARNING  -- Node 14: Node 24 missed heartbeat 3
2008-01-25 15:54:03 [MgmSrvr] WARNING  -- Node 11: Node 24 missed heartbeat 3
2008-01-25 15:54:03 [MgmSrvr] WARNING  -- Node 13: Node 24 missed heartbeat 3
2008-01-25 15:54:03 [MgmSrvr] WARNING  -- Node 12: Node 24 missed heartbeat 3
2008-01-25 15:54:04 [MgmSrvr] WARNING  -- Node 11: Node 24 missed heartbeat 4
2008-01-25 15:54:04 [MgmSrvr] ALERT    -- Node 11: Node 24 declared dead due to missed heartbeat
2008-01-25 15:54:04 [MgmSrvr] INFO     -- Node 11: Communication to Node 24 closed
2008-01-25 15:54:04 [MgmSrvr] ALERT    -- Node 13: Node 24 Disconnected
2008-01-25 15:54:04 [MgmSrvr] INFO     -- Node 13: Communication to Node 24 closed
2008-01-25 15:54:04 [MgmSrvr] ALERT    -- Node 12: Node 24 Disconnected
2008-01-25 15:54:04 [MgmSrvr] INFO     -- Node 12: Communication to Node 24 closed
2008-01-25 15:54:04 [MgmSrvr] ALERT    -- Node 12: Node 24 Disconnected
2008-01-25 15:54:04 [MgmSrvr] ALERT    -- Node 13: Node 24 Disconnected
2008-01-25 15:54:04 [MgmSrvr] ALERT    -- Node 11: Node 24 Disconnected
2008-01-25 15:54:07 [MgmSrvr] WARNING  -- Node 11: Node 14 missed heartbeat 2
2008-01-25 15:54:08 [MgmSrvr] INFO     -- Node 11: Communication to Node 24 opened
2008-01-25 15:54:08 [MgmSrvr] INFO     -- Node 13: Communication to Node 24 opened
2008-01-25 15:54:08 [MgmSrvr] INFO     -- Node 12: Communication to Node 24 opened
2008-01-25 15:54:09 [MgmSrvr] WARNING  -- Node 11: Node 14 missed heartbeat 3
2008-01-25 15:54:10 [MgmSrvr] ALERT    -- Node 1: Node 14 Disconnected
2008-01-25 15:54:10 [MgmSrvr] ALERT    -- Node 1: Node 14 Disconnected
2008-01-25 15:54:10 [MgmSrvr] WARNING  -- Node 11: Node 14 missed heartbeat 4
2008-01-25 15:54:10 [MgmSrvr] ALERT    -- Node 11: Node 14 declared dead due to missed heartbeat
2008-01-25 15:54:10 [MgmSrvr] INFO     -- Node 11: Communication to Node 14 closed
2008-01-25 15:54:10 [MgmSrvr] INFO     -- Node 13: Communication to Node 14 closed
2008-01-25 15:54:10 [MgmSrvr] INFO     -- Node 12: Communication to Node 14 closed

Please, try to repeat with a newer version, 5.1.23, and inform about the results.

No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".