MySQL Bugs: #69740: ndbd crashes during start

Bug #69740	ndbd crashes during start
Submitted:	13 Jul 2013 18:09	Modified:	6 Sep 2017 12:28
Reporter:	Dirar Abu-Saymeh
Status:	Can't repeat
Category:	MySQL Cluster: Disk Data	Severity:	S1 (Critical)
Version:	7.2.13	OS:	Linux
Assigned to:	MySQL Verification Team	CPU Architecture:	Any

Description:
ndbd is crashing during start. I was running 7.2.8 and upgraded to 7.2.13 to see whether that would fix it, but the crash happens in both versions.

Below is what I see in the error log.

I was not able to use ndb_error_reporter since it is not able to reach the data nodes (I use a non-standard ssh port).

Time: Saturday 13 July 2013 - 17:11:01
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: DblqhMain.cpp
Error object: DBLQH (Line: 18158) 0x00000002
Program: ndbd
Pid: 16924
Version: mysql-5.5.27 ndb-7.2.8
Trace: /disk2/mysql-cluster/ndb_3_trace.log.5 [t1..t1]
***EOM***

Time: Saturday 13 July 2013 - 17:58:46
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: DblqhMain.cpp
Error object: DBLQH (Line: 18366) 0x00000002
Program: ndbd
Pid: 17861
Version: mysql-5.5.31 ndb-7.2.13
Trace: /disk2/mysql-cluster/ndb_3_trace.log.7 [t1..t1]
***EOM***
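
For comparing entries like the two above (same error 2341, but a different line number in each version), here is a minimal sketch of a parser for this error log layout. It assumes only what is visible in the excerpt: "Key: value" lines, with each entry closed by a "***EOM***" line. The names Entry and parseErrorLog are made up for illustration and are not part of any MySQL Cluster tool.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// One error-log entry as key/value pairs, e.g. "Error" -> "2341".
using Entry = std::map<std::string, std::string>;

// Hypothetical helper: split error-log text into entries.
// Assumes each entry is a run of "Key: value" lines ended by "***EOM***".
std::vector<Entry> parseErrorLog(const std::string& text) {
    std::vector<Entry> entries;
    Entry current;
    std::istringstream in(text);
    std::string line;
    while (std::getline(in, line)) {
        // Tolerate Windows line endings.
        if (!line.empty() && line.back() == '\r') line.pop_back();
        if (line == "***EOM***") {
            if (!current.empty()) entries.push_back(current);
            current.clear();
            continue;
        }
        // Split on the first ": " only, so values may contain colons
        // (as in "Time: ... 17:11:01" or "Error object: DBLQH (Line: 18158) ...").
        const std::size_t sep = line.find(": ");
        if (sep == std::string::npos) continue;  // blank or unrecognised line
        current[line.substr(0, sep)] = line.substr(sep + 2);
    }
    return entries;
}
```

With the excerpt above as input, entries[0]["Error object"] and entries[1]["Error object"] would hold the two DBLQH line references, which makes it easy to see that both crashes come from the same module.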

How to repeat:
Not sure how you can repeat it, but I can repeat it just by starting ndbd.

Made an error with severity.

Data node trace and out log files are probably needed to diagnose this.

Also: does this happen on a fresh cluster start, or on a cluster that is already populated with data?

This happens on one of the data nodes. It has data in it already.

Assertion failure happens in this function:

  /* --------------------------------------------------------------------------
   *       IT IS NOW TIME TO FIND WHERE TO START EXECUTING THE LOG.
   *       THIS SIGNAL IS SENT FOR EACH LOG PART AND STARTS THE EXECUTION 
   *       OF THE LOG FOR THIS PART.
   *-------------------------------------------------------------------------- 
  */
  void Dblqh::srLogLimits(Signal* signal)

on this assertion check:

  18157       if (logPartPtr.p->lastLogfile == logFilePtr.i) {
  18158 *       ndbrequire(logPartPtr.p->lastMbyte != tmbyte);
  18159       }//if

Not sure what exactly this is checking for, but it looks as if
"the log" (redo log?) is corrupted, and there is probably no
way around this that would fix the situation besides setting
up the cluster from scratch and restoring the most recent
backup ...?
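
To make the quoted condition easier to reason about, here is a small standalone sketch of it. It mirrors only the two lines shown above and nothing else from srLogLimits. LogPartModel, requireOrThrow, checkLogLimit and checkPasses are hypothetical names; the field names are copied from the snippet, and their exact meaning is not stated in this report. The real ndbrequire stops the node (reported as error 2341 in the log above), whereas this stand-in throws so the outcome can be observed.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Hypothetical, simplified stand-in for the log part record.
// Field names are taken from the quoted snippet; semantics are not modelled.
struct LogPartModel {
    std::uint32_t lastLogfile;
    std::uint32_t lastMbyte;
};

// Stand-in for ndbrequire: throws instead of shutting the node down.
void requireOrThrow(bool condition, const char* what) {
    if (!condition) throw std::runtime_error(what);
}

// Mirrors the quoted check: the requirement only applies when the file
// being examined (currentFile, i.e. logFilePtr.i) is the part's lastLogfile.
void checkLogLimit(const LogPartModel& part,
                   std::uint32_t currentFile,
                   std::uint32_t tmbyte) {
    if (part.lastLogfile == currentFile) {
        requireOrThrow(part.lastMbyte != tmbyte,
                       "lastMbyte == tmbyte in the last log file");
    }
}

// Convenience wrapper: true if the check passes, false if it would fire.
bool checkPasses(const LogPartModel& part,
                 std::uint32_t currentFile,
                 std::uint32_t tmbyte) {
    try {
        checkLogLimit(part, currentFile, tmbyte);
        return true;
    } catch (const std::runtime_error&) {
        return false;
    }
}
```

In this model the check fires only when both comparisons line up: the current file is lastLogfile and tmbyte equals lastMbyte. Any other combination passes, which is consistent with the crash depending on the particular state of the log on the affected node.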

I have reinitialized the data node and it seemed to run for a while. It has now crashed again after 10 days. This seems to be a different bug. I reported it as bug number 69822.

We have occasionally seen a similar crash in our test runs, but it happens extremely seldom, so it has eluded us so far.

Cannot reproduce on any of the "modern" releases of MySQL Cluster CGE.