Bug #69740 ndbd crashes during start
Submitted: 13 Jul 2013 18:09
Modified: 6 Sep 2017 12:28
Reporter: Dirar Abu-Saymeh
Status: Can't repeat
Category: MySQL Cluster: Disk Data
Severity: S1 (Critical)
Version: 7.2.13
OS: Linux
Assigned to: MySQL Verification Team
CPU Architecture: Any

[13 Jul 2013 18:09] Dirar Abu-Saymeh
Description:
ndbd is crashing during start. I was running 7.2.8 and upgraded to 7.2.13 to see whether that would fix it, but the crash happens in both versions.

Below is what I see in the error log.

I was not able to use ndb_error_reporter, since it is not able to reach the data nodes (I use a non-standard SSH port).

Time: Saturday 13 July 2013 - 17:11:01
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: DblqhMain.cpp
Error object: DBLQH (Line: 18158) 0x00000002
Program: ndbd
Pid: 16924
Version: mysql-5.5.27 ndb-7.2.8
Trace: /disk2/mysql-cluster/ndb_3_trace.log.5 [t1..t1]
***EOM***

Time: Saturday 13 July 2013 - 17:58:46
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: DblqhMain.cpp
Error object: DBLQH (Line: 18366) 0x00000002
Program: ndbd
Pid: 17861
Version: mysql-5.5.31 ndb-7.2.13
Trace: /disk2/mysql-cluster/ndb_3_trace.log.7 [t1..t1]
***EOM***

How to repeat:
Not sure how you can repeat it. For me, it repeats just by starting ndbd.
[13 Jul 2013 18:10] Dirar Abu-Saymeh
Made an error with severity.
[13 Jul 2013 19:14] Hartmut Holzgraefe
Data node trace and out log files are probably needed to diagnose this.

Also: does this happen on a fresh cluster start, or on a cluster that is already populated with data?
[13 Jul 2013 20:24] Dirar Abu-Saymeh
This happens on one of the data nodes. It has data in it already.
[15 Jul 2013 8:56] Hartmut Holzgraefe
Assertion failure happens in this function:

  /* --------------------------------------------------------------------------
   *       IT IS NOW TIME TO FIND WHERE TO START EXECUTING THE LOG.
   *       THIS SIGNAL IS SENT FOR EACH LOG PART AND STARTS THE EXECUTION 
   *       OF THE LOG FOR THIS PART.
   *-------------------------------------------------------------------------- 
  */
  void Dblqh::srLogLimits(Signal* signal)

on this assertion check:

  18157       if (logPartPtr.p->lastLogfile == logFilePtr.i) {
  18158 *       ndbrequire(logPartPtr.p->lastMbyte != tmbyte);
  18159       }//if

Not sure what exactly this is checking for, but it looks as if
"the log" (redo log?) is corrupted, and there is probably no
way around this that would fix the situation other than setting
up the cluster from scratch and restoring the most recent
backup ...?
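
To make the quoted condition easier to reason about, here is a minimal standalone sketch of just that check. It is not the NDB kernel code: the struct, the function name, and the field comments are hypothetical, only the field names lastLogfile and lastMbyte and the variable tmbyte are taken from the snippet above, and what those values mean inside srLogLimits is not confirmed here.

```cpp
#include <cstdint>

// Hypothetical stand-in for the log part record referenced by logPartPtr.p.
// Only the two fields used by the quoted check are modeled.
struct LogPartSketch {
    std::uint32_t lastLogfile; // compared against logFilePtr.i in the snippet
    std::uint32_t lastMbyte;   // compared against tmbyte in the snippet
};

// Returns true when the quoted ndbrequire would pass, false when it would
// fail (which in ndbd surfaces as error 2341, "failed ndbrequire").
// currentLogFile plays the role of logFilePtr.i.
bool logLimitCheckPasses(const LogPartSketch& part,
                         std::uint32_t currentLogFile,
                         std::uint32_t tmbyte) {
    if (part.lastLogfile == currentLogFile) {
        // Same file as lastLogfile: the mbyte values must differ.
        return part.lastMbyte != tmbyte;
    }
    // Different file: the quoted check does not apply.
    return true;
}
```

In this sketch the check only fails when the current file equals lastLogfile and tmbyte equals lastMbyte, which is the one combination the quoted assertion rejects.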
[23 Jul 2013 11:39] Dirar Abu-Saymeh
I have reinitialized the data node, and it seemed to run for a while. It has now crashed again after 10 days. This seems to be a different bug; I reported it as bug number 69822.
[7 Jul 2015 11:42] Mikael Ronström
We have occasionally seen a similar crash in our test runs, but it happens extremely seldom, so it has still eluded us.
[6 Sep 2017 12:28] MySQL Verification Team
Cannot reproduce on any of the "modern" releases of mccge.