MySQL Bugs: #102651: Crash in CONTINUEB when REDO log problem

Bug #102651	Crash in CONTINUEB when REDO log problem
Submitted:	18 Feb 2021 18:18	Modified:	17 Mar 2021 18:08
Reporter:	Mikael Ronström
Status:	Closed
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S2 (Serious)
Version:	8.0.23	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
When a REDO log problem occurs, we can send a CONTINUEB signal with a case that doesn't exist, and we also fail to unlock the log part before returning.

This is the problematic code, with the proposed changes marked ADDED:
    if ((logPartPtr.p->m_log_problems &
         LogPartRecord::P_FILE_CHANGE_PROBLEM)!= 0)
    {
      jam();
ADDED      unlock_log_part(logPartPtr.p);
      g_eventLogger->info("LDM(%u): Gci record write is waiting for "
                          "the redo log file to be changed: "
                          "logpart: %u log part state: %u "
                          "log part problem: %u "
                          "file: %u ref %u logFileStatus %u"
                          "fileChangeState %u "
                          "current mbyte: %u "
                          "logPagePtr.i %u ",
                          instance(),
                          logPartPtr.p->logPartNo,
                          logPartPtr.p->logPartState,
                          logPartPtr.p->m_log_problems,
                          logFilePtr.p->fileNo,
                          logFilePtr.p->fileRef,
                          logFilePtr.p->logFileStatus,
                          logFilePtr.p->fileChangeState,
                          logFilePtr.p->currentMbyte,
                          logPagePtr.i);
      /* Wait for current file to be ready for writes */
ADDED      signal->theData[0] = ZTIME_SUPERVISION;
ADDED      signal->theData[1] = logPartPtr.i;
      sendSignalWithDelay(cownref, GSN_CONTINUEB, signal, 50, 2);
      return;
    }
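
The pattern behind the two changes can be sketched in isolation. This is only a minimal model, not the DBLQH code: the types `LogPartSketch`, `DelayedSignal` and `SupervisorSketch`, the constant `ZTIME_SUPERVISION_SKETCH` and its value are all hypothetical stand-ins invented for illustration. It shows the early-return path releasing the lock and filling in both signal words before the delayed retry is queued.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>

// Hypothetical stand-ins for the NDB kernel types; not the real DBLQH API.
// The case number below is an illustrative value, not the real constant.
constexpr uint32_t ZTIME_SUPERVISION_SKETCH = 1;

struct LogPartSketch {
  bool locked = false;
  bool fileChangeProblem = false;  // models P_FILE_CHANGE_PROBLEM being set
};

struct DelayedSignal {
  uint32_t caseNo;  // plays the role of signal->theData[0]
  uint32_t partNo;  // plays the role of signal->theData[1]
};

struct SupervisorSketch {
  LogPartSketch part;
  std::deque<DelayedSignal> delayed;  // queued CONTINUEB-like retries
  uint32_t writes = 0;

  // One time-supervision pass over the log part.
  void timeSupervision(uint32_t partNo) {
    assert(!part.locked);  // a lock leaked by an earlier pass would trip here
    part.locked = true;
    if (part.fileChangeProblem) {
      part.locked = false;  // release before waiting (first ADDED line)
      // Fill in both signal words before queuing the retry
      // (second and third ADDED lines).
      delayed.push_back({ZTIME_SUPERVISION_SKETCH, partNo});
      return;
    }
    ++writes;  // stands in for the GCI record write
    part.locked = false;
  }

  // Deliver one delayed signal; returns false for an unknown case number,
  // which is where the real block would crash.
  bool deliverOne() {
    if (delayed.empty()) return true;
    DelayedSignal s = delayed.front();
    delayed.pop_front();
    if (s.caseNo != ZTIME_SUPERVISION_SKETCH) return false;
    timeSupervision(s.partNo);
    return true;
  }
};
```

In this model, dropping the `part.locked = false` line on the early-return path makes the next pass hit the assertion, and queuing a signal without a valid `caseNo` makes `deliverOne()` report an unknown case. Those two outcomes correspond to the two symptoms described in this report.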

How to repeat:
Run sysbench with a REDO log that is too small.

Suggested fix:
See the lines marked ADDED in the code above.

Hi Mikael,

Thanks for the report and the fix.

all best
Bogdan

Documented fix as follows in the NDB 8.0.25 changelog:

    To ensure that the log records kept for the redo log in main
    memory are written to the redo log file within one second, a time
    supervisor in DBLQH acquires a lock on the redo log part prior
    to the write. A fix for a previous issue caused a continueB
    signal (introduced as part of that fix) to be sent when the redo
    log file was not yet opened and ready for the write, and then to
    return without releasing the lock. Now, in such cases, we release
    the acquired lock before waiting for the redo log file to be open
    and ready for the write.

Closed.