Bug #102651 Crash in CONTINUEB when REDO log problem
Submitted: 18 Feb 2021 18:18 Modified: 17 Mar 2021 18:08
Reporter: Mikael Ronström
Status: Closed
Category:MySQL Cluster: Cluster (NDB) storage engine Severity:S2 (Serious)
Version:8.0.23 OS:Any
Assigned to: CPU Architecture:Any

[18 Feb 2021 18:18] Mikael Ronström
Description:
When a REDO log problem occurs, it is possible that we send a CONTINUEB signal with a case that doesn't exist, and we also fail to unlock the log part before returning.

This is the problematic code, with the lines added by the fix marked ADDED:
    if ((logPartPtr.p->m_log_problems &
         LogPartRecord::P_FILE_CHANGE_PROBLEM)!= 0)
    {
      jam();
ADDED      unlock_log_part(logPartPtr.p);
      g_eventLogger->info("LDM(%u): Gci record write is waiting for "
                          "the redo log file to be changed: "
                          "logpart: %u log part state: %u "
                          "log part problem: %u "
                          "file: %u ref %u logFileStatus %u"
                          "fileChangeState %u "
                          "current mbyte: %u "
                          "logPagePtr.i %u ",
                          instance(),
                          logPartPtr.p->logPartNo,
                          logPartPtr.p->logPartState,
                          logPartPtr.p->m_log_problems,
                          logFilePtr.p->fileNo,
                          logFilePtr.p->fileRef,
                          logFilePtr.p->logFileStatus,
                          logFilePtr.p->fileChangeState,
                          logFilePtr.p->currentMbyte,
                          logPagePtr.i);
      /* Wait for current file to be ready for writes */
ADDED      signal->theData[0] = ZTIME_SUPERVISION;
ADDED      signal->theData[1] = logPartPtr.i;
      sendSignalWithDelay(cownref, GSN_CONTINUEB, signal, 50, 2);
      return;
    }
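
The pattern behind the ADDED lines can be sketched in isolation. This is a minimal, self-contained illustration and not the DBLQH code: LogPart, Signal, DelayedSignal, g_queue, kTimeSupervision (a stand-in for ZTIME_SUPERVISION), kFileChangeProblem and timeSupervision() are all hypothetical names, and the "delayed send" is only recorded in a vector. It shows the two things the fix does on the early-return path: release the log part lock, and fill in the signal payload so the retry is dispatched to an existing case.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins; none of these are the real NDB types or constants.
constexpr std::uint32_t kTimeSupervision = 1;      // stands in for ZTIME_SUPERVISION
constexpr std::uint32_t kFileChangeProblem = 0x1;  // stands in for P_FILE_CHANGE_PROBLEM

struct Signal {
  std::uint32_t theData[25] = {};
};

// Records what would have been handed to a delayed send.
struct DelayedSignal {
  std::uint32_t data0;
  std::uint32_t data1;
  std::uint32_t delayMs;
};

struct LogPart {
  std::uint32_t index = 0;
  std::uint32_t problems = 0;
  bool locked = false;
};

std::vector<DelayedSignal> g_queue;  // assumed test harness, not a real scheduler

void lockLogPart(LogPart &p) { assert(!p.locked); p.locked = true; }
void unlockLogPart(LogPart &p) { assert(p.locked); p.locked = false; }

// Returns true if the write path ran, false if the work was deferred.
bool timeSupervision(Signal &signal, LogPart &part) {
  lockLogPart(part);
  if ((part.problems & kFileChangeProblem) != 0) {
    // Release the lock before the early return, otherwise the retry
    // (and everyone else) finds the log part still locked.
    unlockLogPart(part);
    // Fill in the payload explicitly so the retry carries a valid case
    // and the log part it refers to, instead of leftover signal data.
    signal.theData[0] = kTimeSupervision;
    signal.theData[1] = part.index;
    g_queue.push_back({signal.theData[0], signal.theData[1], 50});
    return false;
  }
  // The actual write would happen here.
  unlockLogPart(part);
  return true;
}
```

In this sketch, calling timeSupervision() on a log part with kFileChangeProblem set leaves the part unlocked and queues one retry carrying kTimeSupervision and the part index. Without the unlock, the retry itself would trip the assertion in lockLogPart().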

How to repeat:
Run sysbench with a REDO log that is too small.

Suggested fix:
See above
[22 Feb 2021 16:33] MySQL Verification Team
Hi Mikael,

Thanks for the report and the fix.

all best
Bogdan
[17 Mar 2021 18:08] Jon Stephens
Documented fix as follows in the NDB 8.0.25 changelog:

    To ensure that the log records kept for the redo log in main
    memory are written to the redo log file within one second, a time
    supervisor in DBLQH acquires a lock on the redo log part prior
    to the write. A fix for a previous issue caused a continueB
    signal (introduced as part of that fix) to be sent when the redo
    log file was not yet opened and ready for the write, then to
    return without releasing the lock. Now in such cases we release the
    acquired lock before waiting for the redo log file to be open
    and ready for the write.

Closed.