Bug #76741 LCP + COPY_FRAGREQ not reserving all its resources at startup for scans
Submitted: 17 Apr 2015 14:05 Modified: 14 Jan 2016 22:35
Reporter: Mikael Ronström
Status: Closed
Category: MySQL Cluster: Cluster (NDB) storage engine Severity: S2 (Serious)
Version: 7.4.7 OS: Any
Assigned to: CPU Architecture: Any

[17 Apr 2015 14:05] Mikael Ronström
Description:
It's possible to crash during an LCP due to running out of operation records. The
same holds when running COPY_FRAGREQ, although here it is the starting node that
crashes.

LCPs can also crash because we haven't allocated the segments for keeping the
ACC pointers beforehand. In addition, some old code for this remains that is
no longer used but can still crash LCPs.

This is BUG#69994, as reported by the community.

How to repeat:
Run sufficiently many concurrent operations in parallel with LCP.

Suggested fix:
Ensure that LCP and COPY_FRAG use reserved resources. This is already true for scan records,
but also needs to be true for operation records and segments for ACC pointers.

Also remove old code for booked ACC operations.
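
The idea of reserving operation records can be illustrated with a minimal sketch. This is not the NDB implementation; the class `OperationRecordPool`, its methods, and the `internal` flag are hypothetical names chosen only to show how a reserved slice keeps LCP and COPY_FRAG from being starved by user operations.

```python
class OperationRecordPool:
    """Fixed-size pool with a slice reserved for internal work (hypothetical model)."""

    def __init__(self, total, reserved):
        if not 0 <= reserved <= total:
            raise ValueError("reserved must be between 0 and total")
        self.shared_limit = total - reserved  # records available to user operations
        self.reserved_limit = reserved        # records only LCP/COPY_FRAG may take
        self.shared_in_use = 0
        self.reserved_in_use = 0

    def seize(self, internal=False):
        """Allocate one record; return False if this caller's slice is exhausted."""
        if internal:
            if self.reserved_in_use < self.reserved_limit:
                self.reserved_in_use += 1
                return True
            return False
        if self.shared_in_use < self.shared_limit:
            self.shared_in_use += 1
            return True
        return False

    def release(self, internal=False):
        """Return one record to the slice it was taken from."""
        if internal:
            if self.reserved_in_use == 0:
                raise RuntimeError("no reserved record to release")
            self.reserved_in_use -= 1
        else:
            if self.shared_in_use == 0:
                raise RuntimeError("no shared record to release")
            self.shared_in_use -= 1
```

With `OperationRecordPool(total=10, reserved=2)`, user operations can exhaust their eight records while an internal caller passing `internal=True` still obtains one, so in this model the internal work never fails for lack of a record.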
[13 Jan 2016 14:06] Jon Stephens
Bug#69994 is a duplicate of this bug.
[14 Jan 2016 22:35] Jon Stephens
Documented changes in the NDB 7.2.21/7.3.10/7.4.7 changelogs, as follows:

     * A number of improvements, listed here, have been made
       with regard to handling issues that could arise when an
       overload occurred due to a great number of inserts being
       performed during a local checkpoint (LCP):

          + Failures sometimes occurred during restart
            processing when trying to execute the undo log, due
            to a problem with finding the end of the log. This
            happened when there remained unwritten pages at the
            end of the first undo file when writing to the
            second undo file, which caused undo logs to be
            executed in reverse order, so that old or even
            nonexistent log records were executed.
            This is fixed by ensuring that execution of the undo
            log begins with the proper end of the log, and, if
            started earlier, that any unwritten or faulty pages
            are ignored.

          + It was possible to fail during an LCP, or when
            performing a COPY_FRAGREQ, due to running out of
            operation records. We fix this by making sure that
            LCPs and COPY_FRAG use resources reserved for
            operation records, as was already the case with scan
            records. In addition, old code for ACC operations
            that was no longer required but that could lead to
            failures was removed.

          + When an LCP was performed while loading a table, it
            was possible to hit a livelock during LCP scans, due
            to the fact that each record that was inserted
            into new pages after the LCP had started had its
            LCP_SKIP flag set. Such records were discarded as
            intended by the LCP scan, but when inserts occurred
            faster than the LCP scan could discard records, the
            scan appeared to hang. As part of this issue, the
            scan failed to report any progress to the LCP
            watchdog, which after 70 seconds of livelock killed
            the process. This issue was observed when performing
            on the order of 250000 inserts per second over an
            extended period of time (120 seconds or more), using
            a single LDM.
            This part of the fix makes a number of changes,
            listed here:
               o We now ensure that pages created after the LCP
                 has started are not included in LCP scans; we
                 also ensure that no records inserted into those
                 pages have their LCP_SKIP flag set.
               o Handling of the scan protocol is changed such
                 that a certain amount of progress is made by
                 the LCP regardless of load; we now report
                 progress to the LCP watchdog so that we avoid
                 failure in the event that an LCP is making
                 progress but not writing any records.
               o We now take steps to guarantee that LCP scans
                 proceed more quickly than inserts can occur, by
                 ensuring that this scanning activity is
                 prioritized, and thus, that the LCP is in fact
                 (eventually) completed.
               o In addition, scanning is made more efficient,
                 by prefetching tuples; this helps avoid stalls
                 while fetching memory in the CPU.

          + Row checksums for preventing data corruption now
            include the tuple header bits.
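
The LCP scan changes described in the list above can be sketched roughly as follows. This is an illustrative model only, not NDB code: `Watchdog`, `lcp_scan`, and the `on_page_scanned` callback are hypothetical names, and pages are modeled as plain lists of rows. The sketch fixes the scan limit when the LCP starts, so pages created afterwards are never scanned, and it reports progress once per page even when a page contributes no records.

```python
class Watchdog:
    """Counts progress reports (stand-in for the LCP watchdog)."""

    def __init__(self):
        self.reports = 0

    def report_progress(self):
        self.reports += 1


def lcp_scan(pages, watchdog, on_page_scanned=None):
    """Scan only the pages that existed when the LCP started.

    pages: list of pages, each a list of rows; it may grow during the scan.
    on_page_scanned: optional callback simulating concurrent inserts.
    """
    scan_limit = len(pages)  # fixed at LCP start; later pages are excluded
    written = []
    for page_no in range(scan_limit):
        written.extend(pages[page_no])
        # Report progress even if this page held no rows to write.
        watchdog.report_progress()
        if on_page_scanned is not None:
            on_page_scanned(pages)
    return written


def insert_new_page(pages):
    """Simulated insert load: every call adds a page after the LCP started."""
    pages.append(["late row"])
```

Calling `lcp_scan(pages, Watchdog(), insert_new_page)` on three initial pages scans exactly those three, however many pages the callback appends, so in this model the scan always terminates.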

Closed.

NB: This also fixes BUG#76742, BUG#76373, BUG#76883.