Bug #76741 | LCP + COPY_FRAGREQ not reserving all its resources at startup for scans | ||
---|---|---|---|
Submitted: | 17 Apr 2015 14:05 | Modified: | 14 Jan 2016 22:35 |
Reporter: | Mikael Ronström | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Cluster: Cluster (NDB) storage engine | Severity: | S2 (Serious) |
Version: | 7.4.7 | OS: | Any |
Assigned to: | | CPU Architecture: | Any |
[17 Apr 2015 14:05]
Mikael Ronström
[13 Jan 2016 14:06]
Jon Stephens
Bug#69994 is a duplicate of this bug.
[14 Jan 2016 22:35]
Jon Stephens
Documented changes in the NDB 7.2.21/7.3.10/7.4.7 changelogs, as follows:

* A number of improvements, listed here, have been made with regard to handling issues that could arise when an overload occurred due to a great number of inserts being performed during a local checkpoint (LCP):

  + Failures sometimes occurred during restart processing when trying to execute the undo log, due to a problem with finding the end of the log. This happened when there remained unwritten pages at the end of the first undo file when writing to the second undo file, which caused the undo logs to be executed in reverse order, and so to execute old or even nonexistent log records. This is fixed by ensuring that execution of the undo log begins with the proper end of the log, and, if started earlier, that any unwritten or faulty pages are ignored.

  + It was possible to fail during an LCP, or when performing a COPY_FRAGREQ, due to running out of operation records. We fix this by making sure that LCPs and COPY_FRAG use resources reserved for operation records, as was already the case with scan records. In addition, old code for ACC operations that was no longer required but that could lead to failures was removed.

  + When an LCP was performed while loading a table, it was possible to hit a livelock during LCP scans, due to the fact that each record that was inserted into new pages after the LCP had started had its LCP_SKIP flag set. Such records were discarded as intended by the LCP scan, but when inserts occurred faster than the LCP scan could discard records, the scan appeared to hang. As part of this issue, the scan failed to report any progress to the LCP watchdog, which after 70 seconds of livelock killed the process. This issue was observed when performing on the order of 250000 inserts per second over an extended period of time (120 seconds or more), using a single LDM.

    This part of the fix makes a number of changes, listed here:

    o We now ensure that pages created after the LCP has started are not included in LCP scans; we also ensure that no records inserted into those pages have their LCP_SKIP flag set.

    o Handling of the scan protocol is changed such that a certain amount of progress is made by the LCP regardless of load; we now report progress to the LCP watchdog so that we avoid failure in the event that an LCP is making progress but not writing any records.

    o We now take steps to guarantee that LCP scans proceed more quickly than inserts can occur, by ensuring that this scanning activity is prioritized, and thus that the LCP is in fact (eventually) completed.

    o In addition, scanning is made more efficient by prefetching tuples; this helps avoid stalls while fetching memory in the CPU.

  + Row checksums for preventing data corruption now include the tuple header bits.

Closed.

NB: This also fixes BUG#76742, BUG#76373, BUG#76883.
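For readers trying to picture the LCP scan livelock described in the changelog entry above, here is a rough toy model. It is not NDB code: the function name, the page-based granularity, and the one-tick-per-second timing are all assumptions made for illustration. Only the 70-second watchdog limit comes from the changelog text. The model contrasts a scan whose end point moves as new pages are inserted (old behavior) with a scan bounded by the page count at LCP start that always reports progress (fixed behavior).

```python
WATCHDOG_LIMIT = 70  # seconds without a progress report (figure from the changelog)


def simulate_lcp_scan(initial_pages, scan_rate, insert_rate, fixed):
    """Toy model of an LCP scan racing against inserts (hypothetical, not NDB code).

    initial_pages: pages that exist when the LCP starts
    scan_rate:     pages the scan handles per second
    insert_rate:   new pages created by inserts per second
    fixed:         True models the fixed behavior, False the old one
    Returns (outcome, seconds_elapsed).
    """
    total_pages = initial_pages
    position = 0   # next page the scan will visit
    silent = 0     # seconds since the watchdog last heard from the scan
    second = 0
    while True:
        # Fixed: pages created after LCP start are never part of the scan.
        scan_end = initial_pages if fixed else total_pages
        if position >= scan_end:
            return "completed", second
        if silent >= WATCHDOG_LIMIT:
            return "killed by watchdog", second

        second += 1
        total_pages += insert_rate
        scan_end = initial_pages if fixed else total_pages

        # Only pages that existed at LCP start hold records the LCP writes;
        # records in newer pages carry LCP_SKIP and are merely discarded.
        wrote_records = position < initial_pages
        position = min(position + scan_rate, scan_end)

        # Old behavior: progress was reported only when records were written.
        # Fixed behavior: progress is reported regardless.
        if wrote_records or fixed:
            silent = 0
        else:
            silent += 1


if __name__ == "__main__":
    print(simulate_lcp_scan(100, 10, 10, fixed=False))
    print(simulate_lcp_scan(100, 10, 10, fixed=True))
```

With inserts creating pages as fast as the scan consumes them, the old-behavior run never reaches the end of the table and goes silent once it is past the original pages, so the watchdog fires. The fixed run stops at the page count taken at LCP start. The real fix also prioritizes scanning and prefetches tuples, which this sketch does not attempt to model.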