| Bug #76741 | LCP + COPY_FRAGREQ not reserving all its resources at startup for scans | ||
|---|---|---|---|
| Submitted: | 17 Apr 2015 14:05 | Modified: | 14 Jan 2016 22:35 |
| Reporter: | Mikael Ronström | Email Updates: | |
| Status: | Closed | Impact on me: | |
| Category: | MySQL Cluster: Cluster (NDB) storage engine | Severity: | S2 (Serious) |
| Version: | 7.4.7 | OS: | Any |
| Assigned to: | | CPU Architecture: | Any |
[17 Apr 2015 14:05]
Mikael Ronström
[13 Jan 2016 14:06]
Jon Stephens
Bug#69994 is a duplicate of this bug.
[14 Jan 2016 22:35]
Jon Stephens
Documented changes in the NDB 7.2.21/7.3.10/7.4.7 changelogs, as follows:
* A number of improvements, listed here, have been made
with regard to handling issues that could arise when an
overload occurred due to a large number of inserts being
performed during a local checkpoint (LCP):
+ Failures sometimes occurred during restart
processing when trying to execute the undo log, due
to a problem with finding the end of the log. This
happened when there remained unwritten pages at the
end of the first undo file when writing to the
second undo file, which caused the undo logs to be
executed in reverse order, so that old or even
nonexistent log records were executed.
This is fixed by ensuring that execution of the undo
log begins with the proper end of the log, and, if
started earlier, that any unwritten or faulty pages
are ignored.
+ It was possible to fail during an LCP, or when
performing a COPY_FRAGREQ, due to running out of
operation records. We fix this by making sure that
LCPs and COPY_FRAG use resources reserved for
operation records, as was already the case with scan
records. In addition, old code for ACC operations
that was no longer required but that could lead to
failures was removed.
+ When an LCP was performed while loading a table, it
was possible to hit a livelock during LCP scans, due
to the fact that each record that was inserted
into new pages after the LCP had started had its
LCP_SKIP flag set. Such records were discarded as
intended by the LCP scan, but when inserts occurred
faster than the LCP scan could discard records, the
scan appeared to hang. As part of this issue, the
scan failed to report any progress to the LCP
watchdog, which after 70 seconds of livelock killed
the process. This issue was observed when performing
on the order of 250000 inserts per second over an
extended period of time (120 seconds or more), using
a single LDM.
This part of the fix makes a number of changes,
listed here:
o We now ensure that pages created after the LCP
has started are not included in LCP scans; we
also ensure that no records inserted into those
pages have their LCP_SKIP flag set.
o Handling of the scan protocol is changed such
that a certain amount of progress is made by
the LCP regardless of load; we now report
progress to the LCP watchdog so that we avoid
failure in the event that an LCP is making
progress but not writing any records.
o We now take steps to guarantee that LCP scans
proceed more quickly than inserts can occur, by
prioritizing this scanning activity, thus ensuring
that the LCP is in fact (eventually) completed.
o In addition, scanning is made more efficient,
by prefetching tuples; this helps avoid stalls
while fetching memory in the CPU.
+ Row checksums for preventing data corruption now
include the tuple header bits.
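
The reservation of operation records described in the changelog above can be illustrated with a toy model. This is a minimal Python sketch, not NDB code: the class `OperationRecordPool`, its `seize` and `release` methods, and the pool sizes are hypothetical names chosen for illustration. It assumes that internal work (LCP scans, COPY_FRAG) draws only from a reserved slice, so that user load cannot starve it.

```python
class OperationRecordPool:
    """Toy pool of operation records (hypothetical; not the NDB implementation)."""

    def __init__(self, total, reserved):
        if not 0 <= reserved <= total:
            raise ValueError("reserved must be between 0 and total")
        self.free_shared = total - reserved   # records available to user operations
        self.free_reserved = reserved         # records set aside for LCP / COPY_FRAG

    def seize(self, internal=False):
        """Take one record; return the slice it came from, or None if that slice is empty."""
        if internal:
            if self.free_reserved > 0:
                self.free_reserved -= 1
                return "reserved"
            return None
        if self.free_shared > 0:
            self.free_shared -= 1
            return "shared"
        return None

    def release(self, kind):
        """Return a record to the slice it was taken from."""
        if kind == "reserved":
            self.free_reserved += 1
        elif kind == "shared":
            self.free_shared += 1
        else:
            raise ValueError("unknown record kind: %r" % (kind,))


# Usage: user operations exhaust the shared slice, yet an LCP scan can still start.
pool = OperationRecordPool(total=4, reserved=1)
user_records = [pool.seize() for _ in range(4)]   # the fourth request finds nothing left
lcp_record = pool.seize(internal=True)            # still served from the reserved slice
```

In this sketch a user request never falls back to the reserved slice, which is the property that keeps an LCP or fragment copy from failing under heavy insert load.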
Closed.
NB: This also fixes BUG#76742, BUG#76373, BUG#76883.
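
The LCP scan changes listed in the changelog above (skipping pages created after the LCP started, and reporting progress even when nothing is written) can be sketched as follows. This is a minimal Python illustration, not NDB code: `Page`, `lcp_scan`, and the `report_progress` callback are hypothetical names, and the single boolean flag per page is an assumption standing in for whatever bookkeeping the data node actually uses.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    """Hypothetical page: a creation flag plus the rows stored on it."""
    created_after_lcp_start: bool
    rows: list = field(default_factory=list)


def lcp_scan(pages, report_progress):
    """Collect rows from pages that existed when the LCP started.

    report_progress is called once per page visited, including pages that
    are skipped, so a watchdog sees progress even when no rows are written.
    """
    written = []
    for pages_visited, page in enumerate(pages, start=1):
        if not page.created_after_lcp_start:
            written.extend(page.rows)
        report_progress(pages_visited)
    return written


# Usage: two of the three pages were created by inserts after the LCP began.
progress = []
pages = [
    Page(False, ["r1", "r2"]),
    Page(True, ["r3"]),
    Page(True, ["r4", "r5"]),
]
written = lcp_scan(pages, progress.append)
```

Skipping whole pages, rather than discarding flagged records one at a time, is what bounds the scan's work in this model: the number of pages to visit is fixed when the scan begins, regardless of how fast inserts arrive.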
