Description:
When an LCP runs at the same time as we are loading data into a table, we can be hit by a livelock in the LCP scan.
The problem is that every record inserted on a page allocated after the LCP started is
inserted with the LCP_SKIP flag set on the record.
The LCP scan discards those records correctly, but when the user inserts faster than the LCP
scan can discard them, the LCP scan appears to be stuck. It isn't really stuck, but it reports
no progress to the LCP watchdog, which will crash the process after 70 seconds of livelock.
How to repeat:
Run flexAsynch with inserts going on for at least 120 seconds at a very high rate (this
requires a rate of about 250,000 inserts per second on a single LDM on an Intel NUC machine).
So the database size needs to be at least 250,000 * 120 * record size, and for a record size
of 128 bytes the DataMemory needs to be at least 5 GB.
Suggested fix:
1) There is no reason to scan pages created after the start of the LCP, so ensure that those
pages aren't scanned during LCP scans, and also ensure that no records inserted into those
pages have the LCP_SKIP flag set.
2) Ensure that we report progress to the LCP watchdog, so that we avoid crashing in cases
where the LCP is making progress even though it is not writing any records.
3) Ensure that the LCP scan moves faster than inserts can happen, by prioritising this
scanning activity. This ensures that we make progress and that we will eventually complete
the LCP without having to stop insert activity.
4) Also make this scanning case a bit faster by prefetching tuples, to avoid constant
memory-fetch stalls in the CPU during this scanning activity.
These can be handled as separate patches; the only one needed to fix the crash is 2). 1) solves
the most common way of hitting this problem, data loading. 3) also solves some odd cases of
inserting while a number of records still remain in a lot of almost empty pages. Finally, 4)
simply avoids the overhead created by 3), so 3) and 4) are tied together from a performance
point of view.