Description:
The variable m_copy_started_state has 3 states in the starting node:
AC_IGNORED: Set before copy from live node starts. In this state all writes are ignored but the node participates in the transactions.
AC_NR_COPY: Set when first copy row arrives. In this state all LQHKEYREQ are sent using RowId.
AC_NORMAL: Normal operation, set when COPY_ACTIVEREQ is received from master node.
Inserts are sent with RowId even in normal operation, so if an INSERT arrives after copying has completed but before COPY_ACTIVEREQ arrives, the starting node cannot tell whether the INSERT belongs to the copy phase or the normal phase. This can lead to a crash in the case of a DELETE followed by an INSERT. It can also lead to a lost row in the starting node if the INSERT's RowId is a new RowId outside the range currently copied.
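The ambiguity can be illustrated with a small model. This is only a sketch: the type StartingNodeModel, its methods, and insertMisclassified are hypothetical names made up for illustration and are not taken from the NDB source; only the three state names come from the description above.

```cpp
#include <cassert>

// Hypothetical model of the three states listed above; not the actual NDB code.
enum class CopyStartedState { AC_IGNORED, AC_NR_COPY, AC_NORMAL };

struct StartingNodeModel {
  CopyStartedState state = CopyStartedState::AC_IGNORED;

  // The first copy row moves the node from AC_IGNORED to AC_NR_COPY.
  void onFirstCopyRow() {
    if (state == CopyStartedState::AC_IGNORED) {
      state = CopyStartedState::AC_NR_COPY;
    }
  }

  // COPY_ACTIVEREQ from the master switches the node to normal operation.
  void onCopyActiveReq() { state = CopyStartedState::AC_NORMAL; }

  bool writesIgnored() const { return state == CopyStartedState::AC_IGNORED; }

  // Assumption of this sketch: an INSERT with a RowId is handled as a
  // copy-phase write whenever the node is still in AC_NR_COPY.
  bool treatsInsertAsCopyPhase() const {
    return state == CopyStartedState::AC_NR_COPY;
  }
};

// True when the live node has already finished copying but the starting
// node still handles an incoming INSERT as if the copy phase were ongoing.
bool insertMisclassified(const StartingNodeModel& node,
                         bool liveNodeFinishedCopy) {
  return liveNodeFinishedCopy && node.treatsInsertAsCopyPhase();
}
```

In this model the problematic window is exactly the time between the live node finishing the copy and COPY_ACTIVEREQ reaching the starting node: during it, insertMisclassified returns true.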
How to repeat:
Test case: testIndex -n DeferredMixedLoadError --skip-ndb-optimized-node-selection T1 T6 T13
Run it on an overloaded machine or, even better, delay the COPY_ACTIVEREQ signal to the starting node or delay COPY_FRAGCONF from the live node to the master.
Suggested fix:
Send a new signal directly from the live node to the starting node to report that copying has finished, so that the race cannot occur. This bug is a classic race in which a signal sent A -> B -> C is overtaken by a later signal sent A -> C, opening a small window in which bad things can happen.
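The A -> B -> C versus A -> C race can be sketched as a tiny deterministic simulation. Everything here is an illustrative assumption rather than NDB code: the function insertRacesCopyDone, the fixed link delay of one time unit, and the premise that two signals sent over the same A -> C link arrive in the order they were sent.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

enum class Signal { CopyDone, Insert };

// Returns true if starting node C receives the INSERT before it has learned
// that copying finished. A = live node, B = master, C = starting node.
// relayDelay models extra time spent at B before it forwards the signal.
bool insertRacesCopyDone(bool directNotify, int relayDelay) {
  struct Arrival {
    int time;    // arrival time at C
    int seq;     // send order at A, used to break ties
    Signal sig;
  };
  const int linkDelay = 1;  // assumed delay of every single hop
  std::vector<Arrival> arrivals;

  // A finishes copying at t=0 and reports it.
  if (directNotify) {
    arrivals.push_back({0 + linkDelay, 0, Signal::CopyDone});  // A -> C
  } else {
    // A -> B, wait at B, then B -> C
    arrivals.push_back(
        {0 + linkDelay + relayDelay + linkDelay, 0, Signal::CopyDone});
  }
  // A sends an INSERT directly to C at t=1, after the copy has finished.
  arrivals.push_back({1 + linkDelay, 1, Signal::Insert});

  std::sort(arrivals.begin(), arrivals.end(),
            [](const Arrival& x, const Arrival& y) {
              return x.time != y.time ? x.time < y.time : x.seq < y.seq;
            });

  bool copyDoneSeen = false;
  for (const Arrival& a : arrivals) {
    if (a.sig == Signal::CopyDone) {
      copyDoneSeen = true;
    } else {
      return !copyDoneSeen;  // INSERT arrived; was C already informed?
    }
  }
  return false;
}
```

With these assumptions, insertRacesCopyDone(false, 5) reports the race because the relayed notification is overtaken by the later INSERT, while insertRacesCopyDone(true, 5) does not, since both signals travel the same link in send order.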