Bug #117209 Race condition on m_copy_started_state in starting node
Submitted: 15 Jan 12:20 Modified: 15 Jan 12:42
Reporter: Mikael Ronström
Status: Verified
Category:MySQL Cluster: Cluster (NDB) storage engine Severity:S3 (Non-critical)
Version:8.4.3 OS:Any
Assigned to: CPU Architecture:Any

[15 Jan 12:20] Mikael Ronström
Description:
The variable m_copy_started_state has 3 states in the starting node:
AC_IGNORED: Set before the copy from the live node starts. In this state all writes are ignored, but the node participates in the transactions.
AC_NR_COPY: Set when the first copy row arrives. In this state all LQHKEYREQ are sent using RowId.
AC_NORMAL: Normal operation, set when COPY_ACTIVEREQ is received from the master node.
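
The three states and their transitions can be sketched as a small state machine. This is only an illustrative model in Python, not the NDB source: the class and method names (CopyState, StartingNodeFragment, on_first_copy_row, on_copy_activereq) are hypothetical, and only the state names and triggering events come from the description above.

```python
from enum import Enum, auto


class CopyState(Enum):
    AC_IGNORED = auto()   # before copy from the live node starts
    AC_NR_COPY = auto()   # first copy row has arrived
    AC_NORMAL = auto()    # COPY_ACTIVEREQ received from the master


class StartingNodeFragment:
    """Hypothetical model of m_copy_started_state in the starting node."""

    def __init__(self):
        self.state = CopyState.AC_IGNORED

    def on_first_copy_row(self):
        # Only the first copy row moves the state forward.
        if self.state is CopyState.AC_IGNORED:
            self.state = CopyState.AC_NR_COPY

    def on_copy_activereq(self):
        # The master tells the starting node that copying is complete.
        self.state = CopyState.AC_NORMAL

    def applies_writes(self):
        # In AC_IGNORED, writes are ignored although the node takes
        # part in the transactions.
        return self.state is not CopyState.AC_IGNORED
```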

INSERTs are sent with a RowId even in normal operation. If one arrives after copying has completed but before COPY_ACTIVEREQ arrives, the starting node cannot tell whether it belongs to the copy phase or to normal operation. This can lead to a crash in the case of a DELETE followed by an INSERT. It can also lead to a lost row in the starting node if the INSERT's RowId is a new RowId outside the range currently copied.
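
The ambiguity can be shown with a toy model of what the starting node can infer from its own state plus the incoming request. The function name possible_phases and the string labels are hypothetical, and treating every other request in AC_NR_COPY as copy-phase is a simplifying assumption made only for this sketch.

```python
def possible_phases(state, op, has_row_id):
    """Return the phases a request could belong to, as seen by the
    starting node. Hypothetical helper, not NDB code."""
    if state == "AC_IGNORED":
        return {"ignored"}
    if state == "AC_NORMAL":
        return {"normal"}
    # state == "AC_NR_COPY": COPY_ACTIVEREQ has not arrived yet.
    if op == "INSERT" and has_row_id:
        # INSERTs carry a RowId in normal operation too, so an INSERT
        # sent after copying finished looks like a copy-phase one.
        return {"copy", "normal"}
    # Simplifying assumption: other requests are treated as copy-phase.
    return {"copy"}
```

In this model, possible_phases("AC_NR_COPY", "INSERT", True) has two members, which is the window the report describes.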

How to repeat:
Test case: testIndex -n DeferredMixedLoadError --skip-ndb-optimized-node-selection T1 T6 T13

Run it on an overloaded machine or, even better, delay the COPY_ACTIVEREQ signal to the starting node, or delay COPY_FRAGCONF from the live node to the master.

Suggested fix:
Send a new signal directly from the live node to the starting node to report that copying has finished, so that the race cannot occur. This bug is a classic race in which a signal sent A -> B -> C is overtaken by a later signal sent A -> C, opening a small window in which bad things can happen.
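
The race and the effect of the suggested fix can be sketched with a tiny arrival-time model. This is a sketch under stated assumptions, not NDB code: A is the live node, B the master and C the starting node; each link has a fixed latency (so signals on one link stay in order); the latencies are made-up numbers; and COPY_DONE_DIRECT is a hypothetical name for the new signal.

```python
def arrival_order(link_delay, use_direct_signal):
    """Return signal names in the order they reach the starting node.

    link_delay maps (sender, receiver) to a fixed latency.
    """
    events = []  # (arrival_time, send_sequence, name)
    t_done = 0   # live node finishes copying at time 0

    if use_direct_signal:
        # Suggested fix: live node tells the starting node directly.
        t = t_done + link_delay[("live", "starting")]
        events.append((t, 0, "COPY_DONE_DIRECT"))

    # Existing path: live -> master (COPY_FRAGCONF), then
    # master -> starting (COPY_ACTIVEREQ).
    t_master = t_done + link_delay[("live", "master")]
    t = t_master + link_delay[("master", "starting")]
    events.append((t, 1, "COPY_ACTIVEREQ"))

    # A later INSERT goes straight from the live node to the starting node.
    t = t_done + 1 + link_delay[("live", "starting")]
    events.append((t, 2, "INSERT"))

    return [name for _, _, name in sorted(events)]


# Assumed latencies: the relay via the master is slower than the direct link.
delays = {
    ("live", "starting"): 1,
    ("live", "master"): 5,
    ("master", "starting"): 5,
}
print(arrival_order(delays, use_direct_signal=False))
print(arrival_order(delays, use_direct_signal=True))
```

Without the direct signal the INSERT reaches the starting node before COPY_ACTIVEREQ. With it, the completion notice travels on the same link as the INSERT and was sent first, so it arrives first.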
[15 Jan 12:42] MySQL Verification Team
Hello Mikael,

Thank you for the report and feedback.

Sincerely,
Umesh