MySQL Bugs: #117209: Race condition on m_copy_started

Bug #117209	Race condition on m_copy_started_state in starting node
Submitted:	15 Jan 12:20	Modified:	15 Jan 12:42
Reporter:	Mikael Ronström
Status:	Verified
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S3 (Non-critical)
Version:	8.4.3	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
The variable m_copy_started_state has three states in the starting node:
AC_IGNORED: Set before the copy from the live node starts. In this state all writes are ignored, but the node still participates in transactions.
AC_NR_COPY: Set when the first copy row arrives. In this state all LQHKEYREQ signals are sent using RowId.
AC_NORMAL: Normal operation, set when COPY_ACTIVEREQ is received from the master node.
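
The three states and their transitions can be sketched as a small state machine. This is only an illustrative model in Python, not the NDB source: the class StartingNodeFragment and its method names are hypothetical, and only the state names and the triggering events come from the description above.

```python
from enum import Enum, auto


class CopyStartedState(Enum):
    AC_IGNORED = auto()   # before the copy from the live node starts
    AC_NR_COPY = auto()   # after the first copy row has arrived
    AC_NORMAL = auto()    # after COPY_ACTIVEREQ from the master node


class StartingNodeFragment:
    """Hypothetical model of m_copy_started_state in the starting node."""

    def __init__(self) -> None:
        self.state = CopyStartedState.AC_IGNORED

    def on_copy_row(self) -> None:
        # Only the first copy row changes the state.
        if self.state is CopyStartedState.AC_IGNORED:
            self.state = CopyStartedState.AC_NR_COPY

    def on_copy_activereq(self) -> None:
        # COPY_ACTIVEREQ from the master node switches to normal operation.
        self.state = CopyStartedState.AC_NORMAL

    def write_is_ignored(self) -> bool:
        # Writes are ignored only while the copy has not started.
        return self.state is CopyStartedState.AC_IGNORED
```

Note that nothing in this model marks the moment when copying has finished but COPY_ACTIVEREQ has not yet arrived; the fragment simply stays in AC_NR_COPY during that window.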

Inserts are sent with RowId even in normal operation, so if an INSERT arrives after copying has completed but before COPY_ACTIVEREQ arrives, the starting node cannot tell whether it belongs to the copy phase or the normal phase. This can lead to a crash when a DELETE is followed by an INSERT. It can also lead to a lost row in the starting node if the INSERT's RowId is a new RowId outside the range currently copied.

How to repeat:
Test case: testIndex -n DeferredMixedLoadError --skip-ndb-optimized-node-selection T1 T6 T13

Run it on an overloaded machine or, even better, delay the COPY_ACTIVEREQ signal to the starting node, or delay COPY_FRAGCONF from the live node to the master.

Suggested fix:
Send a new signal directly from the live node to the starting node to report that copying has finished, so that the race cannot occur. This bug is a classic race in which a signal sent along A -> B -> C is overtaken by a later signal sent directly from A -> C, opening a small window in which bad things can happen.
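
The race and the suggested fix can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not NDB code: A is the live node, B the master and C the starting node; each link is modelled as a FIFO queue, so ordering is preserved on a single link but not across links. The signal name COPY_FINISHED_DIRECT and the function phase_known_at_insert are invented here for illustration; COPY_FRAGCONF and COPY_ACTIVEREQ are the signals named in this report.

```python
from collections import deque


def phase_known_at_insert(send_direct_signal: bool) -> bool:
    """Return True if C knows copying has finished when the INSERT arrives."""
    # One FIFO queue per link: order is preserved on a link, not across links.
    a_to_b = deque()  # live node -> master
    b_to_c = deque()  # master -> starting node
    a_to_c = deque()  # live node -> starting node

    # Live node A finishes copying and reports it to the master.
    a_to_b.append("COPY_FRAGCONF")
    if send_direct_signal:
        # Hypothetical new signal from the suggested fix (name invented here).
        a_to_c.append("COPY_FINISHED_DIRECT")
    # A later INSERT, sent with RowId as in normal operation.
    a_to_c.append("INSERT")

    copy_finished_known = False
    known_at_insert = False

    # Worst-case schedule: the direct A -> C link drains first.
    while a_to_c:
        signal = a_to_c.popleft()
        if signal == "COPY_FINISHED_DIRECT":
            copy_finished_known = True
        elif signal == "INSERT":
            known_at_insert = copy_finished_known

    # Only now does the slow path A -> B -> C complete.
    if a_to_b.popleft() == "COPY_FRAGCONF":
        b_to_c.append("COPY_ACTIVEREQ")
    if b_to_c.popleft() == "COPY_ACTIVEREQ":
        copy_finished_known = True

    return known_at_insert


print(phase_known_at_insert(send_direct_signal=False))  # False: race window open
print(phase_known_at_insert(send_direct_signal=True))   # True: same-link ordering closes it
```

The point of the sketch is that the direct signal travels on the same link as the INSERT, so per-link ordering alone guarantees that C sees it first, regardless of how long the path through the master takes.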

Hello Mikael,

Thank you for the report and feedback.

Sincerely,
Umesh