Description:
During Copy fragment, when an UPDATE or DELETE is the first operation of a
multi-row operation and the primary key does not match (in handle_nr_copy), the
operation takes the AC_IGNORED path, meaning that the operation is ignored.
Later, an INSERT followed by a DELETE occurs on the same row in the same
transaction. Both of these are fully executed on the starting node.
Next, the UPDATE/DELETE reaches the backup for commit. It is ignored there and
passed directly to the primary replica, so the locks on the row remain. When
the UPDATE/DELETE reaches the primary replica, all row operations are committed
and the row is dropped. This leaves the lock queue in place but not connected
to any row, so a new transaction with an INSERT can start in the primary
replica. If this INSERT reaches the starting node before the Commit of the
INSERT in the multi-row transaction, a crash happens.
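The ordering above can be sketched with a toy model. This is not NDB code: ToyStartingNode, ToyOp and replayReportedSequence are hypothetical names, the lock queue is reduced to one deque per key, and the crash is represented by prepare() returning false. It only illustrates that ignoring the first Commit leaves the later operations queued, so an early INSERT from a new transaction finds the key still held.

```cpp
#include <cassert>
#include <deque>
#include <map>
#include <set>

// Toy model only: none of these types exist in NDB.
enum class OpType { Insert, Update, Delete };

struct ToyOp {
  int txn;      // transaction that issued the operation
  OpType type;
};

struct ToyStartingNode {
  std::set<int> rows;                          // keys that exist on this node
  std::map<int, std::deque<ToyOp>> lockQueue;  // prepared, uncommitted ops per key

  // Prepare an operation. An ignored operation (modelling the AC_IGNORED
  // path) leaves no trace. Returns false if another transaction still holds
  // the key; that stands in for the crash described in the report.
  bool prepare(int key, ToyOp op, bool ignored) {
    if (ignored) return true;
    auto it = lockQueue.find(key);
    if (it != lockQueue.end() && !it->second.empty() &&
        it->second.front().txn != op.txn) {
      return false;
    }
    lockQueue[key].push_back(op);
    return true;
  }

  // Commit one operation of txn on key. Committing an ignored operation
  // does nothing here, so anything queued by the same transaction stays.
  void commit(int key, int txn, bool ignored) {
    if (ignored) return;
    auto it = lockQueue.find(key);
    assert(it != lockQueue.end() && !it->second.empty());
    assert(it->second.front().txn == txn);
    ToyOp op = it->second.front();
    it->second.pop_front();
    if (op.type == OpType::Insert) rows.insert(key);
    if (op.type == OpType::Delete) rows.erase(key);
    if (it->second.empty()) lockQueue.erase(it);
  }
};

// Replays the reported sequence. Returns whether the INSERT of the new
// transaction was accepted by the starting node.
bool replayReportedSequence(bool newInsertArrivesEarly) {
  ToyStartingNode node;
  const int key = 1, oldTxn = 10, newTxn = 20;

  node.prepare(key, {oldTxn, OpType::Delete}, /*ignored=*/true);  // key not matched
  node.prepare(key, {oldTxn, OpType::Insert}, /*ignored=*/false);
  node.prepare(key, {oldTxn, OpType::Delete}, /*ignored=*/false);

  node.commit(key, oldTxn, /*ignored=*/true);  // first Commit: locks remain

  if (!newInsertArrivesEarly) {
    node.commit(key, oldTxn, /*ignored=*/false);  // Commit of the INSERT
    node.commit(key, oldTxn, /*ignored=*/false);  // Commit of the DELETE
  }
  return node.prepare(key, {newTxn, OpType::Insert}, /*ignored=*/false);
}
```

In this model the new INSERT is only rejected when it arrives before the remaining Commits of the old transaction, which matches the narrow timing window described above.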
How to repeat:
testIndex -r 10 -n NF_Mixed T1 T6 T13
Quite difficult to reproduce.
To reproduce it reliably, one needs to send the first Commit on a row and then
delay the sending of the second Commit message while performing multi-row
operations. The above test case with some added ERROR_INSERTs should do the trick.
Suggested fix:
The code in handle_nr_copy is unnecessarily complex. In the primary we could keep track
of whether a row operation should be ignored and pass this on as a flag in LQHKEYREQ.
There are other ways to solve the issue, but they seem to add complexity, whereas
the above reduces it.
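A minimal sketch of the flag idea follows. It does not reproduce the real LQHKEYREQ layout: ToyKeyReq, the bit position and all function names are assumptions made up for illustration. The point is only that the primary decides once whether the row operation is to be ignored, and the starting node reads that decision from the request instead of deriving it again from the primary key match.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical request word; the real LQHKEYREQ signal is not modelled here.
struct ToyKeyReq {
  std::uint32_t requestInfo = 0;
};

// Assumed bit position, chosen for illustration only.
constexpr std::uint32_t TOY_NR_IGNORE_SHIFT = 5;
constexpr std::uint32_t TOY_NR_IGNORE_MASK = 1u;

// Set or clear the ignore bit without touching the other bits.
inline void setNrIgnoreFlag(ToyKeyReq& req, bool ignore) {
  req.requestInfo &= ~(TOY_NR_IGNORE_MASK << TOY_NR_IGNORE_SHIFT);
  req.requestInfo |= (ignore ? 1u : 0u) << TOY_NR_IGNORE_SHIFT;
}

inline bool getNrIgnoreFlag(const ToyKeyReq& req) {
  return ((req.requestInfo >> TOY_NR_IGNORE_SHIFT) & TOY_NR_IGNORE_MASK) != 0;
}

enum class ToyAction { Normal, Ignored };

// Primary side: the decision is taken once and carried in the request.
inline ToyKeyReq buildRequestOnPrimary(bool rowOpShouldBeIgnored) {
  ToyKeyReq req;
  setNrIgnoreFlag(req, rowOpShouldBeIgnored);
  assert(getNrIgnoreFlag(req) == rowOpShouldBeIgnored);
  return req;
}

// Starting node side: follow the flag rather than inspecting the row.
inline ToyAction chooseActionOnStartingNode(const ToyKeyReq& req) {
  return getNrIgnoreFlag(req) ? ToyAction::Ignored : ToyAction::Normal;
}
```

With the decision carried in the request, prepare and commit of the same operation see the same answer on the starting node, which is the simplification the suggestion is after.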