MySQL Bugs: #76113: Fail in ndbrequire after receiving LCP_COMPLETE_REP

Bug #76113	Fail in ndbrequire after receiving LCP_COMPLETE_REP
Submitted:	2 Mar 2015 20:11	Modified:	16 Mar 2015 17:18
Reporter:	Mikael Ronström
Status:	Closed
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S2 (Serious)
Version:	7.4.4	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
During a node restart, we can fail an ndbrequire that verifies that SYSFILE->latestLCP_ID is equal to
the LCP id sent in the LCP_COMPLETE_REP.

This equality does not currently always hold, since we update SYSFILE->latestLCP_ID in non-master
nodes only when COPY_GCIREQ is sent out at LCPs and GCPs. If no GCP completes between the
START_LCP_REQ of a pause LCP and the LCP_COMPLETE_REP, we hit this ndbrequire.
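The ordering problem can be illustrated with a minimal Python sketch of a non-master node's view of the check. This is a simplified model, not DBDIH source: the class and method names (`NonMasterNode`, `on_copy_gcireq`, `on_start_lcp_req`, `on_lcp_complete_rep`, `run`) are hypothetical, the ndbrequire is modeled as an exception, and it assumes the master has already advanced its own latestLCP_ID to the new LCP id when a later COPY_GCIREQ is sent. Only the signal names and SYSFILE->latestLCP_ID come from the report.

```python
class NdbRequireFailure(Exception):
    """Stands in for an ndbrequire() failure (which would stop the node)."""


class NonMasterNode:
    """Hypothetical model of a non-master node tracking SYSFILE->latestLCP_ID."""

    def __init__(self) -> None:
        self.latest_lcp_id = 0  # models SYSFILE->latestLCP_ID

    def on_copy_gcireq(self, master_latest_lcp_id: int) -> None:
        # COPY_GCIREQ is sent at LCPs and GCPs; in this model it is the
        # only place latestLCP_ID is updated on a non-master node.
        self.latest_lcp_id = master_latest_lcp_id

    def on_start_lcp_req(self, lcp_id: int) -> None:
        # START_LCP_REQ of a pause LCP: latestLCP_ID is NOT updated here.
        pass

    def on_lcp_complete_rep(self, lcp_id: int) -> None:
        # Models the ndbrequire comparing latestLCP_ID with the LCP id.
        if self.latest_lcp_id != lcp_id:
            raise NdbRequireFailure(
                f"latestLCP_ID={self.latest_lcp_id} != LCP_COMPLETE_REP id {lcp_id}"
            )


def run(gcp_between: bool) -> bool:
    """Return True if the LCP_COMPLETE_REP check passes (hypothetical scenario)."""
    node = NonMasterNode()
    node.on_copy_gcireq(master_latest_lcp_id=4)  # state before the pause LCP
    node.on_start_lcp_req(lcp_id=5)
    if gcp_between:
        # A completed GCP sends COPY_GCIREQ carrying the master's current LCP id.
        node.on_copy_gcireq(master_latest_lcp_id=5)
    try:
        node.on_lcp_complete_rep(lcp_id=5)
        return True
    except NdbRequireFailure:
        return False


print(run(gcp_between=True))   # True: a GCP refreshed latestLCP_ID in time
print(run(gcp_between=False))  # False: the check sees the stale value 4
```

The sketch only captures the ordering dependency: the check passes exactly when some COPY_GCIREQ arrives between START_LCP_REQ and LCP_COMPLETE_REP, which is why the failure is timing-dependent.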

How to repeat:
Various tests in autotest, e.g.
testRestartGci T6 D1 
or
testNodeRestart -n NodeFailGCPOpen T1 

The failure is quite rare, so these tests fail only occasionally.

Suggested fix:
Update SYSFILE->latestLCP_ID in START_LCP_REQ after pause LCP
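The suggested fix can be sketched in the same simplified style. This is a hypothetical model, not the actual patch: `PatchedNonMasterNode` and its methods are illustrative names, and the `pause_lcp` flag is an assumption standing in for however the real code recognizes the pause-LCP case. It shows that once START_LCP_REQ itself sets latestLCP_ID, the check no longer depends on a GCP completing in between.

```python
class NdbRequireFailure(Exception):
    """Stands in for an ndbrequire() failure."""


class PatchedNonMasterNode:
    """Hypothetical model of a non-master node with the suggested fix applied."""

    def __init__(self, latest_lcp_id: int) -> None:
        self.latest_lcp_id = latest_lcp_id  # models SYSFILE->latestLCP_ID

    def on_start_lcp_req(self, lcp_id: int, pause_lcp: bool) -> None:
        # Suggested fix: after a pause LCP, take the LCP id from START_LCP_REQ
        # instead of waiting for the next COPY_GCIREQ.
        if pause_lcp:
            self.latest_lcp_id = lcp_id

    def on_lcp_complete_rep(self, lcp_id: int) -> None:
        # Same check as before: latestLCP_ID must match the reported LCP id.
        if self.latest_lcp_id != lcp_id:
            raise NdbRequireFailure(
                f"latestLCP_ID={self.latest_lcp_id} != LCP_COMPLETE_REP id {lcp_id}"
            )


node = PatchedNonMasterNode(latest_lcp_id=4)
node.on_start_lcp_req(lcp_id=5, pause_lcp=True)
# No GCP (and hence no COPY_GCIREQ) completes before LCP_COMPLETE_REP.
node.on_lcp_complete_rep(lcp_id=5)  # passes: latestLCP_ID is already 5
print(node.latest_lcp_id)  # 5
```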

Documented fix as follows in the NDB 7.4.5 changelog:

    During a node restart, if there was no global checkpoint
    completed between the START_LCP_REQ of a local checkpoint and
    the LCP_COMPLETE_REP it was possible for a check of the LCP ID
    sent in the LCP_COMPLETE_REP signal with the internal value
    SYSFILE->latestLCP_ID to fail.
      
Closed.