Bug #76113 Fail in ndbrequire after receiving LCP_COMPLETE_REP
Submitted: 2 Mar 2015 20:11
Modified: 16 Mar 2015 17:18
Reporter: Mikael Ronström
Status: Closed
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S2 (Serious)
Version: 7.4.4
OS: Any
Assigned to:
CPU Architecture: Any

[2 Mar 2015 20:11] Mikael Ronström
Description:
During a node restart we can fail in an ndbrequire that verifies that SYSFILE->latestLCP_ID is
equal to the LCP id sent in the LCP_COMPLETE_REP.

This does not always hold, since we only update SYSFILE->latestLCP_ID in non-master nodes when
COPY_GCIREQ is sent out at LCPs and GCPs. If no GCP completes between the START_LCP_REQ of a
pause LCP and the LCP_COMPLETE_REP, we hit this ndbrequire.
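
The sequence described above can be illustrated with a small toy model. This is only a sketch and not the NDB kernel code: the types Sysfile and NonMasterNode, the handler names, and the LCP ids 41 and 42 are all hypothetical, and the assumption that a new LCP carries a higher id than the previous one is made purely for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the node's local copy of the system file.
struct Sysfile {
  std::uint32_t latestLCP_ID = 0;
};

// Toy model of a restarting non-master node (not the real NDB block).
struct NonMasterNode {
  Sysfile sysfile;

  // COPY_GCIREQ carries the master's view, so in this model it is the
  // only place where the local latestLCP_ID is refreshed.
  void onCopyGciReq(std::uint32_t masterLcpId) {
    sysfile.latestLCP_ID = masterLcpId;
  }

  // START_LCP_REQ: the handler leaves latestLCP_ID untouched.
  void onStartLcpReq(std::uint32_t lcpId) {
    // Assumption of this sketch: a new LCP has a higher id than the last.
    assert(lcpId > sysfile.latestLCP_ID);
  }

  // Returns the condition that the ndbrequire verifies when
  // LCP_COMPLETE_REP arrives.
  bool lcpCompleteCheck(std::uint32_t lcpId) const {
    return sysfile.latestLCP_ID == lcpId;
  }
};

// Runs one LCP and reports whether the check would hold.
bool checkHolds(bool gcpCompletesDuringLcp) {
  NonMasterNode node;
  node.onCopyGciReq(41);    // state before the new LCP starts
  node.onStartLcpReq(42);   // new LCP begins
  if (gcpCompletesDuringLcp) {
    node.onCopyGciReq(42);  // a completed GCP distributes the new id
  }
  return node.lcpCompleteCheck(42);
}
```

In this model checkHolds(true) passes the check, while checkHolds(false) corresponds to the failing case where no GCP completes in between.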

How to repeat:
Various tests in autotest, e.g.
testRestartGci T6 D1 
or
testNodeRestart -n NodeFailGCPOpen T1 

The failure is quite rare, so these tests do not fail very often.

Suggested fix:
Update SYSFILE->latestLCP_ID when handling START_LCP_REQ after a pause LCP.
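
As a rough illustration of the suggested fix, the toy model below records the LCP id as soon as START_LCP_REQ is handled after a pause LCP. It is a sketch under assumptions, not the actual patch: PatchedSysfile, PatchedNode, the afterPauseLcp flag, and the ids 41 and 42 are hypothetical names and values chosen for the example.

```cpp
#include <cstdint>

// Hypothetical stand-in for the node's local copy of the system file.
struct PatchedSysfile {
  std::uint32_t latestLCP_ID = 0;
};

// Toy model of a non-master node with the suggested fix applied.
struct PatchedNode {
  PatchedSysfile sysfile;

  // COPY_GCIREQ still refreshes the id, as before.
  void onCopyGciReq(std::uint32_t masterLcpId) {
    sysfile.latestLCP_ID = masterLcpId;
  }

  // Suggested fix: also record the id when START_LCP_REQ is handled
  // after a pause LCP (afterPauseLcp is a hypothetical flag).
  void onStartLcpReq(std::uint32_t lcpId, bool afterPauseLcp) {
    if (afterPauseLcp) {
      sysfile.latestLCP_ID = lcpId;
    }
  }

  // Condition verified when LCP_COMPLETE_REP arrives.
  bool lcpCompleteCheck(std::uint32_t lcpId) const {
    return sysfile.latestLCP_ID == lcpId;
  }
};

// Runs one pause LCP and reports whether the check would hold.
bool checkHoldsWithFix(bool gcpCompletesDuringLcp) {
  PatchedNode node;
  node.onCopyGciReq(41);         // state before the new LCP starts
  node.onStartLcpReq(42, true);  // id is recorded immediately
  if (gcpCompletesDuringLcp) {
    node.onCopyGciReq(42);       // later GCP carries the same id
  }
  return node.lcpCompleteCheck(42);
}
```

With the id recorded at START_LCP_REQ, the check in this model no longer depends on whether a GCP completes before LCP_COMPLETE_REP.
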
[16 Mar 2015 17:18] Jon Stephens
Documented fix as follows in the NDB 7.4.5 changelog:

    During a node restart, if there was no global checkpoint
    completed between the START_LCP_REQ of a local checkpoint and
    the LCP_COMPLETE_REP it was possible for a check of the LCP ID
    sent in the LCP_COMPLETE_REP signal with the internal value
    SYSFILE->latestLCP_ID to fail.
      
Closed.