Bug #74319 LCPs start when nodes are still waiting to copy distribution and dictionary
Submitted: 10 Oct 2014 13:14
Modified: 22 Dec 2014 16:38
Reporter: Mikael Ronström
Status: Closed
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S3 (Non-critical)
Version: 7.4.0
OS: Any
Assigned to:
CPU Architecture: Any

[10 Oct 2014 13:14] Mikael Ronström
Description:
Local checkpoints (LCPs) sometimes start a bit too early. During a node restart there are two
phases where we block waiting for LCPs to complete: first to copy data dictionary information,
and second to take part in a complete LCP before we can announce that the node restart is done.

When performing multiple node restarts, e.g. for a rolling restart, often only a subset of
the nodes pass through these "gates" between consecutive LCPs, although the activity they need
sometimes requires blocking LCPs for only a few seconds.

How to repeat:
Run an 8-node cluster and fail 4 nodes after running a flexAsynch test inserting 200M rows.

Suggested fix:
Delay the start of the LCP.
Keep track of how long LCPs take to execute and how long it takes to copy
the dictionary. Ensure that we don't hold back LCPs for more than e.g. 20% of the time
it takes to complete an LCP. But also ensure that an LCP isn't started while a short
activity that needs LCPs to be idle is still pending.
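
The suggested rule can be illustrated with a small sketch. This is not NDB code: the class
LcpPauseBudget, its fields and the may_delay_lcp method are hypothetical names invented for
illustration, and the 20% figure is only the example value mentioned above. It assumes the
duration of the previous LCP is a usable estimate of the next one.

```python
from dataclasses import dataclass


@dataclass
class LcpPauseBudget:
    """Hypothetical helper: decides whether an LCP start may be held back."""

    last_lcp_seconds: float           # measured duration of the previous LCP
    max_pause_fraction: float = 0.20  # the "e.g. 20%" from the suggestion

    def max_pause_seconds(self) -> float:
        # Longest time LCPs may be held back, relative to the LCP duration.
        return self.last_lcp_seconds * self.max_pause_fraction

    def may_delay_lcp(self, paused_seconds: float, nodes_waiting: int) -> bool:
        # No restarting node is waiting at a "gate": nothing to wait for.
        if nodes_waiting == 0:
            return False
        # Keep holding the LCP back only while the pause budget is not used up.
        return paused_seconds < self.max_pause_seconds()


budget = LcpPauseBudget(last_lcp_seconds=100.0)
print(budget.may_delay_lcp(paused_seconds=5.0, nodes_waiting=2))
print(budget.may_delay_lcp(paused_seconds=25.0, nodes_waiting=2))
```

With a previous LCP of 100 seconds the budget is about 20 seconds, so the first call allows
the delay and the second does not. A fuller version could also compare the measured
dictionary copy time against the remaining budget before deciding to wait.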

[22 Dec 2014 16:38] Jon Stephens
Thank you for your bug report. This issue has been committed to our source repository of that product and will be incorporated into the next release.

  Documented fix in the NDB 7.4.3 changelog as follows:

    Local checkpoints were sometimes started earlier than necessary during
    node restarts.

  Closed.

If necessary, you can access the source repository and build the latest available version, including the bug fix. More information about accessing the source trees is available at

    http://dev.mysql.com/doc/en/installing-source.html