Bug #43053 Parallel node-recovery can be serialized by LCP
Submitted: 20 Feb 2009 10:13 Modified: 20 Feb 2009 14:55
Reporter: Jonas Oreland
Status: Closed
Category: MySQL Cluster: Cluster (NDB) storage engine Severity: S3 (Non-critical)
Version: >= 6.3 OS: Any
Assigned to: Jonas Oreland CPU Architecture: Any

[20 Feb 2009 10:13] Jonas Oreland
Description:
Parallel node recovery, as implemented in 6.3, is
- serial when copying/syncing dictionary/checkpoint information
- parallel when copying/syncing data

In addition, the syncing of dictionary/checkpoint information cannot run in parallel
with a local checkpoint (LCP).

So, if several nodes start in parallel while LCPs are running or starting continuously,
each node may have to wait for an LCP in turn, which serializes their recovery.

This can increase restart times.

How to repeat:
See above.

Suggested fix:
This fix introduces a new parameter, MaxLCPStartDelay (in seconds), which allows an LCP
to be delayed by up to this value (if there are nodes waiting), so that several nodes
can start and sync dictionary/checkpoint information.
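
The intended effect of the delay can be illustrated with a toy model. This is only a sketch under stated assumptions, not the NDB implementation: the function name lcp_start_time and its arguments are hypothetical, time is modelled as plain seconds, and "nodes waiting" is approximated as nodes whose dictionary/checkpoint sync has not finished when the LCP is requested.

```python
def lcp_start_time(requested_at, sync_done_times, max_lcp_start_delay):
    """Return when the LCP starts in this toy model (all values in seconds).

    requested_at        -- time at which the LCP would normally start
    sync_done_times     -- times at which each starting node finishes its
                           dictionary/checkpoint sync (hypothetical input)
    max_lcp_start_delay -- upper bound on the delay, as MaxLCPStartDelay
    """
    # Nodes that are still waiting/syncing when the LCP is requested.
    pending = [t for t in sync_done_times if t > requested_at]
    if not pending:
        # Nobody is waiting, so the LCP is not delayed at all.
        return requested_at
    # Wait for the last pending node, but never longer than the limit.
    deadline = requested_at + max_lcp_start_delay
    return min(max(pending), deadline)


if __name__ == "__main__":
    print(lcp_start_time(100, [105, 110], 25))
    print(lcp_start_time(100, [105, 200], 25))
```

With a limit of 0 the model returns the requested time unchanged, so no delay is applied.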
[20 Feb 2009 10:22] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/67009

2870 Jonas Oreland	2009-02-20
      ndb - bug#43053 - introduce parameter that can delay LCP to allow for "more" parallel node recovery
[20 Feb 2009 10:38] Bugs System
Pushed into 5.1.32-ndb-6.3.23 (revid:jonas@mysql.com-20090220102059-h4gkj6mio06bhtwn) (version source revid:jonas@mysql.com-20090220102059-h4gkj6mio06bhtwn) (merge vers: 5.1.32-ndb-6.3.23) (pib:6)
[20 Feb 2009 10:39] Bugs System
Pushed into 5.1.32-ndb-6.4.3 (revid:jonas@mysql.com-20090220103615-q13lhmhbzdri4u4t) (version source revid:jonas@mysql.com-20090220103615-q13lhmhbzdri4u4t) (merge vers: 5.1.32-ndb-6.4.3) (pib:6)
[20 Feb 2009 14:55] Jon Stephens
Documented in the NDB-6.3.23 and 6.4.3 changelogs as follows:

        A new data node configuration parameter MaxLCPStartDelay has
        been introduced to facilitate parallel node recovery by causing
        a local checkpoint to be delayed while recovering nodes are
        synchronizing data dictionaries and other meta-information. For
        more information about this parameter, see "Defining MySQL Cluster 
        Data Nodes" 
(http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster-ndbd-definition.html).
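
As a hedged illustration, the parameter could be set for all data nodes in the cluster configuration file roughly as follows. The value 25 is an arbitrary example, not a recommendation, and placing it in the [ndbd default] section is an assumption based on it being a data node parameter. Consult the manual page above for the permitted range and default.

```ini
[ndbd default]
# Allow an LCP to be postponed by up to 25 seconds while starting
# data nodes sync dictionary/checkpoint information.
# The value 25 is illustrative only.
MaxLCPStartDelay=25
```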