Bug #52135 Take over (of master) during system restart leads to DICT error
Submitted: 17 Mar 2010 12:12 Modified: 17 Mar 2010 16:23
Reporter: Jonas Oreland
Status: Closed
Category: MySQL Cluster: Cluster (NDB) storage engine Severity: S3 (Non-critical)
Version: mysql-5.1-telco-6.3 OS: Any
Assigned to: Jonas Oreland CPU Architecture: Any

[17 Mar 2010 12:12] Jonas Oreland
Description:
During a complicated mix of node and system restarts,
the elected master is sometimes missing REDO,
so that it needs (optimized) node recovery.

Iff this happened, DICT would crash with
<quote>
Error data: Failure to recreate object X during restart, error 721. Check configuration changes and instructions from 'perror --ndb 721'
</quote>

This was because the DICT restart code was being run twice!
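
To illustrate why running restart code twice can end in an error of this kind, here is a toy model in Python. It is not NDB code: ToyDict, SchemaError and run_restart_twice are hypothetical names, and it assumes that error 721 is the "object already exists" schema error (check with 'perror --ndb 721'). It only shows that a non-idempotent "recreate" step fails on its second pass.

```python
# Toy model only: all names here are hypothetical, none of this is NDB code.
# Assumption: 721 is the "schema object already exists" error code.
ERR_OBJECT_EXISTS = 721


class SchemaError(Exception):
    def __init__(self, code, name):
        super().__init__(
            f"Failure to recreate object {name} during restart, error {code}"
        )
        self.code = code


class ToyDict:
    """Stands in for a dictionary block that recreates objects at restart."""

    def __init__(self):
        self.objects = set()

    def recreate(self, name):
        # Recreating an object that is already present is an error,
        # so this step is not safe to repeat.
        if name in self.objects:
            raise SchemaError(ERR_OBJECT_EXISTS, name)
        self.objects.add(name)

    def restart(self, names):
        for name in names:
            self.recreate(name)


def run_restart_twice(names):
    """Run the restart pass twice; return the error code, or None."""
    d = ToyDict()
    d.restart(names)      # first pass succeeds
    try:
        d.restart(names)  # second pass finds the objects already present
    except SchemaError as exc:
        return exc.code
    return None
```

With at least one object, the second pass in this model fails with the assumed code; with no objects there is nothing to collide with and both passes succeed.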

How to repeat:
"testSystemRestart -n Bug48436" produces it sporadically.
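
Since the failure is sporadic, it can help to rerun the test until it fails. A minimal sketch in Python follows; run_until_failure is a hypothetical helper, and it assumes the test program is on PATH and exits with a non-zero status on failure.

```python
import subprocess


def run_until_failure(cmd, max_runs=50):
    """Run cmd repeatedly.

    Returns the 1-based number of the first run that exits non-zero,
    or None if all max_runs runs succeed.
    """
    for attempt in range(1, max_runs + 1):
        result = subprocess.run(cmd)
        if result.returncode != 0:
            return attempt
    return None


# Example call (assumes testSystemRestart is installed and on PATH):
#   run_until_failure(["testSystemRestart", "-n", "Bug48436"])
```

The max_runs cap is there so that a build which already contains the fix does not loop forever.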

Suggested fix:
Run TO (takeover) directly, instead of waiting until the rest of
the nodes have started, if this happens.
[17 Mar 2010 12:23] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/103571

3154 Jonas Oreland	2010-03-17
      ndb - fix bug#52135 - TO of master! during SR
[17 Mar 2010 12:52] Bugs System
Pushed into 5.1.44-ndb-6.3.33 (revid:jonas@mysql.com-20100317122157-fyqy826tgsi6wtin) (version source revid:jonas@mysql.com-20100317121123-w4sms01noqkx4o65) (merge vers: 5.1.44-ndb-6.3.33) (pib:16)
[17 Mar 2010 12:52] Bugs System
Pushed into 5.1.44-ndb-7.0.14 (revid:jonas@mysql.com-20100317123740-n3dvpvoa2p9x7oq6) (version source revid:jonas@mysql.com-20100317123551-tye1spfcw9u2ayep) (merge vers: 5.1.44-ndb-7.0.14) (pib:16)
[17 Mar 2010 12:53] Jonas Oreland
Pushed to 6.3.33, 7.0.14, and 7.1.3.
[17 Mar 2010 16:23] Jon Stephens
Documented in the NDB-6.3.33, 7.0.14, and 7.1.3 changelogs, as follows:

        When performing a complex mix of node restarts and system
        restarts, a node that was elected as master sometimes required
        optimized node-recovery due to missing REDO information. When
        this happened, the node crashed with Failure to recreate object
        ... during restart, error 721 (because the DBDICT restart code
        was run twice). Now when this occurs, node takeover is executed
        immediately, rather than being made to wait until the remaining
        data nodes have started.

Closed.