MySQL Bugs: #52135: Take over (of master) during system restart leads to DICT error

Bug #52135	Take over (of master) during system restart leads to DICT error
Submitted:	17 Mar 2010 12:12	Modified:	17 Mar 2010 16:23
Reporter:	Jonas Oreland
Status:	Closed
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S3 (Non-critical)
Version:	mysql-5.1-telco-6.3	OS:	Any
Assigned to:	Jonas Oreland	CPU Architecture:	Any

Description:
When performing a complicated mix of node and system restarts,
the elected master can sometimes be missing REDO,
so that it needs (optimized) node recovery.

When this happened, DICT crashed with:
<quote>
Error data: Failure to recreate object X during restart, error 721. Check configuration changes and instructions from 'perror --ndb 721'
</quote>

This happened because the DICT restart code was being run twice.
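
To illustrate why running a restart routine twice can fail in this way, here is a toy Python model. It is not NDB code: the class, the method names, and the assumption that error 721 corresponds to recreating an object that already exists are all hypothetical, chosen only to show a non-idempotent restart step.

```python
class ObjectExistsError(Exception):
    """Hypothetical stand-in for the 'Failure to recreate object' condition."""


class ToyDict:
    """Toy dictionary block; an assumption for illustration, not DBDICT."""

    def __init__(self, schema):
        self.schema = list(schema)  # objects to recreate during restart
        self.live = set()           # objects recreated so far

    def restart(self):
        # Non-idempotent: recreating an object that is already live fails.
        for name in self.schema:
            if name in self.live:
                raise ObjectExistsError(
                    f"Failure to recreate object {name} during restart"
                )
            self.live.add(name)
```

In this model the first call to `restart()` succeeds and a second call raises on the first object, which mirrors the symptom described above without claiming anything about the real code path.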

How to repeat:
Running "testSystemRestart -n Bug48436" reproduces it sporadically.
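
Since the failure is sporadic, it may take many runs to hit. A minimal sketch of a repeat-until-failure helper follows. The helper name and `max_runs` are illustrative, and it assumes that the test binary is on PATH, that a test cluster is already configured, and that a failed run exits with a non-zero status.

```python
import subprocess


def repeat_until_failure(cmd, max_runs=50):
    """Run cmd up to max_runs times.

    Returns the 1-based number of the first run that exits with a
    non-zero status, or None if every run succeeded.
    """
    for attempt in range(1, max_runs + 1):
        result = subprocess.run(cmd)
        if result.returncode != 0:
            return attempt
    return None


# Example call, using the test program and case name from this report:
#   repeat_until_failure(["testSystemRestart", "-n", "Bug48436"])
```

Stopping at the first failing run leaves the cluster logs and trace files from that run in place for inspection.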

Suggested fix:
If this happens, run the takeover (TO) immediately, instead of
waiting until the rest of the nodes have started.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/103571

3154 Jonas Oreland	2010-03-17
      ndb - fix bug#52135 - TO of master! during SR

Pushed into 5.1.44-ndb-6.3.33 (revid:jonas@mysql.com-20100317122157-fyqy826tgsi6wtin) (version source revid:jonas@mysql.com-20100317121123-w4sms01noqkx4o65) (merge vers: 5.1.44-ndb-6.3.33) (pib:16)

Pushed into 5.1.44-ndb-7.0.14 (revid:jonas@mysql.com-20100317123740-n3dvpvoa2p9x7oq6) (version source revid:jonas@mysql.com-20100317123551-tye1spfcw9u2ayep) (merge vers: 5.1.44-ndb-7.0.14) (pib:16)

Pushed to 6.3.33, 7.0.14, and 7.1.3.

Documented in the NDB-6.3.33, 7.0.14, and 7.1.3 changelogs, as follows:

        When performing a complex mix of node restarts and system
        restarts, a node that was elected as master sometimes required
        optimized node-recovery due to missing REDO information. When
        this happened, the node crashed with Failure to recreate object
        ... during restart, error 721 (because the DBDICT restart code
        was run twice). Now when this occurs, node takeover is executed
        immediately, rather than being made to wait until the remaining
        data nodes have started.

Closed.