MySQL Bugs: #75218: Failure during cluster join can cause president failure

Bug #75218	Failure during cluster join can cause president failure
Submitted:	15 Dec 2014 16:54	Modified:	12 Jan 2015 11:13
Reporter:	Mikael Ronström
Status:	Closed
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S2 (Serious)
Version:	7.4.3	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
When a new node starts up, it joins the cluster through a cluster join protocol
in QMGR. If the node fails after connecting to the president but before
connecting to any other live node, and then reconnects and starts up again,
state from the first attempt is still left in the live node that did not see
the connect. This leftover state causes the live node to send two CM_ACKADD
messages to the president, which causes the president to crash.
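
The failure sequence can be illustrated with a toy model. This is only a sketch under stated assumptions, not the QMGR code: the report does not say what form the retained state takes, so the model assumes a simple list of pending join requests, and the class, method, and function names are hypothetical. Only the CM_ACKADD signal name comes from the report.

```python
class LiveNode:
    """Toy stand-in for a live node; not the real QMGR data structures."""

    def __init__(self):
        # Assumption: join requests this node has heard about but not yet acknowledged.
        self.pending_joins = []

    def on_join_announced(self, joining_node_id):
        # Assumed step: the president tells this node that a node is joining.
        self.pending_joins.append(joining_node_id)

    def on_joining_node_connected(self, joining_node_id):
        # Acknowledge every pending entry for the node that just connected.
        acks = [("CM_ACKADD", n) for n in self.pending_joins if n == joining_node_id]
        self.pending_joins = [n for n in self.pending_joins if n != joining_node_id]
        return acks


def president_receive(acks, expected=1):
    """Toy president: it expects exactly one acknowledgement from each live node."""
    if len(acks) != expected:
        raise RuntimeError(f"president got {len(acks)} CM_ACKADD signals, expected {expected}")


live = LiveNode()
live.on_join_announced(5)   # first attempt: node 5 reaches the president only
# Node 5 fails here. The live node never saw the connect, so its entry stays.
live.on_join_announced(5)   # second attempt, after node 5 restarts
acks = live.on_joining_node_connected(5)

try:
    president_receive(acks)
except RuntimeError as err:
    print(err)              # the toy equivalent of the president failing
```

In this model the stale entry from the first attempt is acknowledged together with the entry from the second attempt, which is the duplicate the president does not expect.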

How to repeat:
Run an autotest program that uses CRASH_INSERTION at the right places (already exists
in the code).

Suggested fix:
Clean up this leftover state as part of node failure handling.
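
A minimal sketch of what such cleanup could look like, using a toy model rather than the actual patch. It assumes the retained state is a list of pending join requests; `LiveNode`, `on_node_failed`, and `run_scenario` are hypothetical names introduced only for illustration.

```python
class LiveNode:
    """Toy live node; the pending_joins list is an assumed form of the retained state."""

    def __init__(self):
        self.pending_joins = []

    def on_join_announced(self, node_id):
        # Assumed step: the president announces that node_id is joining.
        self.pending_joins.append(node_id)

    def on_node_failed(self, node_id):
        # The suggested fix in miniature: drop join state belonging to the failed node.
        self.pending_joins = [n for n in self.pending_joins if n != node_id]

    def on_joining_node_connected(self, node_id):
        # Return one CM_ACKADD per pending entry for the node that connected.
        acks = [("CM_ACKADD", n) for n in self.pending_joins if n == node_id]
        self.pending_joins = [n for n in self.pending_joins if n != node_id]
        return acks


def run_scenario(handle_failure):
    """Replay the join, fail, rejoin sequence and count the acknowledgements sent."""
    live = LiveNode()
    live.on_join_announced(5)        # first join attempt
    if handle_failure:
        live.on_node_failed(5)       # cleanup runs when node 5 is reported failed
    live.on_join_announced(5)        # second join attempt after restart
    return len(live.on_joining_node_connected(5))


print(run_scenario(handle_failure=False))
print(run_scenario(handle_failure=True))
```

With the cleanup in place the model sends a single acknowledgement for the second join attempt. Without it, the stale entry produces a second one.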

Thank you for your bug report. A fix for this issue has been committed to the source repository for this product and will be incorporated into the next release.

Documented fix as follows in the NDB 7.4.3 changelog:

    When a new node failed after connecting to the president but not
    to any other live node, then reconnected and started again, a
    live node that did not see the original connection retained old
    state information. This caused the live node to send redundant
    signals to the president, causing it to fail.
      
  Closed.

If necessary, you can access the source repository and build the latest available version, including the bug fix. More information about accessing the source trees is available at

    http://dev.mysql.com/doc/en/installing-source.html