Bug #58240 Ndb : Multi-node failure handling in QMGR can send invalid signal
Submitted: 16 Nov 2010 17:49 Modified: 6 Jan 2011 6:09
Reporter: Frazer Clement Email Updates:
Status: Closed Impact on me:
Category:MySQL Cluster: Cluster (NDB) storage engine Severity:S3 (Non-critical)
Version:mysql-5.1-telco-6.3 OS:Any
Assigned to: Frazer Clement CPU Architecture:Any

[16 Nov 2010 17:49] Frazer Clement
When testing multi-node failure scenarios, I noted a problem with the failed nodeid received in the PREP_FAILREQ signal.  Specifically, a junk nodeid appears.

Adding more assertions reveals that the root of the problem is in the CLOSE_COMCONF handling code in QMGR, which stores the same nodeid in the cfailedNodes array twice.  This array is then used to build a bitmap passed to receivers of PREP_FAILREQ, and they read junk because the bitmap has fewer set bits than NumberOfFailedNodes indicates.
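A minimal sketch of the failure mode described above (hypothetical names; buildBitmap and readFailedNodes are illustrations, not the real QMGR code): storing a duplicate nodeid sets the same bit twice, so the bitmap carries fewer set bits than the advertised count, and a receiver expecting that count reads junk for the missing entries.

```cpp
#include <bitset>
#include <cstdint>
#include <vector>

const int MAX_NDB_NODES = 49; // value assumed for illustration only

// Build the node bitmask sent with PREP_FAILREQ from a list of failed
// node ids (simplified sketch).  A duplicated id sets the same bit twice.
std::bitset<MAX_NDB_NODES> buildBitmap(const std::vector<uint32_t>& failed) {
  std::bitset<MAX_NDB_NODES> mask;
  for (uint32_t n : failed)
    mask.set(n);
  return mask;
}

// Receiver side: recover the failed node ids by scanning set bits.  If
// the sender's count was derived from a list with duplicates, the mask
// yields fewer ids than the count claims.
std::vector<uint32_t> readFailedNodes(const std::bitset<MAX_NDB_NODES>& mask) {
  std::vector<uint32_t> out;
  for (uint32_t i = 0; i < MAX_NDB_NODES; i++)
    if (mask.test(i))
      out.push_back(i);
  return out;
}
```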

The code is not commented, but my guess at the protocol is:
  Node receives PREP_FAILREQ from the president for node(s) A, B
    Node sends CLOSE_COMREQ to CMVMI
    Node receives CLOSE_COMCONF from CMVMI
  Node checks whether CLOSE_COMCONF included all failed nodes
    If so, it sends PREP_FAILCONF to the president
    If not, it sends PREP_FAILREF to the president - there are more failures for it to commit.

  'All failed nodes' is normally maintained by FAIL_REP handling.
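The guessed handshake above can be sketched as follows.  This is a hedged illustration under the assumed semantics; onCloseComConf, knownFailed and prepFailed are made-up names, not the actual QMGR implementation:

```cpp
#include <cstdint>
#include <set>

enum PrepFailReply { PREP_FAILCONF, PREP_FAILREF };

// On CLOSE_COMCONF, confirm to the president only if every node we know
// to have failed (via FAIL_REP) was covered by the president's
// PREP_FAILREQ; otherwise refuse, since more failures remain to commit.
PrepFailReply onCloseComConf(const std::set<uint32_t>& knownFailed,
                             const std::set<uint32_t>& prepFailed) {
  for (uint32_t n : knownFailed)
    if (prepFailed.count(n) == 0)
      return PREP_FAILREF;  // president has more failures to commit
  return PREP_FAILCONF;
}
```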

It's not clear that this bug is serious, as it is generally seen in a situation where the cluster is likely to fail anyway (too many nodes have failed).

How to repeat:
A multi-node simultaneous failure situation (suggest 4 or more nodes).

Surviving nodes process the asynchronous failure notifications.

Sometimes QMGR asserts.

Suggested fix:
There's a dubious-looking loop in Qmgr::execCLOSE_COMCONF() which starts from 1 for no known reason:

  if (tprepFailConf == ZFALSE) {
    for (Tindex = 1; Tindex < MAX_NDB_NODES; Tindex++) {
      cfailedNodes[Tindex] = cprepFailedNodes[Tindex];

This will copy cprepFailedNodes to cfailedNodes, except for the first entry!  As far as I can see, this should start at 0.
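A standalone demonstration of the suspect copy (copyFromOne is a hypothetical name, and MAX_NDB_NODES is assumed here; only the loop bound mirrors the quoted code): starting at index 1 leaves the destination's entry 0 untouched, losing the first prepared-failure entry.

```cpp
#include <array>
#include <cstdint>

const int MAX_NDB_NODES = 49; // value assumed for illustration only

// Mirrors the loop in Qmgr::execCLOSE_COMCONF(): the copy starts at
// index 1, so dst[0] is never written and src[0] is silently dropped.
void copyFromOne(const std::array<uint32_t, MAX_NDB_NODES>& src,
                 std::array<uint32_t, MAX_NDB_NODES>& dst) {
  for (int i = 1; i < MAX_NDB_NODES; i++)  // should start at 0
    dst[i] = src[i];
}
```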

Other dubious code here:
  The cprepFailedNodes array is overwritten with the contents of the CLOSE_COMREQ bitmask, and the new size is stored in the local variable arrayIndex.  Later, cprepFailedNodes is updated using cnoPrepFailedNodes as an index, which appears unrelated to arrayIndex.

Suggest :
 1) Get agreement on the semantics of this code
 2) Re-implement with comments.
[16 Nov 2010 17:53] Frazer Clement
Patch highlighting interesting areas

Attachment: bug58240.patch (text/x-patch), 1.32 KiB.

[19 Nov 2010 12:46] Jonas Oreland
Is there a fix coming?
[13 Dec 2010 10:02] Frazer Clement
Proposed patch against 6.3

Attachment: bug58240.patch (text/x-patch), 2.30 KiB.

[13 Dec 2010 10:03] Frazer Clement
The bug is more serious for clusters with larger node counts, where multi-node failures are more likely, and are more likely to leave enough nodes for a viable surviving cluster.  In this scenario, the bug could kill enough of the surviving nodes to make cluster survival impossible.
[13 Dec 2010 14:49] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

3362 Frazer Clement	2010-12-13
      Bug#58240 Ndb : Multi-node failure handling in QMGR can send invalid signal
      Fix multi-node failure handling code to avoid invalid signal construction.
[13 Dec 2010 15:50] Bugs System
Pushed into mysql-5.1-telco-7.0 5.1.51-ndb-7.0.21 (revid:frazer@mysql.com-20101213153450-5czlh089vpqfo7v1) (version source revid:frazer@mysql.com-20101213153450-5czlh089vpqfo7v1) (merge vers: 5.1.51-ndb-7.0.21) (pib:23)
[13 Dec 2010 15:55] Bugs System
Pushed into mysql-5.1-telco-6.3 5.1.51-ndb-6.3.40 (revid:frazer@mysql.com-20101213152410-ozb61hffqx17fkex) (version source revid:frazer@mysql.com-20101213144826-vg323kt6tlb7656h) (merge vers: 5.1.51-ndb-6.3.40) (pib:23)
[20 Dec 2010 9:55] Jonas Oreland
pushed to 6.3.40, 7.0.21 and 7.1.10
[6 Jan 2011 6:09] Jon Stephens
Documented fix in the NDB-6.3.40, 7.0.21, and 7.1.10 changelogs as follows:

        When handling failures of multiple data nodes, an error in the
        construction of internal signals could cause the cluster's
        remaining nodes to crash. This issue was most likely to affect
        clusters with large numbers of data nodes.