MySQL Bugs: #43888: ndbrequire fail in DBDIH during other node failures

Bug #43888	ndbrequire fail in DBDIH during other node failures
Submitted:	26 Mar 2009 16:00	Modified:	2 Apr 2009 8:46
Reporter:	Andrew Hutchings
Status:	Closed
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S2 (Serious)
Version:		OS:	Any
Assigned to:	Jonas Oreland	CPU Architecture:	Any

Description:
Cluster with 4 ndbd nodes (IDs 3-6):

Node 6 fails due to a hard system reset
Node 3 fails because it is in start phase 5 when Node 6 fails
Node 5 fails with a failed ndbrequire (below)
Node 4 fails due to arbitration

The problem is that the failure of Node 5 is unexpected; I believe it was the master node at the time. The error is:

Time: Thursday 26 March 2009 - 06:23:32
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: dbdih/DbdihMain.cpp
Error object: DBDIH (Line: 13870) 0x0000000a
Program: /usr/mysql/libexec/ndbd
Pid: 1956
Trace: /user/database/log/ndb_5_trace.log.1
Version: mysql-5.1.32 ndb-6.3.23-GA
***EOM***

Looking at the source, the failing check is in:

void Dbdih::nodeResetStart(Signal *signal)
...
ndbrequire(m_micro_gcp.m_master.m_state == MicroGcp::M_GCP_IDLE);
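For readers without the source tree at hand, here is a minimal sketch of what that check enforces. It is not the actual DBDIH code: only `M_GCP_IDLE`, `m_micro_gcp`, `m_master`, `m_state` and `nodeResetStart` come from the report above. `M_GCP_IN_PROGRESS`, `require_or_throw` and `resetStartSucceeds` are hypothetical names made up for illustration, and the real `ndbrequire` shuts the data node down rather than throwing.

```cpp
#include <stdexcept>
#include <string>

// Simplified stand-in for the master-side micro-GCP state.
// M_GCP_IN_PROGRESS is a hypothetical placeholder for "any non-idle state".
struct MicroGcp {
  enum State { M_GCP_IDLE, M_GCP_IN_PROGRESS };
  struct Master {
    State m_state = M_GCP_IDLE;
  } m_master;
};

// Hypothetical substitute for ndbrequire: throws instead of stopping the node,
// so the failed check can be observed by the caller.
inline void require_or_throw(bool cond, const char* what) {
  if (!cond) {
    throw std::runtime_error(std::string("failed require: ") + what);
  }
}

// Sketch of the quoted check: handling a node's reset-start is only
// considered valid while the master's micro-GCP state is idle.
inline void nodeResetStart(const MicroGcp& m_micro_gcp) {
  require_or_throw(m_micro_gcp.m_master.m_state == MicroGcp::M_GCP_IDLE,
                   "m_micro_gcp.m_master.m_state == MicroGcp::M_GCP_IDLE");
}

// Helper for experimenting: reports whether the check passes for a given state.
inline bool resetStartSucceeds(MicroGcp::State state) {
  MicroGcp gcp;
  gcp.m_master.m_state = state;
  try {
    nodeResetStart(gcp);
    return true;
  } catch (const std::runtime_error&) {
    return false;
  }
}
```

In this sketch, `resetStartSucceeds(MicroGcp::M_GCP_IDLE)` passes, while any non-idle state trips the check. That matches the symptom in the error log: the check was reached while the master's state was not idle.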

How to repeat:
.

Reproduced with an error insert; easy to fix.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/71061

2894 Jonas Oreland	2009-04-01
      ndb - bug#43888 - fix race condition with ndb dying during restart, when just about to be included into gcp

Pushed into 5.1.32-ndb-6.2.18 (revid:jonas@mysql.com-20090401115605-youo23cdc00fceyr) (version source revid:jonas@mysql.com-20090401112538-7we3wp7172fa0drr) (merge vers: 5.1.32-ndb-6.2.18) (pib:6)

Pushed into 5.1.32-ndb-6.3.24 (revid:jonas@mysql.com-20090401122231-l9tvo17bvrt9u63k) (version source revid:jonas@mysql.com-20090401121609-592sd1odszpxryv5) (merge vers: 5.1.32-ndb-6.3.24) (pib:6)

Pushed into 5.1.32-ndb-7.0.5 (revid:jonas@mysql.com-20090401122817-spwyy3i31k8yx4nq) (version source revid:jonas@mysql.com-20090401122652-mei4hg1h61i10ghv) (merge vers: 5.1.32-ndb-7.0.5) (pib:6)

Documented bugfix in the NDB-6.2.18, 6.3.24, and 7.0.5 changelogs as follows:

        A race condition could occur when a data node failed to restart
        just before being included in the next global checkpoint. This
        could cause other data nodes to fail.