Bug #43888 ndbrequire fail in DBDIH during other node failures
Submitted: 26 Mar 2009 16:00
Modified: 2 Apr 2009 8:46
Reporter: Andrew Hutchings
Status: Closed
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S2 (Serious)
Version:
OS: Any
Assigned to: Jonas Oreland
CPU Architecture: Any

[26 Mar 2009 16:00] Andrew Hutchings
Description:
Cluster with 4 ndbd nodes (IDs 3-6):

Node 6 fails due to a hard system reset
Node 3 fails because it is in start phase 5 when Node 6 fails
Node 5 fails with ndbrequire (below)
Node 4 fails due to Arbitration

The problem is that the failure of Node 5 is unexpected; I believe it was the master node at the time. The error is:

Time: Thursday 26 March 2009 - 06:23:32
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: dbdih/DbdihMain.cpp
Error object: DBDIH (Line: 13870) 0x0000000a
Program: /usr/mysql/libexec/ndbd
Pid: 1956
Trace: /user/database/log/ndb_5_trace.log.1
Version: mysql-5.1.32 ndb-6.3.23-GA
***EOM***

In the source, the failing ndbrequire is in:

void Dbdih::nodeResetStart(Signal *signal)
...
ndbrequire(m_micro_gcp.m_master.m_state == MicroGcp::M_GCP_IDLE);

How to repeat:
.
[31 Mar 2009 13:28] Jonas Oreland
Reproduced with an error insert;
easy to fix.
[1 Apr 2009 11:26] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/71061

2894 Jonas Oreland	2009-04-01
      ndb - bug#43888 - fix race condition with ndb dying during restart, when just about to be included into gcp
[1 Apr 2009 12:36] Bugs System
Pushed into 5.1.32-ndb-6.2.18 (revid:jonas@mysql.com-20090401115605-youo23cdc00fceyr) (version source revid:jonas@mysql.com-20090401112538-7we3wp7172fa0drr) (merge vers: 5.1.32-ndb-6.2.18) (pib:6)
[1 Apr 2009 12:37] Bugs System
Pushed into 5.1.32-ndb-6.3.24 (revid:jonas@mysql.com-20090401122231-l9tvo17bvrt9u63k) (version source revid:jonas@mysql.com-20090401121609-592sd1odszpxryv5) (merge vers: 5.1.32-ndb-6.3.24) (pib:6)
[1 Apr 2009 12:38] Bugs System
Pushed into 5.1.32-ndb-7.0.5 (revid:jonas@mysql.com-20090401122817-spwyy3i31k8yx4nq) (version source revid:jonas@mysql.com-20090401122652-mei4hg1h61i10ghv) (merge vers: 5.1.32-ndb-7.0.5) (pib:6)
[2 Apr 2009 8:46] Jon Stephens
Documented bugfix in the NDB-6.2.18, 6.3.24, and 7.0.5 changelogs as follows:

        A race condition could occur when a data node failed to restart
        just before being included in the next global checkpoint. This
        could cause other data nodes to fail.
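
For readers unfamiliar with this failure mode, the following is a minimal toy sketch of why an unconditional check such as the ndbrequire quoted in the report can trip. It is not the DBDIH code and not the committed patch: the names MasterGcpState, ToyMaster, beginGcp, completeGcp, canResetStartingNode and failureArrivesMidGcp are all hypothetical, and the model assumes only what the report states, namely that the master's GCP state was expected to be idle when the failed starting node was reset.

```cpp
#include <cassert>

// Toy model only. All names here are hypothetical and do not mirror the
// real DBDIH data structures or the actual fix for this bug.
enum class MasterGcpState { Idle, InProgress };

struct ToyMaster {
    MasterGcpState state = MasterGcpState::Idle;

    // The master opens and closes a checkpoint round.
    void beginGcp()    { state = MasterGcpState::InProgress; }
    void completeGcp() { state = MasterGcpState::Idle; }

    // Stand-in for the condition the quoted ndbrequire checks: resetting a
    // failed starting node is assumed to be valid only while the master is idle.
    bool canResetStartingNode() const {
        return state == MasterGcpState::Idle;
    }
};

// Returns true when the failure of the restarting node is handled while a
// checkpoint round is still open, i.e. the case where a hard assertion on
// the idle state would bring the master down.
bool failureArrivesMidGcp() {
    ToyMaster master;
    master.beginGcp();                          // checkpoint round in progress
    bool tripped = !master.canResetStartingNode();
    master.completeGcp();                       // round finishes afterwards
    assert(master.canResetStartingNode());      // idle again: reset is safe now
    return tripped;
}
```

In this toy model, calling failureArrivesMidGcp() reports that the idle-state check fails if the node failure is processed before the checkpoint round completes, which matches the timing described in the changelog entry. How the real patch resolves the race is not shown in this report.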