MySQL Bugs: #43224: ndbmtd refuses to restart due to node id allocation failure

Bug #43224	ndbmtd refuses to restart due to node id allocation failure
Submitted:	26 Feb 2009 11:37	Modified:	20 Mar 2009 9:32
Reporter:	Guido Ostkamp
Status:	Closed
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S1 (Critical)
Version:	mysql-5.1-telco-6.3	OS:	Solaris
Assigned to:	Jonas Oreland	CPU Architecture:	Any

Description:
Hello,

During high-availability tests on a running cluster with two data nodes (nodes 2 and 3), we killed ndbmtd on node 3, restarted it, and killed it again during the restart. Two application processes were running on node 2 (one processing inserts, one processing deletes).

Node 3 now refuses to come up, with:

2009-02-26 12:25:10 [ndbd] INFO     -- Unable to alloc node id
2009-02-26 12:25:10 [ndbd] INFO     -- Error : Could not alloc node id at eibe port 1186: Id 3 already allocated by another node.
error=2350
2009-02-26 12:25:10 [ndbd] INFO     -- Error handler restarting system
2009-02-26 12:25:10 [ndbd] INFO     -- Error handler shutdown completed - exiting
sphase=0
exit=-1
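
For anyone scripting restart tests like this one, the failure can be detected from the output above. The following is a minimal sketch and not part of the NDB tooling: the function name `find_node_id_conflict` and the regular expression are assumptions based only on the single log line quoted in this report, and other versions may word the message differently.

```python
import re

# Pattern derived from the one log line quoted in this report (assumption:
# the message format is stable across runs of the same build).
ALLOC_FAILURE = re.compile(
    r"Could not alloc node id at (?P<host>\S+) port (?P<port>\d+): "
    r"Id (?P<node_id>\d+) already allocated by another node"
)


def find_node_id_conflict(log_text):
    """Return (host, port, node_id) for the first allocation conflict, or None."""
    match = ALLOC_FAILURE.search(log_text)
    if match is None:
        return None
    return (
        match.group("host"),
        int(match.group("port")),
        int(match.group("node_id")),
    )
```

A test harness could call this on the captured data node output after each restart attempt and flag the run when a conflict is still reported after the expected wait.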

This error persists, even when retrying and waiting several minutes between retries.

Only node 2 (our first data node) is still running in the cluster; there are no other nodes. 'netstat -a' on the management node shows no connections between the management node and node 3.

We are using bazaar version tomas.ulin@sun.com-20090225160230-u4guch19txy3gcew dated Wed 2009-02-25 17:02:30 on Solaris 10 Sparc compiled with Sun Studio 12 using CC=cc CXX=CC CFLAGS="-xO5 -fast -g -mt -m64" CXXFLAGS="-xO5 -fast -g -mt -m64" ./configure --prefix=/export/home/wsch/6.4_2009_01_29 --with-plugins=all --without-docs --without-man.

I will upload full logs shortly.

Regards

Guido Ostkamp

How to repeat:
see above

ndb_error_reporter output + logs of mgmt node uploaded to FTP server bug-data-43224.tar.bz2.

Verified: this was introduced by http://bugs.mysql.com/bug.php?id=42973
The problem occurs when an ndbd dies after allocating a node id, but before
making contact with any other ndbd.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/69709

2928 Jonas Oreland	2009-03-19
      ndb - bug#43224 - also let ndbd node allocations timeout
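
The idea named in this commit message can be illustrated with a toy model. This is a sketch under stated assumptions, not the actual patch: the class `NodeIdTable`, its methods, and the timeout value are hypothetical, and the only behaviour taken from this report is that a node id allocated by a data node that dies before contacting any other data node should eventually become allocatable again.

```python
import time


class NodeIdTable:
    """Toy model of node id reservations (hypothetical, not NDB code)."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock          # injectable so the timeout can be tested
        self.pending = {}           # node_id -> time the id was allocated
        self.confirmed = set()      # ids whose owner has joined the cluster

    def alloc(self, node_id):
        """Try to allocate node_id; return True on success."""
        if node_id in self.confirmed:
            return False
        started = self.pending.get(node_id)
        if started is not None and self.clock() - started < self.timeout_s:
            # Someone allocated this id recently and may still be starting.
            return False
        # Either never allocated, or the previous allocation timed out
        # because its owner never made contact with another data node.
        self.pending[node_id] = self.clock()
        return True

    def confirm(self, node_id):
        """Record that the owner of node_id contacted another data node."""
        self.pending.pop(node_id, None)
        self.confirmed.add(node_id)
```

Without the timeout check in `alloc`, an id allocated by a process that is killed before calling `confirm` would stay blocked indefinitely, which matches the symptom described in this report.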

Pushed into 5.1.32-ndb-6.3.24 (revid:jonas@mysql.com-20090319085713-elh5xpfhbnlr74s4) (version source revid:jonas@mysql.com-20090319085713-elh5xpfhbnlr74s4) (merge vers: 5.1.32-ndb-6.3.24) (pib:6)

Pushed into 5.1.32-ndb-7.0.5 (revid:jonas@mysql.com-20090319085927-6qdhku0tcnkuvun1) (version source revid:jonas@mysql.com-20090319085927-6qdhku0tcnkuvun1) (merge vers: 5.1.32-ndb-7.0.5) (pib:6)

Documented bugfix in the NDB-6.3.24 and 7.0.5 changelogs as follows:

        When a data node process had been killed after allocating a node
        ID, but before making contact with any other data node
        processes, it was not possible to restart it due to a node ID
        allocation failure.

        This issue could affect either ndbd or ndbmtd processes.