MySQL Bugs: #55804: <node> RESTART -n followed by <node> START fails

Bug #55804	<node> RESTART -n followed by <node> START fails
Submitted:	6 Aug 2010 14:36	Modified:	3 Mar 2011 13:45
Reporter:	Anders Karlsson
Status:	Verified
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S3 (Non-critical)
Version:	mysql-5.1-telco-7.1	OS:	Windows (XP)
Assigned to:		CPU Architecture:	Any

Description:
Running on Windows with a simple Cluster setup of 2 data nodes and 1 mgm node, restarting a data node with the nostart flag (<nodeid> RESTART -n) causes errors, but not consistently. Sometimes it works, but mostly you get a whole bunch of errors from mgmd. Sometimes the Cluster starts off while errors are still flowing from mgmd. If you then START the node (<nodeid> START), it starts, but the errors keep coming. Sometimes the data node gets stuck in the "starting" phase. All sorts of problems.

How to repeat:
Set up a cluster configured as below (my setup):
<config>
[ndbd default]
NoOfReplicas=2

[mysqld  default]
[ndb_mgmd default]
[tcp default]

[ndb_mgmd]
PortNumber=1186
HostName=127.0.0.1

[ndbd]
HostName=127.0.0.1
DataDir=C:/MySQL714b/node1/data

[ndbd]
HostName=127.0.0.1
DataDir=C:/MySQL714b/node2/data

[mysqld]
[mysqld]
[mysqld]
</config>
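As a quick sanity check of the setup above, the node counts can be derived from the config text. This is only a minimal sketch: the function name summarize_config is hypothetical, and it assumes the plain "[section]" / "key=value" layout shown in this report. It relies on the usual NDB rule that the number of node groups is the number of data nodes divided by NoOfReplicas.

```python
import re

# Matches a section header such as "[ndbd]" or "[mysqld  default]".
SECTION = re.compile(r"^\s*\[\s*([^\]]+?)\s*\]\s*$")


def summarize_config(config_text):
    """Count node sections and derive the number of node groups.

    Repeated headers such as [ndbd] each define a separate node, so the
    sections are counted rather than merged.
    """
    counts = {}
    replicas = None
    current = None
    for line in config_text.splitlines():
        header = SECTION.match(line)
        if header:
            # Normalize case and inner whitespace ("mysqld  default").
            current = " ".join(header.group(1).lower().split())
            counts[current] = counts.get(current, 0) + 1
        elif current == "ndbd default" and "=" in line:
            key, value = (part.strip() for part in line.split("=", 1))
            if key.lower() == "noofreplicas":
                replicas = int(value)
    data_nodes = counts.get("ndbd", 0)
    return {
        "data_nodes": data_nodes,
        "mgm_nodes": counts.get("ndb_mgmd", 0),
        "api_nodes": counts.get("mysqld", 0),
        "replicas": replicas,
        # Node groups = data nodes / NoOfReplicas (None if not configured).
        "node_groups": data_nodes // replicas if replicas else None,
    }
```

For the config in this report, this gives 2 data nodes, 1 mgm node, 3 API slots and a single node group, which agrees with both data nodes being listed under "Nodegroup: 0" in the show output.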

Start the mgm node and the data nodes from 3 separate DOS windows, so you can see the output from each of them.
Go into ndb_mgm and do a show:
<command>
ndb_mgm> show
Connected to Management Server at: localhost:1186
Cluster Configuration
---------------------
[ndbd(NDB)]     2 node(s)
id=2    @127.0.0.1  (mysql-5.1.44 ndb-7.1.4, Nodegroup: 0, Master)
id=3    @127.0.0.1  (mysql-5.1.44 ndb-7.1.4, Nodegroup: 0)

[ndb_mgmd(MGM)] 1 node(s)
id=1    @127.0.0.1  (mysql-5.1.44 ndb-7.1.4)

[mysqld(API)]   3 node(s)
id=4 (not connected, accepting connect from any host)
id=5 (not connected, accepting connect from any host)
id=6 (not connected, accepting connect from any host)
</command>

Now, restart node 3 with the nostart flag and show the status:
<command>
ndb_mgm> 3 restart -n
Node 3: Node shutdown initiated
Node 3: Node shutdown completed, restarting, no start.
Node 3 is being restarted

ndb_mgm> show
Cluster Configuration
---------------------
[ndbd(NDB)]     2 node(s)
id=2    @127.0.0.1  (mysql-5.1.44 ndb-7.1.4, Nodegroup: 0, Master)
id=3    @127.0.0.1  (mysql-5.1.44 ndb-7.1.4, not started)

[ndb_mgmd(MGM)] 1 node(s)
id=1    @127.0.0.1  (mysql-5.1.44 ndb-7.1.4)

[mysqld(API)]   3 node(s)
id=4 (not connected, accepting connect from any host)
id=5 (not connected, accepting connect from any host)
id=6 (not connected, accepting connect from any host)
</command>
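To check the node state from a script instead of by eye, the show output can be parsed. The following is a rough sketch that assumes the exact line format seen in this report; the function name parse_connected_nodes is hypothetical and other Cluster versions may print the lines differently.

```python
import re

# Matches connected nodes, e.g.
#   id=3    @127.0.0.1  (mysql-5.1.44 ndb-7.1.4, not started)
# Unconnected API slots have no "@host" part and are skipped.
NODE_LINE = re.compile(r"^id=(\d+)\s+@(\S+)\s+\((.*)\)\s*$")


def parse_connected_nodes(show_output):
    """Return {node_id: {"host", "version", "flags"}} for connected nodes."""
    nodes = {}
    for line in show_output.splitlines():
        match = NODE_LINE.match(line.strip())
        if not match:
            continue
        fields = [field.strip() for field in match.group(3).split(",")]
        nodes[int(match.group(1))] = {
            "host": match.group(2),
            "version": fields[0],
            # Everything after the version, e.g. "Nodegroup: 0", "Master"
            # or "not started".
            "flags": fields[1:],
        }
    return nodes
```

After the RESTART -n above, node 3 would be expected to carry the single flag "not started", while node 2 keeps "Nodegroup: 0" and "Master".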

Now, start node 3:
<command>
ndb_mgm> 3 start
</command>

Now, many things can happen here. Sometimes node 3 starts and all is fine. Sometimes, node 3 just dies:
<command>
ndb_mgm> 3 start
Start failed.
*    22: Error
*        No contact with the process (dead ?).: Permanent error: Application error
</command>
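Since the failure is intermittent, it may help to script the RESTART -n / START sequence and run it repeatedly. The sketch below is an assumption-laden illustration, not a tested repro tool: it assumes ndb_mgm is on the PATH, uses its -c (connect string) and -e (execute) options, and the helper names and the 5 second pause are hypothetical. The only failure it recognizes is the "Start failed." text shown above.

```python
import subprocess
import time


def run_ndb_mgm(command, connect="127.0.0.1:1186"):
    """Run one management client command non-interactively, return its output."""
    result = subprocess.run(
        ["ndb_mgm", "-c", connect, "-e", command],
        capture_output=True,
        text=True,
    )
    return result.stdout + result.stderr


def restart_then_start(node_id, run=run_ndb_mgm, settle_seconds=5.0):
    """Issue '<id> restart -n' followed by '<id> start' and report the outcome.

    `run` is injectable so the sequence can be exercised without a cluster.
    """
    restart_out = run(f"{node_id} restart -n")
    # Assumed pause so the node has time to reach the "not started" state.
    time.sleep(settle_seconds)
    start_out = run(f"{node_id} start")
    return {
        "restart": restart_out,
        "start": start_out,
        # Matches the failure text seen in this report.
        "start_failed": "Start failed." in start_out,
    }
```

Calling restart_then_start(3) in a loop and counting the results with start_failed set would give a rough failure rate for a given build.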

Often, the restart itself works as expected and node 3 stays in the "not started" state:
<command>
ndb_mgm> show
Cluster Configuration
---------------------
[ndbd(NDB)]     2 node(s)
id=2    @127.0.0.1  (mysql-5.1.44 ndb-7.1.4, Nodegroup: 0, Master)
id=3    @127.0.0.1  (mysql-5.1.44 ndb-7.1.4, not started)

[ndb_mgmd(MGM)] 1 node(s)
id=1    @127.0.0.1  (mysql-5.1.44 ndb-7.1.4)

[mysqld(API)]   3 node(s)
id=4 (not connected, accepting connect from any host)
id=5 (not connected, accepting connect from any host)
id=6 (not connected, accepting connect from any host)
</command>

But ndb_mgmd throws out errors like crazy:
<output>
2010-08-06 16:32:07 [MgmtSrvr] WARNING  -- Failed to convert connection from '127.0.0.1:4342' to transporter
Failed to report event to event log, error: 1502
2010-08-06 16:32:07 [MgmtSrvr] WARNING  -- Failed to convert connection from '127.0.0.1:4343' to transporter
Failed to report event to event log, error: 1502
2010-08-06 16:32:07 [MgmtSrvr] WARNING  -- Failed to convert connection from '127.0.0.1:4344' to transporter
Failed to report event to event log, error: 1502
2010-08-06 16:32:07 [MgmtSrvr] WARNING  -- Failed to convert connection from '127.0.0.1:4345' to transporter
Failed to report event to event log, error: 1502
2010-08-06 16:32:08 [MgmtSrvr] WARNING  -- Failed to convert connection from '127.0.0.1:4346' to transporter
Failed to report event to event log, error: 1502
</output>

At this point, starting node 3 sometimes works, sometimes not. But when it DOES work, ndb_mgmd just goes on throwing even more errors.

Tested:
mysql-5.1.44-ndb-7.1.4b
mysql-5.1.44-ndb-7.1.5 

On RHEL 5, no issues at all.