Bug #61220 NDBD Shutdown time lag prevents successful restart
Submitted: 18 May 2011 23:24
Modified: 22 Jun 2011 11:37
Reporter: Mike Reid
Status: No Feedback
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S3 (Non-critical)
Version: 7.1.12
OS: Linux (Ubuntu 10.10)
Assigned to:
CPU Architecture: Any
Tags: cluster, MySQL, restart

[18 May 2011 23:24] Mike Reid
Description:
[MgmtSrvr] WARNING  -- Failure handling of node X has not completed in 1 min - state = 6
[MgmtSrvr] WARNING  -- Allocate nodeid (4) failed. Connection from ip x.x.x.x Returned error string "Id X already allocated by another node."

"It seems ndbd is not shutting down quickly enough, so when the angel/watchdog process goes to restart it, there's a failure because the node id isn't allocatable yet."

How to repeat:
Try performing a rolling restart:

1) Update config.ini.
2) Stop ndb_mgmd via ndb_mgm> <id> STOP.
3) Restart it with ndb_mgmd -f /path/to/config.ini --reload.
4) From within ndb_mgm, issue <id> RESTART for each ndbd node.

Upon issuing ndb_mgm> <id> RESTART, the ndbd/ndbmtd process is never restarted (see the sketch below).
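A minimal sketch of this reproduction sequence, assuming a management node with id 1 and data nodes with ids 3 and 4 (all node ids and paths here are placeholders):

    # Stop the management node from the ndb_mgm client
    ndb_mgm -e "1 STOP"

    # Start the management node again so it re-reads the edited config.ini
    ndb_mgmd -f /path/to/config.ini --reload

    # Rolling restart of each data node; after this command the
    # ndbd/ndbmtd process never comes back up
    ndb_mgm -e "3 RESTART"
    ndb_mgm -e "4 RESTART"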

Suggested fix:
Ensure proper handling of RESTART when a fixed NodeId is required but not yet available, perhaps by waiting and attempting the restart again (similar to StopOnError=0; note that setting this did not seem to affect this particular issue).
[22 May 2011 11:37] Geert Vanderkelen
First, the rolling restart procedure you are showing is not really correct.
You are stopping the data node, but you should simply:
1) restart ndb_mgmd
2) ndb_mgm> <data_node_id> RESTART (see the sketch below)
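A minimal sketch of that sequence, again assuming management node id 1 and data node ids 3 and 4 (placeholders):

    # 1) Restart the management server so it picks up the edited config.ini
    #    (stop the running ndb_mgmd process first, then start it with --reload)
    ndb_mgmd -f /path/to/config.ini --reload

    # 2) Restart each data node directly, without a separate STOP
    ndb_mgm -e "3 RESTART"
    ndb_mgm -e "4 RESTART"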

For the allocation failure, can you check the cluster log for when the connection to the data node was 'Opened' again? It might take some time, but not long. Perhaps upload the cluster log (the log written by the management node).
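For reference, a quick way to search for those events (the path below is an assumption; by default the cluster log is written in the management node's DataDir as ndb_<mgm_nodeid>_cluster.log):

    # Find when the connection to the data node was reported 'Opened' again
    grep -n "Opened" /var/lib/mysql-cluster/ndb_1_cluster.log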
[22 Jun 2011 23:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".