Bug #44540 NDBD restart
Submitted: 29 Apr 2009 13:34 Modified: 7 May 2009 13:19
Reporter: Phil Bayfield
Status: Verified
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S3 (Non-critical)
Version:mysql-5.1-telco-7.0 OS:Linux
Assigned to: CPU Architecture:Any
Triage: Triaged: D5 (Feature request) / R6 (Needs Assessment) / E6 (Needs Assessment)

[29 Apr 2009 13:34] Phil Bayfield
Description:
This is sort of a bug, sort of a feature request.

I originally posted this bug (against 6.x):

http://bugs.mysql.com/bug.php?id=43709

Which was a duplicate of this bug (in 7.x):

http://bugs.mysql.com/bug.php?id=43224

I realise the problem was originally a side effect of another bug fix, so I assume it is not possible to revert to the way this worked in previous versions.

I feel the new behaviour is somewhat annoying.

Earlier today I did a release upgrade on one of the ndbd nodes (updated to latest Ubuntu release), rebooted the node and ndbd did not start.

I logged on to the box and tried to start it manually; after about 30 seconds I thought I had hit a repeat of the above bugs:

# /etc/init.d/ndbd
2009-04-29 11:28:46 [ndbd] INFO     -- Unable to alloc node id
2009-04-29 11:28:46 [ndbd] INFO     -- Error : Could not alloc node id at 192.168.10.3 port 1186: No free node id found for ndbd(NDB).
error=2350
2009-04-29 11:28:46 [ndbd] INFO     -- Error handler restarting system
2009-04-29 11:28:46 [ndbd] INFO     -- Error handler shutdown completed - exiting
sphase=0
exit=-1

Then I noticed in the cluster log that the new functionality seems to be taking over (below are the complete log entries, some time after I had already restarted the node):

2009-04-29 11:23:04 [MgmSrvr] INFO     -- Node 4: Node shutdown completed. Initiated by signal 15.
2009-04-29 11:24:02 [MgmSrvr] WARNING  -- Node 3: Failure handling of node 4 has not completed in 1 min. - state = 3
2009-04-29 11:25:03 [MgmSrvr] WARNING  -- Node 3: Failure handling of node 4 has not completed in 2 min. - state = 3
2009-04-29 11:26:04 [MgmSrvr] WARNING  -- Node 3: Failure handling of node 4 has not completed in 3 min. - state = 3
2009-04-29 11:27:04 [MgmSrvr] WARNING  -- Node 3: Failure handling of node 4 has not completed in 4 min. - state = 3
2009-04-29 11:28:05 [MgmSrvr] WARNING  -- Node 3: Failure handling of node 4 has not completed in 5 min. - state = 3
2009-04-29 11:28:47 [MgmSrvr] INFO     -- Node 3: Communication to Node 4 opened

The annoyance is that, after a simple reboot of the machine, I had to wait 5 minutes and then log in and manually start ndbd because of this time delay.
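As an interim workaround (my own sketch, not something from the bug report or the NDB tooling), the manual restart could be automated by retrying the start command until the management server finally hands out the node id, rather than giving up on the first "No free node id" error. The `retry_start` helper and the retry/delay values below are hypothetical; it assumes ndbd is on the PATH and the connect string is configured elsewhere:

```shell
#!/bin/sh
# Sketch of a retry wrapper: keep attempting to start a command until it
# succeeds or the attempt budget runs out.
# retry_start CMD RETRIES DELAY -> returns 0 on success, 1 if all attempts fail.
retry_start() {
    cmd=$1
    retries=$2
    delay=$3
    i=0
    while [ "$i" -lt "$retries" ]; do
        if $cmd; then
            return 0          # command started successfully
        fi
        i=$((i + 1))
        sleep "$delay"        # wait before the next attempt
    done
    return 1                  # gave up after $retries attempts
}

# Intended use on the rebooted data node (values are assumptions):
#   retry_start ndbd 30 10    # ~5 minutes total, matching the observed delay
```

This only papers over the delay rather than fixing it, but it would at least remove the need to sit and watch the box after a reboot.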

How to repeat:
N/A

Suggested fix:
A few suggestions for alternative behaviour:

1. Reduce the timeout.

2. Friendlier error messages would be helpful, e.g. "Waiting for node x to complete failure handling for your node id."

3. The management server reports to the node attempting to start that other node(s) are still completing failure handling, and the recovering node waits a similar amount of time before giving up (perhaps logging to the cluster log that it is doing so).

4. Node startup sends a signal to the running nodes to release the node id immediately, so that the node can start up straight away as normal.
[7 May 2009 13:19] Jonathan Miller
http://bugs.mysql.com/44540