Description:
This is sort of a bug, sort of a feature request.
I originally posted this bug (in 6.x):
http://bugs.mysql.com/bug.php?id=43709
Which was a duplicate of this bug (in 7.x):
http://bugs.mysql.com/bug.php?id=43224
I realise the problem was originally the side effect of another bug fix, so I assume it is not possible to revert to the way this used to work in previous versions.
I feel the new behaviour is somewhat annoying.
Earlier today I did a release upgrade on one of the ndbd nodes (updated to the latest Ubuntu release) and rebooted the node, but ndbd did not start.
I logged on to the box and tried to start it manually. After about 30 seconds I thought I had hit a repeat of the above bugs:
# /etc/init.d/ndbd
2009-04-29 11:28:46 [ndbd] INFO -- Unable to alloc node id
2009-04-29 11:28:46 [ndbd] INFO -- Error : Could not alloc node id at 192.168.10.3 port 1186: No free node id found for ndbd(NDB).
error=2350
2009-04-29 11:28:46 [ndbd] INFO -- Error handler restarting system
2009-04-29 11:28:46 [ndbd] INFO -- Error handler shutdown completed - exiting
sphase=0
exit=-1
Then I noticed in my cluster log that the new functionality seems to be taking over (below are the complete log entries, taken some time after I had already restarted the node):
2009-04-29 11:23:04 [MgmSrvr] INFO -- Node 4: Node shutdown completed. Initiated by signal 15.
2009-04-29 11:24:02 [MgmSrvr] WARNING -- Node 3: Failure handling of node 4 has not completed in 1 min. - state = 3
2009-04-29 11:25:03 [MgmSrvr] WARNING -- Node 3: Failure handling of node 4 has not completed in 2 min. - state = 3
2009-04-29 11:26:04 [MgmSrvr] WARNING -- Node 3: Failure handling of node 4 has not completed in 3 min. - state = 3
2009-04-29 11:27:04 [MgmSrvr] WARNING -- Node 3: Failure handling of node 4 has not completed in 4 min. - state = 3
2009-04-29 11:28:05 [MgmSrvr] WARNING -- Node 3: Failure handling of node 4 has not completed in 5 min. - state = 3
2009-04-29 11:28:47 [MgmSrvr] INFO -- Node 3: Communication to Node 4 opened
The annoyance is that, after a simple reboot of the machine, I had to wait 5 minutes before I could log in and manually start ndbd, because of this time delay.
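For anyone wanting to measure the delay on their own cluster, here is a rough Python sketch that reads it off cluster log lines in the format pasted above. The function name failure_handling_delay and the regular expression are my own illustration, not part of any MySQL Cluster tooling, and it assumes the log lines look exactly like the ones in this report.

```python
import re
from datetime import datetime

# Assumed line shape, taken from the excerpt in this report:
# "2009-04-29 11:23:04 [MgmSrvr] INFO -- Node 4: Node shutdown completed. ..."
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[MgmSrvr\] \w+ -- (.*)$"
)


def failure_handling_delay(log_lines, node_id):
    """Return the time between the shutdown of node_id and the moment
    another node reopens communication to it, or None if not found."""
    shutdown_at = None
    for line in log_lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip anything that is not a management server log line
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        message = match.group(2)
        if message.startswith(f"Node {node_id}: Node shutdown completed"):
            shutdown_at = stamp
        elif shutdown_at is not None and message.endswith(
            f"Communication to Node {node_id} opened"
        ):
            return stamp - shutdown_at
    return None
```

Run against the log entries above with node_id=4, this gives 5 minutes 43 seconds between the shutdown and "Communication to Node 4 opened".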
How to repeat:
N/A
Suggested fix:
A few suggestions as alternative behaviour:
1. Reduce the timeout.
2. Friendlier error messages would be helpful, e.g. "Waiting for node x to perform failure handling for your node id."
3. The management server reports to the node attempting to start that other node(s) are still completing failure handling, and the recovering node waits a similar amount of time before giving up (maybe reporting to the cluster log that it is doing so).
4. Node startup sends a signal to the running nodes to immediately release the node id, so that the node can start up immediately as normal.
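Until something like suggestion 3 exists, the waiting could be approximated from outside ndbd with a wrapper that keeps retrying the start. The following is only a sketch of that idea in Python. The names start_with_retry and start_ndbd are hypothetical, the 6 minute timeout and 15 second interval are guesses based on the log above, and it assumes the init script accepts a "start" argument and returns a non-zero exit status when the node id cannot be allocated, which I have not verified.

```python
import subprocess
import time


def start_with_retry(start, timeout_s=360.0, interval_s=15.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Call start() until it returns True or timeout_s has elapsed.

    Returns the number of attempts made. Raises TimeoutError if the next
    retry would fall after the deadline.
    """
    deadline = clock() + timeout_s
    attempts = 0
    while True:
        attempts += 1
        if start():
            return attempts
        if clock() + interval_s > deadline:
            raise TimeoutError(
                f"start did not succeed after {attempts} attempts"
            )
        sleep(interval_s)


def start_ndbd():
    # Assumption: the init script path is the one used in this report, it
    # takes a "start" argument, and exit status 0 means ndbd got a node id.
    return subprocess.call(["/etc/init.d/ndbd", "start"]) == 0
```

Calling start_with_retry(start_ndbd) from a boot script would then keep trying for roughly as long as the other nodes take to finish failure handling, rather than failing once and needing a manual login.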