Bug #49464 Send start of node start command ignored in restart
Submitted: 4 Dec 2009 18:57 Modified: 28 Nov 2016 14:13
Reporter: Andrew Hutchings
Status: Closed
Category: MySQL Cluster: Cluster (NDB) storage engine Severity: S3 (Non-critical)
Version: mysql-5.1-telco-6.3 OS: Any
Assigned to: Magnus Blåudd CPU Architecture: Any
Tags: 6.3, 7.0
Triage: Triaged: D3 (Medium) / R6 (Needs Assessment) / E6 (Needs Assessment)

[4 Dec 2009 18:57] Andrew Hutchings
Description:
When restarting a node, the MgmtSrvr::restartNodes() function puts the nodes into a nostart state and, after a wait, sends the 'start' signal.

If the node is not ready for this signal, the signal may be silently ignored: no error is reported, but the data node remains in the nostart state.

int MgmtSrvr::restartNodes(const Vector<NodeId> &node_ids,
                           int * stopCount, bool nostart,
                           bool initialStart, bool abort,
                           int *stopSelf)
{
...
  for (unsigned i = 0; i < node_ids.size(); i++)
  {
    (void) start(node_ids[i]);  // result discarded, so a send failure is never reported
  }
  return 0;
}

int MgmtSrvr::start(int nodeId)
{
...
  return ss.sendSignal(nodeId, &ssig) == SEND_OK ? 0 : SEND_OR_RECEIVE_FAILED;
}
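
To make the failure mode concrete, here is a small self-contained model of it. It is only a sketch: FakeNode, NodeState, sendStart() and demoLostStart() are hypothetical names invented for illustration and do not exist in the NDB source. The assumption is that a node honours a start request only once it has reached the not-started state and otherwise drops it without replying.

```cpp
#include <cassert>

// Hypothetical model of a data node; none of these names come from NDB.
enum class NodeState { Started, Stopping, NotStarted };

struct FakeNode {
  NodeState state = NodeState::Started;

  // Assumed behaviour: a start request is honoured only in NotStarted.
  // In any other state it is dropped and no error is sent back.
  void onStartSignal() {
    if (state == NodeState::NotStarted) state = NodeState::Started;
  }
};

// Fire-and-forget send. The return value only says that the signal left
// the sender, much like the SEND_OK check quoted above.
bool sendStart(FakeNode &node) {
  node.onStartSignal();
  return true;
}

// Walks through the lost start: the send "succeeds" but nothing happens.
void demoLostStart() {
  FakeNode node;
  node.state = NodeState::Stopping;           // shutdown not yet complete
  bool sent = sendStart(node);
  assert(sent);                               // sender sees success
  assert(node.state == NodeState::Stopping);  // node never started
}
```

In this model the sender cannot tell a delivered-and-ignored signal from a delivered-and-applied one, which is why checking the send result alone would not be enough.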

How to repeat:
I haven't worked out a specific test case. Possibly block the signal somehow, or make the node's transition from stopped to the 'not started' state take longer than 12 seconds.

Suggested fix:
1. Have a definite timeout, with an error stating that the node could not be started after X seconds and should be started manually.
2. Capture node start signal errors and report them back via mgmapi.
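
A rough sketch of what the two suggestions could look like together, assuming the caller can supply a way to send the start request and a way to ask whether the node has started. startAndConfirm() and both callbacks are hypothetical; this is not mgmapi code, and the error text simply follows the wording suggested above.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <string>
#include <thread>

// Hypothetical helper: sends a start request, then polls until the node
// reports started or the timeout expires.
// Returns an empty string on success, otherwise an error message.
std::string startAndConfirm(
    int nodeId,
    const std::function<bool()> &sendStart,   // true if the send succeeded
    const std::function<bool()> &isStarted,   // true once the node is up
    std::chrono::seconds timeout,
    std::chrono::milliseconds poll = std::chrono::milliseconds(100)) {
  assert(poll.count() > 0);

  // Suggestion 2: surface a failed send instead of discarding it.
  if (!sendStart())
    return "Node " + std::to_string(nodeId) +
           ": sending the start signal failed";

  // Suggestion 1: a definite timeout with an explicit error.
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  while (!isStarted()) {
    if (std::chrono::steady_clock::now() >= deadline)
      return "Node " + std::to_string(nodeId) +
             " could not be started after " +
             std::to_string(timeout.count()) +
             " seconds, please start manually";
    std::this_thread::sleep_for(poll);
  }
  return std::string();
}
```

The value for X is left to the caller here, since the report does not propose one.
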
[28 Nov 2016 14:13] Jon Stephens
Documented fix in the NDB 7.5.5 changelog as follows:

    When a data node was restarted, the node was first stopped, and
    then, after a fixed wait, the management server assumed that the
    node had entered the NOT_STARTED state, at which point, the node
    was sent a start signal. If the node was not ready because it
    had not yet completed stopping (and was therefore not actually
    in NOT_STARTED), the signal was silently ignored.

    To fix this issue, the management server now checks to see
    whether the data node has in fact reached the NOT_STARTED state
    before sending the start signal. The wait for the node to reach
    this state is split into two separate checks:

       - Wait for data nodes to start shutting down (maximum 12
         seconds)

       - Wait for data nodes to complete shutting down and reach
         NOT_STARTED state (maximum 120 seconds)
        
    If either of these cases times out, the restart is considered
    failed, and an appropriate error is returned.
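
For readers who want to picture the two checks, the logic described in the changelog entry can be sketched as below. This is an illustration only, not the actual management server code: waitFor(), waitUntilNotStarted(), the RestartWait result codes and the getState callback are all hypothetical, as is the 100 ms polling interval. Only the 12 second and 120 second limits are taken from the changelog text.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

enum class NodeState { Started, Stopping, NotStarted };
enum class RestartWait { Ready, ShutdownNotBegun, ShutdownNotCompleted };

using Millis = std::chrono::milliseconds;

// Polls `pred` until it returns true or `timeout` has elapsed.
static bool waitFor(const std::function<bool()> &pred,
                    Millis timeout, Millis poll) {
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  while (true) {
    if (pred()) return true;
    if (std::chrono::steady_clock::now() >= deadline) return false;
    std::this_thread::sleep_for(poll);
  }
}

// Returns Ready only when the node has verifiably reached NotStarted,
// which is the point at which a start signal would be sent.
RestartWait waitUntilNotStarted(
    const std::function<NodeState()> &getState,
    Millis beginTimeout = Millis(12000),      // "maximum 12 seconds"
    Millis completeTimeout = Millis(120000),  // "maximum 120 seconds"
    Millis poll = Millis(100)) {              // assumed polling interval
  assert(poll.count() > 0);

  // Check 1: the node has begun shutting down (left the Started state).
  if (!waitFor([&] { return getState() != NodeState::Started; },
               beginTimeout, poll))
    return RestartWait::ShutdownNotBegun;

  // Check 2: the shutdown has completed and the node is in NotStarted.
  if (!waitFor([&] { return getState() == NodeState::NotStarted; },
               completeTimeout, poll))
    return RestartWait::ShutdownNotCompleted;

  return RestartWait::Ready;
}
```

A caller would send the start signal only on Ready and treat either of the other two results as a failed restart.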

Closed.
[25 Oct 2018 23:41] Jon Stephens
See also BUG#92621.