Bug #78922 Autotest testNodeRestart -n Bug18612SR sometimes did not clear error.
Submitted: 22 Oct 2015 11:59 Modified: 3 Mar 2016 15:23
Reporter: Mauritz Sundell
Status: Closed
Category:Tests: Cluster Severity:S3 (Non-critical)
Version:7.4.8 OS:Any
Assigned to: CPU Architecture:Any

[22 Oct 2015 11:59] Mauritz Sundell
Description:
Autotest testNodeRestart -n Bug18612SR sometimes did not clear error.

Occasionally only half of the cluster stopped during the test instead of
the whole cluster.

In these cases the errors injected in the surviving partition were
still set when the cluster was started again.

This could introduce test failures, both in the test itself and in
following tests run on the same cluster.

Test testIndex -n DeferredError has been seen failing due to this.

How to repeat:
Code inspection:
In function runBug18612SR one waits for all nodes to stop:
    g_err << "Waiting cluster/nodes no-start" << endl;
    if (restarter.waitClusterNoStart(30))
      if (restarter.waitNodesNoStart(partition0, cnt/2, 10))
        if (restarter.waitNodesNoStart(partition1, cnt/2, 10))
          return NDBT_FAILED;
But the above check also passes if only one partition stops.
In that case the error inserts might not have been cleared for the nodes in the other partition.
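
The effect of the nesting can be shown in isolation. The following is a minimal sketch, not the test's code: WaitResults and quotedCheckFails are hypothetical names, and it assumes each wait call returns 0 on success and non-zero on timeout.

```cpp
#include <cassert>

// Hypothetical model of the three wait calls in the quoted check.
// Assumption: 0 means the wait succeeded, non-zero means it timed out.
struct WaitResults {
  int cluster;     // models restarter.waitClusterNoStart(30)
  int partition0;  // models restarter.waitNodesNoStart(partition0, cnt/2, 10)
  int partition1;  // models restarter.waitNodesNoStart(partition1, cnt/2, 10)
};

// Same nesting as the quoted check: NDBT_FAILED is reached only when
// all three waits fail, so one stopped partition is enough to pass.
bool quotedCheckFails(const WaitResults& r) {
  if (r.cluster)
    if (r.partition0)
      if (r.partition1)
        return true;
  return false;
}
```

With this model, a run where the cluster-wide wait times out, partition0 stops and partition1 keeps running ({1, 0, 1}) is not reported as a failure, which is the half-stopped case described above.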

Also, one can see data nodes crashing on error 932 in following tests that do not even use error 932, like testIndex -n DeferredError in 7.4.

Suggested fix:
Make sure to clear the injected error in cases where only half the cluster stops.
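
One possible shape of such a cleanup is sketched below. This is not the pushed patch: FakeRestarter, isStopped and clearErrorOnSurvivors are hypothetical stand-ins so the example is self-contained, and it assumes that inserting error 0 in a node clears its current error insert and that nodes which did stop no longer need clearing.

```cpp
#include <cassert>
#include <map>
#include <vector>

// Hypothetical stand-in for the test's restarter; not the real NdbRestarter.
struct FakeRestarter {
  std::map<int, int> insertedError;  // node id -> currently injected error
  std::map<int, bool> stopped;       // node id -> true if node reached no-start

  // Assumption: inserting error 0 clears the node's error insert.
  void insertErrorInNode(int nodeId, int error) { insertedError[nodeId] = error; }

  bool isStopped(int nodeId) const {
    auto it = stopped.find(nodeId);
    return it != stopped.end() && it->second;
  }
};

// Clear the injected error on every listed node that is still running.
// Returns the number of nodes that were cleared.
int clearErrorOnSurvivors(FakeRestarter& r, const std::vector<int>& nodes) {
  int cleared = 0;
  for (int node : nodes) {
    if (!r.isStopped(node)) {
      r.insertErrorInNode(node, 0);
      ++cleared;
    }
  }
  return cleared;
}
```

Calling such a helper after the wait step, before the cluster is started again, would leave no surviving node with error 932 set for the next test.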
[3 Mar 2016 10:13] Mauritz Sundell
Posted by developer:
 
Pushed to 7.2.23, 7.3.12, 7.4.9, 7.5.0.
Test case fix.
No documentation needed.
[3 Mar 2016 15:23] Jon Stephens
Testing only, no changelog entry needed. Fixed in NDB 7.2.23, 7.3.12, 7.4.9, 7.5.0. Closed.