MySQL Bugs: #34201: Unable stop a node when a node in a different group is in "not started" state

Bug #34201	Unable stop a node when a node in a different group is in "not started" state
Submitted:	31 Jan 2008 17:59	Modified:	2 Apr 2008 19:31
Reporter:	David Shrewsbury
Status:	Closed
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S2 (Serious)
Version:	5.0, 5.1	OS:	Linux
Assigned to:	Tomas Ulin	CPU Architecture:	Any

Description:
In a Cluster with 2 node groups (0 and 1), if a data node in group 0 is placed in the "not started" state (RESTART -n), then you are not allowed to STOP another data node in group 1. You can, however, RESTART a group 1 node.

Tested on versions 5.1.22 and 5.0.54a.

How to repeat:
shell# ndb_mgm -e show
Connected to Management Server at: localhost:1186
Cluster Configuration
---------------------
[ndbd(NDB)]     4 node(s)
id=2    @127.0.0.1  (Version: 5.1.22, Nodegroup: 0, Master)
id=3    @127.0.0.1  (Version: 5.1.22, Nodegroup: 0)
id=4    @127.0.0.1  (Version: 5.1.22, Nodegroup: 1)
id=5    @127.0.0.1  (Version: 5.1.22, Nodegroup: 1)

[ndb_mgmd(MGM)] 1 node(s)
id=1    @127.0.0.1  (Version: 5.1.22)

[mysqld(API)]   3 node(s)
id=20 (not connected, accepting connect from 10.0.1.20)
id=21 (not connected, accepting connect from 10.0.1.30)
id=22 (not connected, accepting connect from any host)

shell# ndb_mgm -e "2 restart -n"
Connected to Management Server at: localhost:1186
Node 2 is being restarted

shell# ndb_mgm -e show
Connected to Management Server at: localhost:1186
Cluster Configuration
---------------------
[ndbd(NDB)]     4 node(s)
id=2    @127.0.0.1  (Version: 5.1.22, not started)
id=3    @127.0.0.1  (Version: 5.1.22, Nodegroup: 0)
id=4    @127.0.0.1  (Version: 5.1.22, Nodegroup: 1)
id=5    @127.0.0.1  (Version: 5.1.22, Nodegroup: 1)

[ndb_mgmd(MGM)] 1 node(s)
id=1    @127.0.0.1  (Version: 5.1.22)

[mysqld(API)]   3 node(s)
id=20 (not connected, accepting connect from 10.0.1.20)
id=21 (not connected, accepting connect from 10.0.1.30)
id=22 (not connected, accepting connect from any host)

shell# ndb_mgm -e "4 stop"
Connected to Management Server at: localhost:1186
Shutdown failed.
*  2002: Stop failed
*        Operation not allowed while nodes are starting or stopping.: Permanent error: Application error

shell# ndb_mgm -e "4 restart"
Connected to Management Server at: localhost:1186
Shutting down nodes with "-n, no start" option, to subsequently start the nodes.
Node 4 is being restarted
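
When scripting the steps above, it helps to confirm that a node has actually reached the "not started" state before issuing the STOP. The following is a minimal sketch, not part of the original report: the function name not_started_nodes is hypothetical, and it assumes the `show` output keeps the line format seen in the transcript above.

```shell
#!/bin/sh
# Hypothetical helper (not part of ndb_mgm): reads `ndb_mgm -e show` output
# on stdin and prints the id of every node reported as "not started".
# Assumes lines of the form shown in the transcript:
#   id=2    @127.0.0.1  (Version: 5.1.22, not started)
not_started_nodes() {
    sed -n 's/^id=\([0-9][0-9]*\)[[:space:]].*not started).*$/\1/p'
}

# Example use against a live cluster (node ids 2 and 4 are the ones
# used in this report):
#   ndb_mgm -e "2 restart -n"
#   ndb_mgm -e show | not_started_nodes
#   ndb_mgm -e "4 stop"
```

The restart is asynchronous, so a script would probably need to poll `show` until the node id appears in the helper's output before trying the STOP.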

1. This is not a regression; it has been like this since 2005.
2. There is a workaround, which is to use "4 stop -a".
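
The workaround can be wrapped in a small script that only falls back to "stop -a" when the plain STOP is refused with the message shown in the transcript. This is a sketch under assumptions: stop_node and the NDB_MGM override variable are hypothetical names, and the failure is detected by matching the client's output text rather than its exit status, since the report does not show what exit status ndb_mgm returns. Note that -a aborts the node rather than shutting it down gracefully, so the fallback is deliberately limited to this one error.

```shell
#!/bin/sh
# NDB_MGM is a hypothetical override so the client binary can be swapped out.
NDB_MGM=${NDB_MGM:-ndb_mgm}

# stop_node NODE_ID
# Tries "<id> stop" first. If the management client reports the
# "starting or stopping" refusal seen in this bug, retries with
# "<id> stop -a" (the workaround from the comment above).
stop_node() {
    node_id=$1
    out=$("$NDB_MGM" -e "$node_id stop" 2>&1)
    printf '%s\n' "$out"
    case $out in
        *"Operation not allowed while nodes are starting or stopping"*)
            echo "plain stop refused; retrying with: $node_id stop -a" >&2
            "$NDB_MGM" -e "$node_id stop -a"
            ;;
    esac
}
```

For the cluster in this report the call would be `stop_node 4`.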

patch

Attachment: tmp.patch (text/x-patch), 3.14 KiB.

Patch:

1. Make sure you can stop when a node is in SL_CMVMI (addresses the bug as such).
2. This, however, increases the probability of hitting Bug #13461 (Slave Cluster crashed on restart of two data nodes in separate groups).
3. So code was added in restart node to "make sure" a node is not stopping while restarting, and to wait for any stopping nodes before starting them again.
4. Bug #13461 was also present in restart node, so that bugfix was added there as well.

Comments on patch:
1) Why don't you put the loop check into a (static) subroutine?
   (It's non-trivial and repeated in 3 places.)
2) Should you really retry *forever* (in start)?
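
Both review points (one shared subroutine for the loop check, and a bound on retries) can be illustrated with a generic sketch. This is not taken from tmp.patch, whose contents are not shown here; wait_until is a hypothetical helper written in shell purely to show the shape of a bounded, reusable wait loop.

```shell
#!/bin/sh
# wait_until MAX_TRIES DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds, at most MAX_TRIES times, sleeping
# DELAY_SECONDS after each failed attempt. Returns 0 on success and
# 1 if every attempt failed, so the caller never waits forever.
wait_until() {
    max_tries=$1
    delay=$2
    shift 2
    tries=0
    while [ "$tries" -lt "$max_tries" ]; do
        if "$@"; then
            return 0
        fi
        tries=$((tries + 1))
        sleep "$delay"
    done
    return 1
}
```

A caller would pass its own check, for example `wait_until 30 1 no_node_stopping`, where no_node_stopping is a hypothetical function that succeeds once no node is reported as stopping.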

Comment on triage: OK, present since 2005, which decreases impact to I4.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/43998

ChangeSet@1.2539, 2008-03-14 14:02:27+01:00, tomas@poseidon.ndb.mysql.com +2 -0
  Bug #34201 Unable stop a node when a node in a different group is in "not started" state

Documented in the 5.1.23-ndb-6.3.11 changelog as follows:

        If a data node in one node group was placed in the not started state
        (using node_id RESTART -n), it was not possible to stop a data node in
        a different node group.

Left in Patch Pending state pending further merges.

Also noted in the 5.1.23-ndb-6.2.15 changelog.

For MySQL Cluster NDB 6.2, the fix actually first appears in 6.2.16, not 6.2.15.

Pushed into 6.0.6-alpha (revid:sp1r-tomas@poseidon.ndb.mysql.com-20080314130227-25803) (version source revid:sp1r-tomas@poseidon.ndb.mysql.com-20080516085603-30848) (pib:5)