MySQL Bugs: #24105: ndb_mgm ALL RESTART -i command run with one node stopped stops all data nodes

Bug #24105	ndb_mgm ALL RESTART -i command run with one node stopped stops all data nodes
Submitted:	8 Nov 2006 21:43	Modified:	6 Dec 2006 10:37
Reporter:	Jim Dowling
Status:	Closed
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S2 (Serious)
Version:	5.1.12	OS:	Linux (Linux)
Assigned to:	Jonas Oreland	CPU Architecture:	Any
Tags:	all restart --initial, NDB_MGM, not started

Description:
When a single ndbd is in the "not started" state and
>ndb_mgm -e "all restart -i"
is executed, all data nodes go to the "not started" state.

ndb_mgm> show
Cluster Configuration
---------------------
[ndbd(NDB)]     4 node(s)
id=1    @127.0.0.1  (Version: 5.1.12, Nodegroup: 0, Master)
id=2    @127.0.0.1  (Version: 5.1.12, Nodegroup: 0)
id=3    @127.0.0.1  (Version: 5.1.12, Nodegroup: 1)
id=4    @127.0.0.1  (Version: 5.1.12, Nodegroup: 1)

[ndb_mgmd(MGM)] 2 node(s)
id=62   @127.0.0.1  (Version: 5.1.12)
id=63   @127.0.0.1  (Version: 5.1.12)

[mysqld(API)]   16 node(s)
id=46 (not connected, accepting connect from any host)
id=47 (not connected, accepting connect from any host)
id=48   @127.0.0.1  (Version: 5.1.12)
id=49 (not connected, accepting connect from any host)
id=50 (not connected, accepting connect from any host)
id=51 (not connected, accepting connect from any host)

ndb_mgm> 2 restart -n
asked to stop 2
Node 2: Node shutdown initiated

ndb_mgm> show
Cluster Configuration
---------------------
[ndbd(NDB)]     4 node(s)
id=1    @127.0.0.1  (Version: 5.1.12, Nodegroup: 0, Master)
id=2    @127.0.0.1  (Version: 5.1.12, not started)
id=3    @127.0.0.1  (Version: 5.1.12, Nodegroup: 1)
id=4    @127.0.0.1  (Version: 5.1.12, Nodegroup: 1)

[ndb_mgmd(MGM)] 2 node(s)
id=62   @127.0.0.1  (Version: 5.1.12)
id=63   @127.0.0.1  (Version: 5.1.12)

[mysqld(API)]   16 node(s)
id=46 (not connected, accepting connect from any host)
id=47 (not connected, accepting connect from any host)
id=48   @127.0.0.1  (Version: 5.1.12)
id=49 (not connected, accepting connect from any host)
id=50 (not connected, accepting connect from any host)
id=51 (not connected, accepting connect from any host)

ndb_mgm> all restart -i

ndb_mgm> show
Connected to Management Server at: localhost:23131
Cluster Configuration
---------------------
[ndbd(NDB)]     4 node(s)
id=1    @127.0.0.1  (Version: 5.1.12, not started)
id=2    @127.0.0.1  (Version: 5.1.12, not started)
id=3    @127.0.0.1  (Version: 5.1.12, not started)
id=4    @127.0.0.1  (Version: 5.1.12, not started)

[ndb_mgmd(MGM)] 2 node(s)
id=62   (Version: 5.1.12)
id=63 (not connected, accepting connect from localhost)

[mysqld(API)]   16 node(s)
id=46 (not connected, accepting connect from any host)
id=47 (not connected, accepting connect from any host)
id=48 (not connected, accepting connect from any host)
id=49 (not connected, accepting connect from any host)
id=50 (not connected, accepting connect from any host)
id=51 (not connected, accepting connect from any host)

How to repeat:
Set up a 4-node cluster and start it.

ndb_mgm> all status

ndb_mgm> 1 restart -n

ndb_mgm> all status

ndb_mgm> all restart -i

Now all nodes are in a "not started" state.

config.ini

[NDBD DEFAULT]
NoOfReplicas=2
DataMemory=80M  # Reduced to total 100M per replica
IndexMemory=20M
NoOfFragmentLogFiles=25
TimeBetweenLocalCheckpoints=6
MaxNoOfConcurrentOperations=12500
TransactionInactiveTimeout=30000        # 30 seconds of inactivity = rollback

[NDB_MGMD]
Hostname=localhost
nodeid=62
portnumber=23131
DataDir=/var/lib/mysql-cluster/dbmgmd1

[NDB_MGMD]
Hostname=localhost
nodeid=63
portnumber=23132
DataDir=/var/lib/mysql-cluster/dbmgmd2

[NDBD]
HostName=localhost
datadir=/var/lib/mysql-cluster/dbdata1
nodeid=1

[NDBD]
HostName=localhost
datadir=/var/lib/mysql-cluster/dbdata2
nodeid=2

[NDBD]
HostName=localhost
datadir=/var/lib/mysql-cluster/dbdata3
nodeid=3

[NDBD]
HostName=localhost
datadir=/var/lib/mysql-cluster/dbdata4
nodeid=4

# Auto-enumerated API node slots,
# Counting down from 61
#
[MYSQLD]
nodeid=61
[MYSQLD]
nodeid=60
[MYSQLD]
nodeid=59
[MYSQLD]
nodeid=58
[MYSQLD]
nodeid=57
[MYSQLD]
nodeid=56
[MYSQLD]
nodeid=55
[MYSQLD]
nodeid=54
[MYSQLD]
nodeid=53
[MYSQLD]
nodeid=52
[MYSQLD]
nodeid=51
[MYSQLD]
nodeid=50
[MYSQLD]

I could repeat this using the provided steps. I have attached the shell script I used to check it, together with the configuration I used. I verified it with 5.1.12 and 5.1.14 from bk.

When one node is restarted with the -n option and you then run ALL RESTART -i, all the other nodes also end up not started. You have to run ALL START after this procedure.
There is also the problem that ALL RESTART -i hangs for exactly 5 minutes and then times out with an "Illegal reply from server" error.
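
The recovery step described above (ALL START, then waiting for the nodes) could be scripted roughly as follows. This is only a sketch: the NDB_MGM and NDB_WAITER variables are hypothetical overrides introduced for this example, and it assumes ndb_mgm and ndb_waiter can reach the management server with their default connect string.

```shell
#!/bin/sh
# Sketch only: bring a cluster back after all data nodes ended up "not started".
# NDB_MGM and NDB_WAITER are hypothetical override variables (not part of MySQL
# Cluster); they default to the real client programs.
NDB_MGM=${NDB_MGM:-ndb_mgm}
NDB_WAITER=${NDB_WAITER:-ndb_waiter}

recover_cluster() {
    # Ask the management server to start every data node.
    $NDB_MGM -e "ALL START" || return 1
    # Wait until ndb_waiter reports the cluster as started (or gives up).
    $NDB_WAITER >/dev/null 2>&1 || return 1
    # Show the resulting node states.
    $NDB_MGM -e "ALL STATUS"
}
```

Calling recover_cluster after the failed ALL RESTART -i should leave the final ALL STATUS output for inspection; whether the nodes actually come up still has to be checked there.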

Without the RESTART -n on one node there is no problem, so I guess something goes wrong when these two commands are run one after the other. Avoiding that combination is a workaround.
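
To apply the workaround mechanically, one could refuse to issue ALL RESTART -i while any node reports "not started". The sketch below assumes the ALL STATUS line format shown in this report ("Node N: not started (...)"); the function names and the NDB_MGM override variable are made up for this example.

```shell
#!/bin/sh
# Sketch only: guard ALL RESTART -i against the situation in this bug.
# NDB_MGM is a hypothetical override variable; it defaults to ndb_mgm.
NDB_MGM=${NDB_MGM:-ndb_mgm}

# Read ALL STATUS output on stdin and print how many nodes are "not started".
count_not_started() {
    # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
    grep -c '^Node [0-9]*: not started' || true
}

# Run ALL RESTART -i only if no data node is currently "not started".
safe_restart_initial() {
    status=$($NDB_MGM -e "ALL STATUS") || return 1
    n=$(printf '%s\n' "$status" | count_not_started)
    if [ "$n" -gt 0 ]; then
        echo "refusing ALL RESTART -i: $n node(s) not started" >&2
        return 2
    fi
    $NDB_MGM -e "ALL RESTART -i"
}
```

The return code 2 is an arbitrary choice so that a caller can tell "refused" apart from a failure of ndb_mgm itself.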

Here is the output of the shell script:

$ ./bug_24105.sh 
shell> ndb_mgm -e "ALL STATUS"
Connected to Management Server at: localhost:1186
Node 3: started (Version 5.1.14)
Node 4: started (Version 5.1.14)
Node 5: started (Version 5.1.14)
Node 6: started (Version 5.1.14)

shell> ndb_mgm -e "3 RESTART -n"
Connected to Management Server at: localhost:1186
Node 3 is being restarted

shell> ndb_mgm -e "ALL STATUS"
Connected to Management Server at: localhost:1186
Node 3: not started (Version 5.1.14)
Node 4: started (Version 5.1.14)
Node 5: started (Version 5.1.14)
Node 6: started (Version 5.1.14)

shell> date
Thu Nov 23 14:49:29 CET 2006
shell> ndb_mgm -e "ALL RESTART -i"
Connected to Management Server at: localhost:1186
Executing RESTART on all nodes.
Starting shutdown. This may take a while. Please wait...
Restart failed.
*  1006: Illegal reply from server
*        
Trying to start all nodes of system.
Use ALL STATUS to see the system start-up phases.

shell> date
Thu Nov 23 14:54:29 CET 2006
shell> ndb_waiter 2>/dev/null 1>&2
shell> ndb_mgm -e "ALL STATUS"
Connected to Management Server at: localhost:1186
Node 3: not started (Version 5.1.14)
Node 4: not started (Version 5.1.14)
Node 5: not started (Version 5.1.14)
Node 6: not started (Version 5.1.14)

Script for repeating.

Attachment: bug_24105.sh (application/x-shellscript, text), 606 bytes.
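
The attachment itself is not reproduced on this page. A script along these lines would produce output in the same "shell> ..." style; it is a guess at the approach, not the attached bug_24105.sh. NDB_MGM and NODE_ID are hypothetical variables, and date +%s is assumed to be available (it is on Linux).

```shell
#!/bin/sh
# Hypothetical reconstruction, NOT the attached bug_24105.sh.
NDB_MGM=${NDB_MGM:-ndb_mgm}   # hypothetical override; defaults to ndb_mgm
NODE_ID=${NODE_ID:-3}         # data node to leave in "not started"

# Echo a management command in "shell> ..." style, then run it.
run_mgm() {
    echo "shell> ndb_mgm -e \"$1\""
    $NDB_MGM -e "$1"
    echo
}

repro_24105() {
    run_mgm "ALL STATUS"
    run_mgm "$NODE_ID RESTART -n"
    run_mgm "ALL STATUS"
    # Time the restart to make the 5-minute hang visible.
    start=$(date +%s)
    run_mgm "ALL RESTART -i"
    end=$(date +%s)
    echo "ALL RESTART -i returned after $((end - start)) seconds"
    run_mgm "ALL STATUS"
}
```

Run against an affected version, the elapsed time would be expected to be about 300 seconds, matching the two date outputs in the transcript above.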

Cluster configuration used for repeating case.

Attachment: 51bk-quattro-config.ini (application/octet-stream, text), 708 bytes.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/16100

ChangeSet@1.2332, 2006-11-29 13:10:14+01:00, jonas@perch.ndb.mysql.com +1 -0
  ndb - bug#24105
    Handle not started nodes correctly (for X restart)
    i.e dont wait for NF_COMPLETEREP
        but settle with NODEFAIL_REP

Thank you for your bug report. This issue has been committed to our source repository of that product and will be incorporated into the next release.

If necessary, you can access the source repository and build the latest available version, including the bug fix. More information about accessing the source trees is available at

    http://dev.mysql.com/doc/en/installing-source.html

Documented bugfix in 5.1.14 changelog

Reworded synopsis