MySQL Bugs: #40669: ndb_mgm can not restart ndb

Bug #40669	ndb_mgm can not restart ndb_mgmd node
Submitted:	12 Nov 2008 14:22	Modified:	12 Nov 2008 14:52
Reporter:	Wen Xiong
Status:	Verified
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S4 (Feature request)
Version:	mysql-5.1-telco-7.0	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
The ndb_mgm client can restart the data nodes but not the ndb_mgmd node.

A cluster with one ndb_mgmd node and two data nodes has been started as follows:

Connected to Management Server at: nanna14:16000
Cluster Configuration
---------------------
[ndbd(NDB)]     2 node(s)
id=2    @129.159.118.185  (mysql-5.1.29 ndb-6.4.0, Nodegroup: 0, Master)
id=3    @129.159.118.186  (mysql-5.1.29 ndb-6.4.0, Nodegroup: 0)

[ndb_mgmd(MGM)] 1 node(s)
id=1    @129.159.118.184  (mysql-5.1.29 ndb-6.4.0)

[mysqld(API)]   1 node(s)
id=4    @129.159.118.184  (mysql-5.1.29 ndb-6.4.0)

Because two more data nodes are being added, the ndb_mgmd node needs to be restarted. But the command
ndb_mgm> 1 restart
does not restart the ndb_mgmd node.

The following message is returned:
Restart failed.
*   145: Error
*        Time out talking to management server

The only alternative for restarting the ndb_mgmd node is to kill the ndb_mgmd process and start it again.

How to repeat:

The following are the config files that I use to start the cluster.

This is the my.cnf file.
[mysql_cluster]
ndb-connectstring=nodeid=1,nanna14:16000

[mysqld]
skip-innodb
ndbcluster
ndb-connectstring=nanna14:16000
socket= /export/home/tmp/wx228566/mysqld-soc
tmp_table_size=1G
max_heap_table_size=1G

This is the config.ini file.
[NDBD DEFAULT]
NoOfReplicas= 2
DataMemory= 1G
IndexMemory= 500M
BackupMemory= 200M
MaxNoOfConcurrentScans = 100
MaxNoOfSavedMessages = 1000
#SendBufferMemory = 2M
NoOfFragmentLogFiles = 32
FragmentLogFileSize = 64M
TimeBetweenLocalCheckpoints=20
CompressedLCP = 1
CompressedBackup = 1
ODirect =1

# Management node
[NDB_MGMD]
Id= 1
HostName= nanna14
PortNumber= 16000
DataDir= /export/home/tmp/wx228566/ndb_mgmd.1/

# Data node
[NDBD]
Id= 2
HostName= nanna15
DataDir= /export/home/tmp/wx228566/ndbd.1/

[NDBD]
Id= 3
HostName= nanna16
DataDir= /export/home/tmp/wx228566/ndbd.2/

[MYSQLD]
HostName= nanna14

After the cluster has been started, edit the config.ini file to add two more ndbd nodes, then try to restart the ndb_mgmd node with the following commands:
./bin/ndb_mgm --ndb-connectstring="nodeid=1,host=nanna14:16000"
ndb_mgm>1 restart

Then it fails.

The OS is Sun Solaris.
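
The config.ini edit described in the steps above could look roughly like the following sketch. The helper names (ndbd_section, append_data_nodes) and the example values in the usage note are hypothetical and are not taken from the report; only the Id/HostName/DataDir layout is copied from the existing [NDBD] sections.

```python
def ndbd_section(node_id, host, data_dir):
    """Render one [NDBD] section in the same layout as the existing ones."""
    return (
        "[NDBD]\n"
        f"Id= {node_id}\n"
        f"HostName= {host}\n"
        f"DataDir= {data_dir}\n"
    )


def append_data_nodes(config_text, nodes):
    """Return config_text with one [NDBD] section appended per (id, host, dir) tuple."""
    sections = [ndbd_section(*node) for node in nodes]
    # Keep exactly one blank line between the old content and each new section.
    return config_text.rstrip("\n") + "\n\n" + "\n".join(sections)
```

For example, append_data_nodes(text, [(5, "nanna17", "/export/home/tmp/wx228566/ndbd.3/"), (6, "nanna18", "/export/home/tmp/wx228566/ndbd.4/")]) would append two sections; the node ids, host names and directories here are made up for illustration.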

It "works" but you need to disconnect all clients.

msvensson@pilot:~/run$ ../install/6.4/bin/ndb_mgm 
-- NDB Cluster -- Management Client --
ndb_mgm> show
Connected to Management Server at: localhost:1186
Cluster Configuration
---------------------
[ndbd(NDB)]	1 node(s)
id=3 (not connected, accepting connect from localhost)

[ndb_mgmd(MGM)]	1 node(s)
id=2	@localhost  (mysql-5.1.29 ndb-6.4.0)

[mysqld(API)]	12 node(s)
id=10 (not connected, accepting connect from any host)
id=11 (not connected, accepting connect from any host)
id=12 (not connected, accepting connect from any host)
id=13 (not connected, accepting connect from any host)
id=14 (not connected, accepting connect from any host)
id=15 (not connected, accepting connect from any host)
id=16 (not connected, accepting connect from any host)
id=63 (not connected, accepting connect from any host)
id=127 (not connected, accepting connect from any host)
id=192 (not connected, accepting connect from any host)
id=228 (not connected, accepting connect from any host)
id=255 (not connected, accepting connect from any host)

ndb_mgm> 2 restart

The ndb_mgmd now "says":
asked to stop 2
which is me
Waiting for 2 not started

Then, when Ctrl-C is pressed in ndb_mgm, the server restarts:
2008-11-12 15:47:51 [MgmSrvr] INFO     -- Shutting down server...
2008-11-12 15:47:55 [MgmSrvr] INFO     -- Shutdown complete
2008-11-12 15:47:55 [MgmSrvr] INFO     -- Restarting server...

But if other mgmapi clients are connected (for example an NDB node or a mysqld/ndbapi node), it will continue to "hang".

To fix this we would need to actively abort all connected clients by closing their sockets. This should be possible, and is quite similar to how mysqld aborts connections after a SHUTDOWN command.
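
As a rough illustration of that idea (not the ndb_mgmd code, which is C++), here is a minimal Python sketch of a server that keeps track of its client sockets and shuts them down itself instead of waiting for the clients to disconnect. The class ToyManagementServer and all of its methods are hypothetical names made up for this example.

```python
import socket
import threading


class ToyManagementServer:
    """Hypothetical stand-in for a server that must not hang on connected clients."""

    def __init__(self):
        # Port 0 lets the OS pick a free port; the timeout lets the accept
        # loop notice a stop request instead of blocking forever.
        self._listener = socket.create_server(("127.0.0.1", 0))
        self._listener.settimeout(0.1)
        self.port = self._listener.getsockname()[1]
        self._clients = {}  # client socket -> handler thread
        self._lock = threading.Lock()
        self._stopping = threading.Event()
        self._acceptor = threading.Thread(target=self._accept_loop, daemon=True)
        self._acceptor.start()

    def _accept_loop(self):
        while not self._stopping.is_set():
            try:
                conn, _addr = self._listener.accept()
            except socket.timeout:
                continue  # no new client yet, re-check the stop flag
            except OSError:
                break
            handler = threading.Thread(target=self._serve, args=(conn,), daemon=True)
            with self._lock:
                self._clients[conn] = handler
            handler.start()
        self._listener.close()

    def _serve(self, conn):
        try:
            while conn.recv(4096):  # blocks until data, EOF, or a local shutdown()
                pass
        except OSError:
            pass
        finally:
            conn.close()
            with self._lock:
                self._clients.pop(conn, None)

    def client_count(self):
        with self._lock:
            return len(self._clients)

    def shutdown(self, timeout=2.0):
        """Stop accepting, abort every connected client, return how many were aborted."""
        self._stopping.set()
        self._acceptor.join(timeout)  # after this, no new clients are registered
        with self._lock:
            snapshot = list(self._clients.items())
        for conn, _handler in snapshot:
            try:
                conn.shutdown(socket.SHUT_RDWR)  # wakes the handler blocked in recv()
            except OSError:
                pass  # client already went away on its own
        for _conn, handler in snapshot:
            handler.join(timeout)
        return len(snapshot)
```

In this sketch a client that is still connected when shutdown() is called simply sees end-of-file on its next read, so the server never has to wait for the client to close first.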

START/RESTART commands have always worked only with data nodes, and this has long been documented.

Ref.
http://dev.mysql.com/doc/refman/4.1/en/mysql-cluster-mgm-client-commands.html
http://dev.mysql.com/doc/refman/5.0/en/mysql-cluster-mgm-client-commands.html
http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster-mgm-client-commands.html

Hence, this is a feature request rather than a bug, and I have changed the severity to match.

While this is quite possibly a "nice to have", I don't agree with the "workaround unacceptable" assessment, given that it has worked this way for years.