Bug #44427 | NDB_MGMD does not reconnect to cluster following configuration change | | 
---|---|---|---
Submitted: | 23 Apr 2009 10:50 | Modified: | 30 Apr 2009 7:29 |
Reporter: | Phil Bayfield | Email Updates: | |
Status: | Won't fix | Impact on me: | |
Category: | MySQL Cluster: Cluster (NDB) storage engine | Severity: | S2 (Serious) |
Version: | 6.3.24 | OS: | Linux |
Assigned to: | Magnus Blåudd | CPU Architecture: | Any |
Tags: | ndb_mgmd | | 
[23 Apr 2009 10:50]
Phil Bayfield
[28 Apr 2009 8:35]
Magnus Blåudd
One new feature in 7.0 is that the ndb_mgmd's always have the same configuration. Both of the two ndb_mgmd's need to be started in order to commit a new configuration version (you can see in their log files that they are "waiting for other nodes").
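For reference, a cluster with two management servers like the one being discussed here is described by two [ndb_mgmd] sections in config.ini. This is only an illustrative sketch; the node IDs and hostnames are assumptions, not taken from this report:

```ini
# Illustrative config.ini fragment -- node IDs and hostnames are assumed
[ndb_mgmd]
NodeId=1
HostName=mgm1.example.com

[ndb_mgmd]
NodeId=2
HostName=mgm2.example.com

# In 7.0, both of these management nodes must be up before a changed
# config.ini is committed as a new configuration version; until then
# each ndb_mgmd logs that it is "waiting for other nodes".
```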
[29 Apr 2009 13:42]
Phil Bayfield
All the other nodes were still running; however, the management servers reported otherwise. The NDBD and MYSQLD nodes continued to function as expected. Following reversal of the configuration change, the NDB_MGMD nodes reconnected to the other nodes and showed all nodes as connected.
[29 Apr 2009 17:51]
Magnus Blåudd
Ok, I see you are using 6.3.24, which does not have the new features of 7.0. Sorry! Have you restarted both management servers with the same config.ini? And all nodes one by one? If you remove nodes in a way that causes nodeid's to change, I think you can see this kind of problem; I recommend using fixed nodeid's, at least for the ndbd(s) and ndb_mgmd(s).
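To illustrate the fixed-nodeid recommendation, here is a hedged config.ini sketch; the hostnames and ID values are assumptions, only the NodeId/HostName parameters and the empty-slot behaviour are standard:

```ini
# Illustrative only -- hostnames and ID values are assumed
[ndbd]
NodeId=3                     # fixed id: this data node keeps id 3 across restarts
HostName=data1.example.com

[ndbd]
NodeId=4
HostName=data2.example.com

[mysqld]
NodeId=11                    # fixed id for a known SQL node
HostName=sql1.example.com

[mysqld]                     # empty slot: no NodeId/HostName, so any API client
                             # may take it and is assigned an id dynamically
```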
[29 Apr 2009 17:54]
Magnus Blåudd
I don't think any of my advice is good... the mysqld's in your config all have higher nodeid's than the ndbd and ndb_mgmd nodes. Have you checked the cluster log(s) from both ndb_mgmd's? Maybe they can give you a hint. Otherwise, upload the old and new config.ini and we can take a look. But it's not a big problem for you to keep running with too many mysqld(s), I hope?
[29 Apr 2009 18:16]
Phil Bayfield
Hi Magnus,

Basically what happened was I was looking at Johan Andersson's configurator (http://www.severalnines.com/config/index.php) to see what changes it came out with for the 7.0 series. I noticed the bit about connection pooling, had a read up on this, and found that it may improve performance. The configurator suggested using 5 separate API slots per MySQL server, so I modified my existing 6.3 config to that extent. I modified the configs, restarted the data nodes and finally the MySQL servers, and it all worked fine.

Then through some further reading in the manual I noticed it said we shouldn't have more connections than processors/processor cores, or it could considerably degrade performance. I then decreased the MySQL servers to 3 connections and restarted them (obviously no problem; they reconnected to the cluster with 3 connections instead of 5). I was attempting to apply the changes to the management nodes in a similar fashion: change the configs and restart both servers. (I realise there is no harm in having the empty slots there.)

After restarting the management nodes, my first reaction was that the cluster had crashed, as I've never seen this happen before; even with a config change the management nodes had always immediately reconnected to the other nodes. I checked the cluster logs and they simply showed the normal startup messages for the management node. I then checked my sites and they were still up, then checked the other nodes in the cluster and they were also still up. At this point I was somewhat confused as to what was going on; I went through all the logs and saw nothing, no error messages etc. I figured I would have to shut the cluster down but had no idea if it was even possible without the management server running! On the off chance, I reverted the config on both management servers back to their original settings and restarted them, and then everything reappeared as it should be, as it was prior to the config change.

I've since done a full restart of the cluster for another reason and made the config changes in the process, so I no longer have any issue. The main reason for reporting the bug was so that others could avoid possibly taking more drastic action unnecessarily.
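For anyone following the connection-pooling part of this, the change being described would look roughly like the following on each SQL node; the connect-string hosts are assumptions, while the option names are the standard NDB ones:

```ini
# my.cnf on each SQL node (illustrative; host names are assumed)
[mysqld]
ndbcluster
ndb-connectstring=mgm1.example.com,mgm2.example.com
# Each mysqld opens this many connections to the cluster, so config.ini
# must provide at least this many [mysqld]/[api] slots per SQL node.
ndb-cluster-connection-pool=3
```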
[30 Apr 2009 7:29]
Magnus Blåudd
Thanks for that nice explanation. I interpret it as: when you restarted the second ndb_mgmd, the problem disappeared. We could do some experiments, but I'd rather focus on 7.0, where we have added new functionality to make sure that both management servers always use the same configuration and reconfigure themselves without the need to restart. This concept will then be applied to the ndbapi/mysqld as well as the ndbd nodes; although not everything will be possible to reconfigure without a restart, at least the process should know that it's not running with the latest configuration.