MySQL Bugs: #42920: mgm server always connects to port 1186

Bug #42920	mgm server always connects to port 1186
Submitted:	17 Feb 2009 9:23	Modified:	4 Nov 2009 12:22
Reporter:	Lars Torstensson
Status:	Closed
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S3 (Non-critical)
Version:	mysql-5.1-telco-6.3	OS:	Linux
Assigned to:	Magnus Blåudd	CPU Architecture:	Any
Tags:	mysql-5.1-telco-6.3, mysql-5.1.31-ndb-6.3.22

Description:
1:
I start a mgm server for cluster X on host H using port 1186

2:
I start a mgm server for cluster Y on host H using port 12000

3: The mgm server for cluster Y will fail to start since it first connects to 1186 for some reason.

Error : Could not alloc node id at localhost port 1186: Id 1 already allocated by another node.

Is this a bug or a feature?

Workaround is to not have cluster X use port 1186

How to repeat:
See the description above.

Can you please provide config.ini for both your clusters and the command line options you use when starting the ndb_mgmd?

$ ndb_mgmd -f ./config1.ini --ndb-nodeid=3
$ ndb_mgmd -f ./config2.ini --ndb-nodeid=3
Error : Could not alloc node id at localhost port 1186: Id 3 already allocated by another node.

Minimal configuration for testing:
-bash-3.00$ cat config1.ini 
[NDBD DEFAULT]
NoOfReplicas: 2

[NDBD]
Id: 1
hostname=localhost

[NDBD]
Id: 2
hostname=localhost

[NDB_MGMD]
Id: 3
hostname: localhost
PortNumber: 1186

[NDB_MGMD]
Id: 4
hostname: localhost
PortNumber: 10186

[MYSQLD]
Id: 7
hostname: localhost
-bash-3.00$ cat config2.ini 
[NDBD DEFAULT]
NoOfReplicas: 2

[NDBD]
Id: 1
hostname=localhost

[NDBD]
Id: 2
hostname=localhost

[NDB_MGMD]
Id: 3
hostname: localhost
PortNumber: 12000

[NDB_MGMD]
Id: 4
hostname: localhost
PortNumber: 13000

[MYSQLD]
Id: 7
hostname: localhost

Testing the above config on 6.3 from bzr:

hillbilly% (cd 1; ndb_mgmd -f ../config1.ini --ndb-nodeid=3)
hillbilly% ndb_mgm -c localhost:1186 -e show                
Connected to Management Server at: localhost:1186
Cluster Configuration
---------------------
[ndbd(NDB)]	2 node(s)
id=1 (not connected, accepting connect from localhost)
id=2 (not connected, accepting connect from localhost)

[ndb_mgmd(MGM)]	2 node(s)
id=3	@localhost  (mysql-5.1.32 ndb-6.3.24)
id=4 (not connected, accepting connect from localhost)

[mysqld(API)]	1 node(s)
id=7 (not connected, accepting connect from localhost)

hillbilly% (cd 2; ndb_mgmd -f ../config2.ini --ndb-nodeid=3)
Error : Could not alloc node id at localhost port 1186: Id 3 already allocated by another node.

So bug verified as described.

Let's do some more testing.

hillbilly% (cd 2; ndb_mgmd -f ../config2.ini --ndb-nodeid=4)
hillbilly% ndb_mgm -c localhost:13000 -e show               
Unable to connect with connect string: nodeid=0,localhost:13000
Retrying every 5 seconds. Attempts left: 2 1, failed.

So no mgmd on port 13000...

hillbilly% ndb_mgm -c localhost:1186 -e show                
Connected to Management Server at: localhost:1186
Cluster Configuration
---------------------
[ndbd(NDB)]	2 node(s)
id=1 (not connected, accepting connect from localhost)
id=2 (not connected, accepting connect from localhost)

[ndb_mgmd(MGM)]	2 node(s)
id=3	@localhost  (mysql-5.1.32 ndb-6.3.24)
id=4 (not connected, accepting connect from localhost)

[mysqld(API)]	1 node(s)
id=7 (not connected, accepting connect from localhost)

And it hasn't ended up on cluster 1 either.

hillbilly% cat 2/ndb_4_out.log 
NDB Cluster Management Server. mysql-5.1.32 ndb-6.3.24-GA
Id: 4, Command port: *:10186

But it's got the port number that's supposed to be used for mgmd with
id 4 on cluster 1.

hillbilly% ndb_mgm -c localhost:10186 -e show
Connected to Management Server at: localhost:10186
Cluster Configuration
---------------------
[ndbd(NDB)]	2 node(s)
id=1 (not connected, accepting connect from localhost)
id=2 (not connected, accepting connect from localhost)

[ndb_mgmd(MGM)]	2 node(s)
id=3 (not connected, accepting connect from localhost)
id=4	@localhost  (mysql-5.1.32 ndb-6.3.24)

[mysqld(API)]	1 node(s)
id=7 (not connected, accepting connect from localhost)

But it is part of cluster 2.

This is confusing...

hillbilly% bzr version-info 
revision-id: jonas@mysql.com-20090316151554-pm72q8jn4awmwpse
date: 2009-03-16 16:15:54 +0100
build-date: 2009-03-24 15:37:27 +0100
revno: 2913
branch-nick: cluster-6.3

More testing, with different nodeid for the two mgmd for the second cluster.

hillbilly% (cd 1; ndb_mgmd -f ../config1.ini --ndb-nodeid=3)
hillbilly% cat config3.ini
[NDBD DEFAULT]
NoOfReplicas: 2

[NDBD]
Id: 1
hostname=localhost

[NDBD]
Id: 2
hostname=localhost

[NDB_MGMD]
Id: 5
hostname: localhost
PortNumber: 12000

[NDB_MGMD]
Id: 6
hostname: localhost
PortNumber: 13000

[MYSQLD]
Id: 7
hostname: localhost
hillbilly% (cd 3; ndb_mgmd -f ../config3.ini --ndb-nodeid=5)
Error : Could not alloc node id at localhost port 1186: No node defined with id=5 in config file.

So it doesn't read its own configuration but appears to prefer whatever mgmd it can find on port 1186.

From storage/src/mgmapi/mgmapi.cpp:   [see line marked ==>]

extern "C"
int
ndb_mgm_alloc_nodeid(NdbMgmHandle handle, unsigned int version, int nodetype,
                     int log_event)
{
[snip]
  const ParserRow<ParserDummy> reply[]= {
    MGM_CMD("get nodeid reply", NULL, ""),
      MGM_ARG("error_code", Int, Optional, "Error code"),
      MGM_ARG("nodeid", Int, Optional, "Error message"),
      MGM_ARG("result", String, Mandatory, "Error message"),
    MGM_END()
  };
  
  const Properties *prop;
  prop= ndb_mgm_call(handle, reply, "get nodeid", &args);
  CHECK_REPLY(handle, prop, -1);

  nodeid= -1;
  do {
    const char * buf;
    if (!prop->get("result", &buf) || strcmp(buf, "Ok") != 0)
    {
      const char *hostname= ndb_mgm_get_connected_host(handle);
==>   unsigned port=  ndb_mgm_get_connected_port(handle);
      BaseString err;
      Uint32 error_code= NDB_MGM_ALLOCID_ERROR;
      err.assfmt("Could not alloc node id at %s port %d: %s",
		 hostname, port, buf);
      prop->get("error_code", &error_code);
      setError(handle, error_code, __LINE__, err.c_str());
      break;
    }
    Uint32 _nodeid;
    if(!prop->get("nodeid", &_nodeid) != 0){
      fprintf(handle->errstream, "ERROR Message: <nodeid Unspecified>\n");
      break;
    }
    nodeid= _nodeid;
  }while(0);

  delete prop;
  return nodeid;
}

The call to 'ndb_mgm_get_connected_port' simply returns the port number to which the handle is already connected (if any). It does not open a connection to a remote host itself.

Gustaf, Lars: Could you both check whether you have a my.cnf installed in some default location? 

"Default options are read from the following files in the given order:
/etc/my.cnf /etc/mysql/my.cnf /usr/local/mysqlcluster-63/build/install/etc/my.cnf ~/.my.cnf"

For instance on my kubuntu I have:

$ grep "connect-string" /etc/mysql/my.cnf
/etc/mysql/my.cnf:connect-string=localhost
/etc/mysql/my.cnf:connect-string=localhost

Note that ndb_mgmd will read this file on startup, since you didn't specify --defaults-file= on the command line. See ndb_mgmd --help for details. Because connect-string=localhost gives no port, the default port 1186 applies. Since all your ndb_mgmd processes read the same my.cnf file, they will always connect to localhost:1186 for configuration, whether they belong to cluster X or cluster Y.
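To make that check repeatable, here is a minimal sketch that scans a list of option files for connect-string settings. The function name scan_connect_strings is made up for illustration, and the paths passed at the end are only the generic default locations quoted above; the list your build actually reads may differ (ndb_mgmd --help prints it).

```shell
#!/bin/sh
# Hypothetical helper: print every connect-string style setting found in
# the option files given as arguments. Missing files are skipped.
scan_connect_strings() {
    for f in "$@"; do
        [ -f "$f" ] || continue
        # Passing /dev/null as a second file makes grep prefix each match
        # with the file name, like the grep output quoted above.
        grep -E '^[[:space:]]*(ndb[-_])?connect[-_]?string' "$f" /dev/null || true
    done
    return 0
}

# Assumed default locations; adjust to what your build reports.
scan_connect_strings /etc/my.cnf /etc/mysql/my.cnf "$HOME/.my.cnf"
```

Any line this prints is a candidate for sending every ndb_mgmd on the host to the same management server.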

When playing around with multiple MySQL installations, it is usually a good idea to remove the standard rpm or deb package that came with the distribution; this is a really common error that I hit quite often myself.

Please confirm if this is indeed the case.

Further info

On the machines where the behaviour was detected, could you try

$ ./my_print_defaults mysqld --debug
| | | my: path: '/usr/local/mysql/etc/my.cnf'  stat_area: 0x7fffc92c6040  MyFlags: 0
| | | error: Got errno: 2 from stat

...and look for the stat results for every location where my.cnf is tried. (The trace output may become quite large; use Ctrl+F or similar in an editor to find them.)
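As an alternative to searching by hand, the following sketch pulls the probed paths out of a saved trace. It assumes the trace has been saved to a file and that it uses the two-line shape shown above (a "my: path: '...'" line, followed by an "error: Got errno:" line when stat fails). The function name list_probed_paths and the labels "stat-failed" and "no-stat-error" are invented for this example.

```shell
#!/bin/sh
# Hypothetical helper: list every path probed in a debug trace file ($1).
# A path is marked stat-failed when a stat error is reported before the
# next path line, otherwise no-stat-error.
list_probed_paths() {
    awk '
        /my: path:/ {
            if (path != "") print path, "no-stat-error"
            path = ""
            # \047 is the single quote character; take the text between quotes.
            if (match($0, /\047[^\047]*\047/))
                path = substr($0, RSTART + 1, RLENGTH - 2)
            next
        }
        /error: Got errno:/ && path != "" {
            print path, "stat-failed"
            path = ""
        }
        END { if (path != "") print path, "no-stat-error" }
    ' "$1"
}

# Usage (the trace file name is a placeholder):
#   list_probed_paths ./my_print_defaults.trace
```

A my.cnf path reported without a stat error is one the tools may have read.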

this behaviour is 'fixed' in -7.0  (but i haven't found out what code fixes it -- maybe the new bin config code?)

do we need a patch for 6.3?

@jack: Did you verify that the bug exists on 6.3 and then that it doesn't exist when upgrading to 7.0? If you don't have a my.cnf installed in some locations, then the problem simply doesn't happen and this is regardless of version.

> @jack: Did you verify that the bug exists on 6.3 and then that 
> it doesn't exist when upgrading to 7.0? If you don't have a 
> my.cnf installed in some locations, then the problem simply 
> doesn't happen and this is regardless of version.

i built 7.0 and 6.3.  i ran in 7.0 and the bug didn't occur.
i ran in 6.3 and the problem occurred.  i don't think i have
a my.cnf file -- at least, not one that i created.

Checked the code, and unfortunately the ndb_mgmd in 6.3 falls back to the built-in default port if neither NDB_CONNECTSTRING nor --ndb-connectstring is set.
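The fallback can be pictured with a small model. This is a simplified sketch, not the actual 6.3 code: the function name effective_connectstring is hypothetical, the order in which the option and the environment variable are consulted is an assumption, and other sources (such as option files) are left out.

```shell
#!/bin/sh
# Simplified model (assumption, not the real implementation) of which
# management server a 6.3 ndb_mgmd contacts at startup.
# $1 is the value given to --ndb-connectstring, or empty if not given.
effective_connectstring() {
    cli_value=$1
    if [ -n "$cli_value" ]; then
        printf '%s\n' "$cli_value"
    elif [ -n "${NDB_CONNECTSTRING:-}" ]; then
        printf '%s\n' "$NDB_CONNECTSTRING"
    else
        # Built-in default: this is why cluster Y ends up at port 1186.
        printf '%s\n' "localhost:1186"
    fi
}
```

With neither source set, the model returns localhost:1186 regardless of the PortNumber in the config.ini passed with -f, which matches the error messages in this report.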

That means the best workaround for this bug is to set NDB_CONNECTSTRING=ownhost:ownport and then start the ndb_mgmd; it should then only try to connect to its own port.
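To avoid typing the host and port by hand, a sketch like the one below can derive them from the config.ini that is passed with -f. The helper name own_connectstring is made up for illustration. It only understands the simple layout used in the configs in this report (one [NDB_MGMD] section per management node, with Id, hostname and PortNumber written as "Name: value" or "Name=value") and does not handle [NDB_MGMD DEFAULT] values or IPv6 addresses.

```shell
#!/bin/sh
# Hypothetical helper: print "host:port" for the [NDB_MGMD] section whose
# Id matches $2 in the config file $1. Prints nothing if there is no match.
own_connectstring() {
    awk -v want="$2" '
        # Print the section just finished if it is the node we want.
        function emit() {
            if (in_mgmd && id == want && host != "" && port != "")
                print host ":" port
        }
        /^[[:space:]]*\[/ {
            emit()
            in_mgmd = (toupper($0) ~ /\[NDB_MGMD\]/)
            id = ""; host = ""; port = ""
            next
        }
        in_mgmd {
            # Parameters may use either ":" or "=" as the separator.
            split($0, kv, /[:=]/)
            key = tolower(kv[1]); val = kv[2]
            gsub(/[[:space:]]/, "", key); gsub(/[[:space:]]/, "", val)
            if (key == "id" || key == "nodeid") id = val
            else if (key == "hostname") host = val
            else if (key == "portnumber") port = val
        }
        END { emit() }
    ' "$1"
}
```

It could then be combined with the workaround, for example: NDB_CONNECTSTRING="$(own_connectstring ./config2.ini 3)" ndb_mgmd -f ./config2.ini --ndb-nodeid=3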

But of course, the ndb_mgmd should not need to connect to any other ndb_mgmd at all this early in the startup phase.

The problem has been fixed in 7.0 by quite a large rewrite of ndb_mgmd startup; I can't really think of a small fix for the 6.3 ndb_mgmd.

Fixed in 7.0 by refactoring. Will not fix in 6.3.