MySQL Bugs: #4761: management server does not see other cluster nodes

Bug #4761	management server does not see other cluster nodes
Submitted:	26 Jul 2004 21:24	Modified:	26 Aug 2004 11:51
Reporter:	Justin Swanhart
Status:	Closed
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S1 (Critical)
Version:	mysql-4.1.4-beta (Source distribution)	OS:	Linux (RHAS 3)
Assigned to:	Jonas Oreland	CPU Architecture:	Any

Description:
I have a four-machine cluster.  There is one management node, four database nodes and four API nodes.  Each API node is a mysqld server.

The management node doesn't "see" any of the other nodes.  It always shows them as "not connected".  Because of this I cannot gracefully shut down nodes, and I am unable to back up the database or change logging settings.

When I run the management server it generates the
following output:

[root@george mgmt]# ./ndb_mgmd -c config.ini
NDB Cluster Management Server. Version 3.5.0 (beta)
Command port: 2200, Statistics port: 0
NDB> reportError (2, 32774)
reportError (3, 32774)
reportError (4, 32774)
reportError (5, 32774)
reportError (2, 32774)
[repeats forever]

I can start the database nodes on all four machines,
and they start up properly and create an initial
database (ndb -i)

When I connect to the management server with ndb_mgm
and do a "SHOW" I get this:

[root@george mgmt]# ndb_mgm
-- NDB Cluster -- Management Client --
Connecting to Management Server: localhost:2200
NDB> show
Cluster Configuration
---------------------
4 NDB Node(s)
DB node:        2  (not connected)
DB node:        3  (not connected)
DB node:        4  (not connected)
DB node:        5  (not connected)

4 API Node(s)
API node:       6  (not connected)
API node:       7  (not connected)
API node:       8  (not connected)
API node:       9  (not connected)

1 MGM Node(s)
MGM node:       1  (Version: 3.5.0)

NDB>
---------------------------------

However, as I said, the database nodes are working and
my mysql processes can use them.  I created
test_ndb_table through the mysql server on ringo:

[root@george mgmt]# ndb_show_tables|grep UserTable
id    type                 state    logging database    schema   name
4     UserTable         Online   Yes      test           def         test_ndb_table

Here is my config.ini file:
[root@george mgmt]# cat config.ini
[COMPUTER]
Id: 1
ByteOrder: Little
HostName: ringo.db.xxxx.com

[COMPUTER]
Id: 2
ByteOrder: Little
HostName: george.db.xxxx.com

[COMPUTER]
Id: 3
ByteOrder: Little
HostName: paul.db.xxxx.com

[COMPUTER]
Id: 4
ByteOrder: Little
HostName: john.db.xxxx.com

[MGM]
Id: 1
ExecuteOnComputer: 1
PortNumber: 2200
PortNumberStats: 2201
ArbitrationRank: 1

[DB DEFAULT]
NoOfReplicas: 2
LockPagesInMainMemory: N
StopOnError: Y
MaxNoOfConcurrentOperations: 16384
MaxNoOfConcurrentTransactions: 1024
IndexMemory: 256M
DataMemory: 2G
TimeBetweenLocalCheckpoints: 20
TimeBetweenGlobalCheckpoints: 1500
NoOfFragmentLogFiles: 8
BackupMemory: 16M
BackupDataBufferSize: 4M
BackupLogBufferSize: 4M
BackupWriteSize: 32k

[DB]
Id: 2
ExecuteOnComputer: 1
FileSystemPath: /usr/local/mysql/ndb_nodes/data

[DB]
Id: 3
ExecuteOnComputer: 2
FileSystemPath: /usr/local/mysql/ndb_nodes/data

[DB]
Id: 4
ExecuteOnComputer: 3
FileSystemPath: /usr/local/mysql/ndb_nodes/data

[DB]
Id: 5
ExecuteOnComputer: 4
FileSystemPath: /usr/local/mysql/ndb_nodes/data

[API]
Id: 6
ExecuteOnComputer: 1

[API]
Id: 7
ExecuteOnComputer: 2

[API]
Id: 8
ExecuteOnComputer: 3

[API]
Id: 9
ExecuteOnComputer: 4

[TCP DEFAULT]
PortNumber: 10002

My Ndb.cfg is configured as follows (this is the data node on george)
[root@george data]# cat Ndb.cfg
nodeid=3
host=140.99.99.114:2200

How to repeat:
I am unsure if others are having this problem.  If they are it should be easy to duplicate.  If it is something in my environment, then I am unsure how to recreate it outside of here.

The machines are configured with "real" IP addresses, and they are connected via gigabit ethernet.  There are two switches, and two machines are connected to each switch.

The hostnames are both in DNS and in /etc/hosts

Suggested fix:
None at this time.

The 32774 error indicates that the mgm server fails to bind its socket.
This likely means that the ports are already in use.
Check with "netstat -a" while the NDB cluster is not started.
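
As a complement to reading the netstat output by eye, a bind test can show whether a given port is actually available. The following is a minimal sketch, not part of MySQL Cluster: the helper name port_is_free is hypothetical, and the ports 2200 and 2201 are simply the PortNumber and PortNumberStats values from the config.ini above.

```python
import socket


def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can be bound to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically EADDRINUSE when something else already holds the port.
            return False
    return True


if __name__ == "__main__":
    # Assumption: these are the management ports configured in config.ini.
    for port in (2200, 2201):
        state = "free" if port_is_free(port) else "in use or not bindable"
        print(port, state)
```

Run it on the management host before starting ndb_mgmd. A port reported as free here rules out the "port already busy" explanation for that port only; it says nothing about which address the server later tries to bind.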

It looks like the ndb servers aren't using the ports based on the [TCP DEFAULT] section...

[root@george ndb_nodes]# lsof|grep ndbd|grep TCP
ndbd      19861    root    4u  IPv4    1223458                   TCP george.db.eldosales.com:2206->john.db.eldosales.com:49001 (ESTABLISHED)
ndbd      19861    root    5u  IPv4    1228037                   TCP george.db.eldosales.com:50905->ringo.db.eldosales.com:2202 (ESTABLISHED)
ndbd      19861    root    6u  IPv4    1223273                   TCP george.db.eldosales.com:2205->paul.db.eldosales.com:49198 (ESTABLISHED)
ndbd      19861    root   18u  IPv4    2889699                   TCP george.db.eldosales.com:2213 (LISTEN)
ndbd      19861    root   19u  IPv4    1225543                   TCP george.db.eldosales.com:2221 (LISTEN)
ndbd      19861    root   20u  IPv4    1225544                   TCP george.db.eldosales.com:2225 (LISTEN)
ndbd      19861    root   23u  IPv4    2888399                   TCP george.db.eldosales.com:2217->george.db.eldosales.com:58188 (ESTABLISHED)

Here is the "netstat -a" output (before running the management server).
As you can see, nothing is listening on port 2200, which is the port I have configured for the management server in the config file.

[root@george ndb_nodes]# netstat -a
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 *:32768                 *:*                     LISTEN
tcp        0      0 *:nfs                   *:*                     LISTEN
tcp        0      0 *:833                   *:*                     LISTEN
tcp        0      0 localhost:32769         *:*                     LISTEN
tcp        0      0 *:shell                 *:*                     LISTEN
tcp        0      0 *:32772                 *:*                     LISTEN
tcp        0      0 george.db.eldosale:2213 *:*                     LISTEN
tcp        0      0 *:mysql                 *:*                     LISTEN
tcp        0      0 george.db.eldosale:2221 *:*                     LISTEN
tcp        0      0 *:sunrpc                *:*                     LISTEN
tcp        0      0 george.db.eldosale:2225 *:*                     LISTEN
tcp        0      0 *:851                   *:*                     LISTEN
tcp        0      0 *:ftp                   *:*                     LISTEN
tcp        0      0 *:ssh                   *:*                     LISTEN
tcp        0      0 localhost:ipp           *:*                     LISTEN
tcp        0      0 *:telnet                *:*                     LISTEN
tcp        0      0 localhost:smtp          *:*                     LISTEN
tcp        0      0 george.db.eldosale:2217 george.db.eldosal:58188 ESTABLISHED
tcp        0      0 george.db.eldosal:58188 george.db.eldosale:2217 ESTABLISHED
tcp        0      0 george.db.eldosale:2205 paul.db.eldosales:49198 ESTABLISHED
tcp        0      0 george.db.eldosal:58190 john.db.eldosales.:2219 ESTABLISHED
tcp        0      0 george.db.eldosal:58189 paul.db.eldosales.:2218 ESTABLISHED
tcp        0      0 george.db.eldosale:2206 john.db.eldosales:49001 ESTABLISHED
tcp        0      0 george.db.eldosal:58191 ringo.db.eldosales:2216 ESTABLISHED
tcp        0      0 george.db.eldosal:50905 ringo.db.eldosales:2202 ESTABLISHED
tcp        0      0 george.db.eldosales:ssh eldo-puter-219.eld:3538 ESTABLISHED
tcp        0      0 george.db.eldosales:ssh eldo-puter-219.eld:3544 ESTABLISHED
tcp        0      0 george.db.eldosal:58383 ringo.db.eldosales.:ssh ESTABLISHED
udp        0      0 *:1024                  *:*
udp      424      0 *:nfs                   *:*
udp        0      0 *:1026                  *:*
udp        0      0 *:830                   *:*
udp        0      0 *:715                   *:*
udp        0      0 *:848                   *:*
udp        0      0 *:sunrpc                *:*
udp        0      0 *:ipp                   *:*
Active UNIX domain sockets (servers and established)
Proto RefCnt Flags       Type       State         I-Node Path
unix  2      [ ACC ]     STREAM     LISTENING     2367   /dev/gpmctl
unix  10     [ ]         DGRAM                    1818   /dev/log
unix  2      [ ACC ]     STREAM     LISTENING     2422   /tmp/.font-unix/fs7100
unix  2      [ ACC ]     STREAM     LISTENING     2888376 /tmp/mysql.sock
unix  3      [ ]         STREAM     CONNECTED     2885698
unix  3      [ ]         STREAM     CONNECTED     2885697
unix  2      [ ]         DGRAM                    3666
unix  2      [ ]         DGRAM                    2456
unix  2      [ ]         DGRAM                    2385
unix  2      [ ]         DGRAM                    2350
unix  2      [ ]         DGRAM                    2336
unix  2      [ ]         DGRAM                    2190
unix  2      [ ]         DGRAM                    1894
unix  2      [ ]         DGRAM                    1826

I shut everything down on the box except for ssh and other necessary services.

[root@george bin]# netstat -a
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 *:ssh                   *:*                     LISTEN
tcp        0      0 localhost:ipp           *:*                     LISTEN
tcp        0      0 localhost:smtp          *:*                     LISTEN
tcp        0      0 george.db.eldosales:ssh eldo-puter-219.eld:3538 ESTABLISHED
tcp        0    232 george.db.eldosales:ssh eldo-puter-219.eld:3544 ESTABLISHED
tcp        0      0 george.db.eldosal:58383 ringo.db.eldosales.:ssh ESTABLISHED
udp        0      0 *:ipp                   *:*
Active UNIX domain sockets (servers and established)
Proto RefCnt Flags       Type       State         I-Node Path
unix  2      [ ACC ]     STREAM     LISTENING     2367   /dev/gpmctl
unix  7      [ ]         DGRAM                    1818   /dev/log
unix  2      [ ACC ]     STREAM     LISTENING     2422   /tmp/.font-unix/fs7100
unix  3      [ ]         STREAM     CONNECTED     2885698
unix  3      [ ]         STREAM     CONNECTED     2885697
unix  2      [ ]         DGRAM                    2456
unix  2      [ ]         DGRAM                    2385
unix  2      [ ]         DGRAM                    2350
unix  2      [ ]         DGRAM                    2336
unix  2      [ ]         DGRAM                    1826

After starting the server, I verify that it is listening on port 2200:
[root@george mgmt]# ./ndb_mgmd -c config.ini -d
[root@george mgmt]# netstat -a
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 *:61515                 *:*                     LISTEN
tcp        0      0 *:ssh                   *:*                     LISTEN
tcp        0      0 localhost:ipp           *:*                     LISTEN
tcp        0      0 *:2200                  *:*                     LISTEN
...

Notice that it also appears to be listening on 61515,
and it isn't listening on 2201, which I would expect it to be for the stats port.

I go ahead and start up ndbd then do another netstat

[root@george data]# netstat -a
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 george.db.eldosal:10017 *:*                     LISTEN
tcp        0      0 george.db.eldosal:10021 *:*                     LISTEN
tcp        0      0 george.db.eldosal:10025 *:*                     LISTEN
tcp        0      0 *:61517                 *:*                     LISTEN
tcp        0      0 george.db.eldosal:10005 *:*                     LISTEN
tcp        0      0 *:ssh                   *:*                     LISTEN
tcp        0      0 localhost:ipp           *:*                     LISTEN
tcp        0      0 *:2200                  *:*                     LISTEN
tcp        0      0 localhost:smtp          *:*                     LISTEN
tcp        0      0 george.db.eldosal:10013 *:*                     LISTEN
tcp        0      0 george.db.eldosal:10006 john.db.eldosales:62890 ESTABLISHED
tcp        0    180 george.db.eldosales:ssh eldo-puter-219.eld:3544 ESTABLISHED
tcp        0      0 george.db.eldosal:61896 ringo.db.eldosale:10002 ESTABLISHED
udp        0      0 *:ipp                   *:*
Active UNIX domain sockets (servers and established)
Proto RefCnt Flags       Type       State         I-Node Path
unix  2      [ ACC ]     STREAM     LISTENING     2367   /dev/gpmctl
unix  7      [ ]         DGRAM                    1818   /dev/log
unix  2      [ ACC ]     STREAM     LISTENING     2422   /tmp/.font-unix/fs7100
unix  2      [ ]         DGRAM                    2456
unix  2      [ ]         DGRAM                    2385
unix  2      [ ]         DGRAM                    2350
unix  2      [ ]         DGRAM                    2336
unix  2      [ ]         DGRAM                    1826

Then I pull up the mgm server to see if it sees the nodes...

[root@george bin]# ./ndb_mgm
-- NDB Cluster -- Management Client --
Connecting to Management Server: localhost:2200
NDB> show
Cluster Configuration
---------------------
4 NDB Node(s)
DB node:        2  (not connected)
DB node:        3  (not connected)
DB node:        4  (not connected)
DB node:        5  (not connected)

4 API Node(s)
API node:       6  (not connected)
API node:       7  (not connected)
API node:       8  (not connected)
API node:       9  (not connected)

1 MGM Node(s)
MGM node:       1  (Version: 3.5.0)

NDB>

no dice

Do you still get the "report Error" when you "shut everything down"?

yeah.  I get the report error when I start the management daemon, even when no other nodes are running.

[mysql@george mgmt]$ head node1.out
NDB Cluster Management Server. Version 3.5.0 (beta)
Command port: 2200, Statistics port: 0

[mysql@george mgmt]$ tail node1.out
reportError (5, 32774)
reportError (2, 32774)
reportError (3, 32774)
reportError (4, 32774)
reportError (5, 32774)
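
Because node1.out repeats the same lines indefinitely, counting them can make the pattern easier to see. This is only an illustrative sketch: the function name tally_report_errors is hypothetical, and reading the first field as a node id is an assumption based on the values 2 to 5 matching the [DB] Ids in config.ini.

```python
import re
from collections import Counter

# Matches lines such as "reportError (2, 32774)", with or without an "NDB> " prefix.
REPORT_ERROR = re.compile(r"reportError \((\d+), (\d+)\)")


def tally_report_errors(log_text):
    """Count occurrences of each (first_field, error_code) pair in the log."""
    pairs = REPORT_ERROR.findall(log_text)
    return Counter((int(first), int(code)) for first, code in pairs)


if __name__ == "__main__":
    sample = "NDB> reportError (2, 32774)\nreportError (3, 32774)\n"
    for (first, code), count in sorted(tally_report_errors(sample).items()):
        print(first, code, count)
```

If every data node id shows up with the same code at roughly the same count, the failure is probably not specific to one node.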

Could the errors be because the server really is trying to use port 0 for the stats port?

Justin, is this bug still active?

Yes.  My mgm node still can't see the data nodes.

Here is a snippet from an strace that I did to see what was going on:

[pid 21308] bind(9, {sa_family=AF_INET, sin_port=htons(10011), sin_addr=inet_addr("140.99.99.117")}, 16 <unfinished ...>
[pid 21308] <... bind resumed> )        = -1 EADDRNOTAVAIL (Cannot assign requested address)

It can't bind to 140.99.99.117 because george (the box that I ran the ndb server on) has an IP address of 140.99.99.114.

I'm not sure where it is getting the .117 address from.  I compiled on the machine with the .117 address (ringo) but I can't think of any reason that would cause the problem.
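
The EADDRNOTAVAIL result in the strace can be reproduced outside of NDB: bind() fails this way when the requested address is not assigned to any local interface. The sketch below is a hedged illustration with a hypothetical helper name, try_bind. The address 140.99.99.117 is taken from the strace above, and the behaviour described assumes a default Linux configuration.

```python
import errno
import socket


def try_bind(address, port=0):
    """Try to bind a TCP socket; return None on success or the errno name on failure."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            # Port 0 asks the kernel for any free port, so only the address matters.
            sock.bind((address, port))
        except OSError as exc:
            return errno.errorcode.get(exc.errno, str(exc.errno))
    return None


if __name__ == "__main__":
    # Assumption: the second address is not configured on the machine running this.
    for address in ("127.0.0.1", "140.99.99.117"):
        print(address, try_bind(address) or "ok")
```

Running this on george with each address that appears in the cluster configuration shows which ones the host can actually bind, which helps narrow down where the .117 address is coming from.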

I am going to "bk pull" and see if the problem exists in the newest build since I know some other changes have been made to the management server.

I pulled the latest sources on 8/6/04

I no longer get the reportError messages, however, the management server still has no contact from the data nodes:

[mysql@ringo data]$ ndb_mgm
-- NDB Cluster -- Management Client --
Connecting to Management Server: george:2200
NDB> show
Cluster Configuration
---------------------
4 NDB Node(s)
DB node:        2  (not connected)
DB node:        3  (not connected)
DB node:        4  (not connected)
DB node:        5  (not connected)

4 API Node(s)
API node:       6  (not connected)
API node:       7  (not connected)
API node:       8  (not connected)
API node:       9  (not connected)

1 MGM Node(s)
MGM node:       1  (Version: 3.5.0)

NDB>

What additional information can I provide?

Would you like direct access to the machines that are having the problem so that you can investigate it more fully?

Is there any way I can turn on more verbose debugging/tracing to determine exactly why the mgmt server can not see the ndb nodes?

Sorry for not getting back to you more...
Anyway,
yes: I'm currently out of ideas, and you are apparently not the only one with the problem, so getting access to your machine would be nice...

W.r.t. debugging, there is no way to enable more debugging without modifying the source.

Hello,

I have more or less the same problem on a two-computer cluster:

Computer 1:
 - ndb_mgmd on node 1
 - ndbd on node 2

Computer 2:
 - ndbd on node 3

NDB> show
Cluster Configuration
---------------------
2 NDB Node(s)
DB node:        2  (Version: 3.5.0)
DB node:        3  (not connected)

4 API Node(s)
API node:       11  (not connected)
API node:       12  (not connected)
API node:       13  (not connected)
API node:       14  (not connected)

1 MGM Node(s)
MGM node:       1  (Version: 3.5.0)

on computer 1:
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       User       Inode      PID/Program name
tcp        0      0 127.0.0.1:28002         0.0.0.0:*               LISTEN     1008       141724     18627/ndbd
tcp        0      0 127.0.0.1:28004         0.0.0.0:*               LISTEN     1008       141709     18614/ndb_mgmd
tcp        0      0 0.0.0.0:32792           0.0.0.0:*               LISTEN     1008       141704     18614/ndb_mgmd
tcp        0      0 0.0.0.0:2200            0.0.0.0:*               LISTEN     1008       141703     18614/ndb_mgmd
tcp        0      0 127.0.0.1:32795         127.0.0.1:2200          TIME_WAIT  0          0          -
tcp        0      0 127.0.0.1:28003         127.0.0.1:32794         ESTABLISHED 1008       141725     18614/ndb_mgmd
tcp        0      0 127.0.0.1:32794         127.0.0.1:28003         ESTABLISHED 1008       141723     18627/ndbd

On computer 2 there are no open sockets for ndbd, except for a very short time when it contacts ndb_mgmd on port 2200.

Problem solved for me: I changed
HostName: localhost
to HostName: <external-ipaddress>
and everything works now.
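
A quick scan of config.ini can flag the kind of HostName entry described above before the cluster is started. This is a rough sketch, not an NDB tool: the function name loopback_hostnames is hypothetical, it only understands the "Key: Value" layout shown earlier in this report, and it only recognizes the literal name localhost and 127.x.x.x addresses (no DNS lookup is performed).

```python
def loopback_hostnames(config_text):
    """Return (section, hostname) pairs whose HostName is a literal loopback value."""
    findings = []
    section = None
    for raw_line in config_text.splitlines():
        line = raw_line.strip()
        if line.startswith("[") and line.endswith("]"):
            # Section headers such as [COMPUTER] may repeat, so track the latest one.
            section = line[1:-1]
        elif ":" in line:
            key, value = (part.strip() for part in line.split(":", 1))
            is_loopback = value.lower() == "localhost" or value.startswith("127.")
            if key.lower() == "hostname" and is_loopback:
                findings.append((section, value))
    return findings


if __name__ == "__main__":
    sample = "[COMPUTER]\nId: 1\nHostName: localhost\n"
    print(loopback_hostnames(sample))
```

A hostname that resolves to 127.0.0.1 through /etc/hosts would not be caught by this check, so an empty result does not prove the configuration is free of loopback addresses.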

Hi,

I just submitted a patch that performs *much* more validation before starting.
Could you try it and see if it gives you a clever error message :-)

I will do a bk pull on monday and let you know what I find.

ndb_mgmd now segfaults when I try to run it with the newest bk clone...

Process 28382 resumed (parent 28381 ready)
child_stack=0x4001db08, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID|CLONE_DETACHED, parent_tidptr=0x4001dbf8, {entry_number:0, base_addr:0x4001dbb0, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0x4001dbf8) = 28382
[pid 28382] --- SIGSTOP (Stopped (signal)) @ 0 (0) ---
[pid 28381] mmap2(NULL, 32768, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0 <unfinished ...>
[pid 28382] --- SIGSEGV (Segmentation fault) @ 0 (0) ---
Process 28381 detached
Process 28382 detached