Bug #4761 | management server does not see other cluster nodes | ||
---|---|---|---|
Submitted: | 26 Jul 2004 21:24 | Modified: | 26 Aug 2004 11:51 |
Reporter: | Justin Swanhart | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Cluster: Cluster (NDB) storage engine | Severity: | S1 (Critical) |
Version: | mysql-4.1.4-beta (Source distribution) | OS: | Linux (RHAS 3) |
Assigned to: | Jonas Oreland | CPU Architecture: | Any |
[26 Jul 2004 21:24]
Justin Swanhart
[27 Jul 2004 1:36]
Jonas Oreland
The 32774 error indicates that the mgm server fails to bind its socket. This likely means that the ports are already busy. Check with "netstat -a" while the ndb cluster is not started.
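As a complement to reading "netstat -a" by eye, a port can be probed directly by trying to bind to it. The following is a minimal sketch, not part of NDB: the helper name port_is_free is hypothetical, and the ports 2200 and 2201 are simply the command and statistics ports discussed in this report.

```python
import errno
import socket


def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can be bound to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError as exc:
            if exc.errno == errno.EADDRINUSE:
                return False  # another socket already owns this port
            raise  # e.g. EACCES or EADDRNOTAVAIL: a different problem
        return True


if __name__ == "__main__":
    # Ports taken from this report; adjust to the local config.ini.
    for port in (2200, 2201):
        print(port, "free" if port_is_free(port) else "busy")
```

Run it before starting ndb_mgmd. A "busy" result would support the theory that another process holds the port; any other exception points to a different cause.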
[27 Jul 2004 1:55]
Justin Swanhart
It looks like the ndb servers aren't using the ports based on the [TCP DEFAULT] section...

[root@george ndb_nodes]# lsof|grep ndbd|grep TCP
ndbd 19861 root 4u IPv4 1223458 TCP george.db.eldosales.com:2206->john.db.eldosales.com:49001 (ESTABLISHED)
ndbd 19861 root 5u IPv4 1228037 TCP george.db.eldosales.com:50905->ringo.db.eldosales.com:2202 (ESTABLISHED)
ndbd 19861 root 6u IPv4 1223273 TCP george.db.eldosales.com:2205->paul.db.eldosales.com:49198 (ESTABLISHED)
ndbd 19861 root 18u IPv4 2889699 TCP george.db.eldosales.com:2213 (LISTEN)
ndbd 19861 root 19u IPv4 1225543 TCP george.db.eldosales.com:2221 (LISTEN)
ndbd 19861 root 20u IPv4 1225544 TCP george.db.eldosales.com:2225 (LISTEN)
ndbd 19861 root 23u IPv4 2888399 TCP george.db.eldosales.com:2217->george.db.eldosales.com:58188 (ESTABLISHED)

Here is the netstat -a output (before running the management server). As you can see, there isn't anything running on port 2200, which is what I have the management server configured for in the config file.

[root@george ndb_nodes]# netstat -a
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 0 *:32768 *:* LISTEN
tcp 0 0 *:nfs *:* LISTEN
tcp 0 0 *:833 *:* LISTEN
tcp 0 0 localhost:32769 *:* LISTEN
tcp 0 0 *:shell *:* LISTEN
tcp 0 0 *:32772 *:* LISTEN
tcp 0 0 george.db.eldosale:2213 *:* LISTEN
tcp 0 0 *:mysql *:* LISTEN
tcp 0 0 george.db.eldosale:2221 *:* LISTEN
tcp 0 0 *:sunrpc *:* LISTEN
tcp 0 0 george.db.eldosale:2225 *:* LISTEN
tcp 0 0 *:851 *:* LISTEN
tcp 0 0 *:ftp *:* LISTEN
tcp 0 0 *:ssh *:* LISTEN
tcp 0 0 localhost:ipp *:* LISTEN
tcp 0 0 *:telnet *:* LISTEN
tcp 0 0 localhost:smtp *:* LISTEN
tcp 0 0 george.db.eldosale:2217 george.db.eldosal:58188 ESTABLISHED
tcp 0 0 george.db.eldosal:58188 george.db.eldosale:2217 ESTABLISHED
tcp 0 0 george.db.eldosale:2205 paul.db.eldosales:49198 ESTABLISHED
tcp 0 0 george.db.eldosal:58190 john.db.eldosales.:2219 ESTABLISHED
tcp 0 0 george.db.eldosal:58189 paul.db.eldosales.:2218 ESTABLISHED
tcp 0 0 george.db.eldosale:2206 john.db.eldosales:49001 ESTABLISHED
tcp 0 0 george.db.eldosal:58191 ringo.db.eldosales:2216 ESTABLISHED
tcp 0 0 george.db.eldosal:50905 ringo.db.eldosales:2202 ESTABLISHED
tcp 0 0 george.db.eldosales:ssh eldo-puter-219.eld:3538 ESTABLISHED
tcp 0 0 george.db.eldosales:ssh eldo-puter-219.eld:3544 ESTABLISHED
tcp 0 0 george.db.eldosal:58383 ringo.db.eldosales.:ssh ESTABLISHED
udp 0 0 *:1024 *:*
udp 424 0 *:nfs *:*
udp 0 0 *:1026 *:*
udp 0 0 *:830 *:*
udp 0 0 *:715 *:*
udp 0 0 *:848 *:*
udp 0 0 *:sunrpc *:*
udp 0 0 *:ipp *:*
Active UNIX domain sockets (servers and established)
Proto RefCnt Flags Type State I-Node Path
unix 2 [ ACC ] STREAM LISTENING 2367 /dev/gpmctl
unix 10 [ ] DGRAM 1818 /dev/log
unix 2 [ ACC ] STREAM LISTENING 2422 /tmp/.font-unix/fs7100
unix 2 [ ACC ] STREAM LISTENING 2888376 /tmp/mysql.sock
unix 3 [ ] STREAM CONNECTED 2885698
unix 3 [ ] STREAM CONNECTED 2885697
unix 2 [ ] DGRAM 3666
unix 2 [ ] DGRAM 2456
unix 2 [ ] DGRAM 2385
unix 2 [ ] DGRAM 2350
unix 2 [ ] DGRAM 2336
unix 2 [ ] DGRAM 2190
unix 2 [ ] DGRAM 1894
unix 2 [ ] DGRAM 1826
[27 Jul 2004 2:34]
Justin Swanhart
I shut everything down on the box except for ssh and other necessary services.

[root@george bin]# netstat -a
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 0 *:ssh *:* LISTEN
tcp 0 0 localhost:ipp *:* LISTEN
tcp 0 0 localhost:smtp *:* LISTEN
tcp 0 0 george.db.eldosales:ssh eldo-puter-219.eld:3538 ESTABLISHED
tcp 0 232 george.db.eldosales:ssh eldo-puter-219.eld:3544 ESTABLISHED
tcp 0 0 george.db.eldosal:58383 ringo.db.eldosales.:ssh ESTABLISHED
udp 0 0 *:ipp *:*
Active UNIX domain sockets (servers and established)
Proto RefCnt Flags Type State I-Node Path
unix 2 [ ACC ] STREAM LISTENING 2367 /dev/gpmctl
unix 7 [ ] DGRAM 1818 /dev/log
unix 2 [ ACC ] STREAM LISTENING 2422 /tmp/.font-unix/fs7100
unix 3 [ ] STREAM CONNECTED 2885698
unix 3 [ ] STREAM CONNECTED 2885697
unix 2 [ ] DGRAM 2456
unix 2 [ ] DGRAM 2385
unix 2 [ ] DGRAM 2350
unix 2 [ ] DGRAM 2336
unix 2 [ ] DGRAM 1826

After starting the server, I verify that it is listening on port 2200:

[root@george mgmt]# ./ndb_mgmd -c config.ini -d
[root@george mgmt]# netstat -a
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 0 *:61515 *:* LISTEN
tcp 0 0 *:ssh *:* LISTEN
tcp 0 0 localhost:ipp *:* LISTEN
tcp 0 0 *:2200 *:* LISTEN
...

Notice that it also appears to be listening on 61515, and it isn't listening on 2201, which I would think it should be for the stats port. I go ahead and start up ndbd, then do another netstat:

[root@george data]# netstat -a
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 0 george.db.eldosal:10017 *:* LISTEN
tcp 0 0 george.db.eldosal:10021 *:* LISTEN
tcp 0 0 george.db.eldosal:10025 *:* LISTEN
tcp 0 0 *:61517 *:* LISTEN
tcp 0 0 george.db.eldosal:10005 *:* LISTEN
tcp 0 0 *:ssh *:* LISTEN
tcp 0 0 localhost:ipp *:* LISTEN
tcp 0 0 *:2200 *:* LISTEN
tcp 0 0 localhost:smtp *:* LISTEN
tcp 0 0 george.db.eldosal:10013 *:* LISTEN
tcp 0 0 george.db.eldosal:10006 john.db.eldosales:62890 ESTABLISHED
tcp 0 180 george.db.eldosales:ssh eldo-puter-219.eld:3544 ESTABLISHED
tcp 0 0 george.db.eldosal:61896 ringo.db.eldosale:10002 ESTABLISHED
udp 0 0 *:ipp *:*
Active UNIX domain sockets (servers and established)
Proto RefCnt Flags Type State I-Node Path
unix 2 [ ACC ] STREAM LISTENING 2367 /dev/gpmctl
unix 7 [ ] DGRAM 1818 /dev/log
unix 2 [ ACC ] STREAM LISTENING 2422 /tmp/.font-unix/fs7100
unix 2 [ ] DGRAM 2456
unix 2 [ ] DGRAM 2385
unix 2 [ ] DGRAM 2350
unix 2 [ ] DGRAM 2336
unix 2 [ ] DGRAM 1826

Then I pull up the mgm client to see if the server sees the nodes...

[root@george bin]# ./ndb_mgm
-- NDB Cluster -- Management Client --
Connecting to Management Server: localhost:2200
NDB> show
Cluster Configuration
---------------------
4 NDB Node(s)
DB node: 2 (not connected)
DB node: 3 (not connected)
DB node: 4 (not connected)
DB node: 5 (not connected)
4 API Node(s)
API node: 6 (not connected)
API node: 7 (not connected)
API node: 8 (not connected)
API node: 9 (not connected)
1 MGM Node(s)
MGM node: 1 (Version: 3.5.0)
NDB>

No dice.
[28 Jul 2004 18:02]
Jonas Oreland
Do you still get the "report Error" when you "shut everything down"?
[28 Jul 2004 19:37]
Justin Swanhart
yeah. I get the report error when I start the management daemon, even when no other nodes are running.
[28 Jul 2004 20:59]
Justin Swanhart
[mysql@george mgmt]$ head node1.out
NDB Cluster Management Server. Version 3.5.0 (beta)
Command port: 2200, Statistics port: 0
[mysql@george mgmt]$ tail node1.out
reportError (5, 32774)
reportError (2, 32774)
reportError (3, 32774)
reportError (4, 32774)
reportError (5, 32774)

Could the errors be because the server really is trying to use port 0 for the stats port?
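For context on the port 0 question: in the BSD sockets API, binding to port 0 does not fail. It asks the kernel to pick a free ephemeral port. That would be consistent with the unexpected listener on a high port such as 61515 seen earlier, although nothing in this report confirms that ndb_mgmd actually does this. A minimal sketch (the helper name bind_and_report is hypothetical):

```python
import socket


def bind_and_report(requested_port, host="127.0.0.1"):
    """Bind a TCP socket and return the port number the OS actually assigned."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, requested_port))
        # getsockname() reveals the real port, which differs from 0
        # when the kernel was asked to choose one.
        return sock.getsockname()[1]


if __name__ == "__main__":
    print("requested 0, got", bind_and_report(0))
```

If this is what happens, a statistics port of 0 would not by itself explain a bind failure.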
[6 Aug 2004 11:22]
Jonas Oreland
Justin, is this bug still active?
[6 Aug 2004 23:06]
Justin Swanhart
Yes. My mgm node still can't see the data nodes. Here is a snippet from an strace that I did to see what was going on:

[pid 21308] bind(9, {sa_family=AF_INET, sin_port=htons(10011), sin_addr=inet_addr("140.99.99.117")}, 16 <unfinished ...>
[pid 21308] <... bind resumed> ) = -1 EADDRNOTAVAIL (Cannot assign requested address)

The reason it can't bind to 140.99.99.117 is that george (the box that I ran the ndb server on) has an IP address of 140.99.99.114. I'm not sure where it is getting the .117 address from. I compiled on the machine with the .117 address (ringo), but I can't think of any reason that would cause the problem.

I am going to "bk pull" and see if the problem exists in the newest build, since I know some other changes have been made to the management server.
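The EADDRNOTAVAIL result in the strace is what bind() returns when the requested address is not configured on any local interface. The sketch below reproduces that behaviour outside NDB. The helper name try_bind is hypothetical, and 192.0.2.1 is a reserved documentation address used here only as a stand-in for "an IP this host does not own".

```python
import errno
import socket


def try_bind(ip, port=0):
    """Try to bind a TCP socket to ip; return (ok, errno_name_or_None)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((ip, port))
        except OSError as exc:
            # Translate the numeric errno into its symbolic name.
            return False, errno.errorcode.get(exc.errno, str(exc.errno))
        return True, None


if __name__ == "__main__":
    # Loopback is local, so this should succeed.
    print("127.0.0.1 ->", try_bind("127.0.0.1"))
    # Stand-in for a non-local address, like the .117 address on george.
    print("192.0.2.1 ->", try_bind("192.0.2.1"))
```

Running the same check on each node against the addresses in the cluster configuration would show whether a node has been handed another machine's address.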
[7 Aug 2004 23:50]
Justin Swanhart
I pulled the latest sources on 8/6/04. I no longer get the reportError messages; however, the management server still has no contact from the data nodes:

[mysql@ringo data]$ ndb_mgm
-- NDB Cluster -- Management Client --
Connecting to Management Server: george:2200
NDB> show
Cluster Configuration
---------------------
4 NDB Node(s)
DB node: 2 (not connected)
DB node: 3 (not connected)
DB node: 4 (not connected)
DB node: 5 (not connected)
4 API Node(s)
API node: 6 (not connected)
API node: 7 (not connected)
API node: 8 (not connected)
API node: 9 (not connected)
1 MGM Node(s)
MGM node: 1 (Version: 3.5.0)
NDB>
[17 Aug 2004 22:33]
Justin Swanhart
What additional information can I provide? Would you like direct access to the machines that are having the problem so that you can investigate it more fully? Is there any way I can turn on more verbose debugging/tracing to determine exactly why the mgmt server can not see the ndb nodes?
[17 Aug 2004 22:51]
Jonas Oreland
Sorry for not getting back to you sooner... Anyway, yes: I'm currently out of ideas, but you are apparently not the only one with the problem, so getting access to your machine would be nice. W.r.t. debugging, there is no way to enable more debugging without modifying the source.
[19 Aug 2004 14:44]
benoit plessis
Hello, I have more or less the same problem on a two-computer cluster:

Computer 1:
- ndb_mgmd on node 1
- ndbd on node 2
Computer 2:
- ndbd on node 3

NDB> show
Cluster Configuration
---------------------
2 NDB Node(s)
DB node: 2 (Version: 3.5.0)
DB node: 3 (not connected)
4 API Node(s)
API node: 11 (not connected)
API node: 12 (not connected)
API node: 13 (not connected)
API node: 14 (not connected)
1 MGM Node(s)
MGM node: 1 (Version: 3.5.0)

On computer 1:

Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State User Inode PID/Program name
tcp 0 0 127.0.0.1:28002 0.0.0.0:* LISTEN 1008 141724 18627/ndbd
tcp 0 0 127.0.0.1:28004 0.0.0.0:* LISTEN 1008 141709 18614/ndb_mgmd
tcp 0 0 0.0.0.0:32792 0.0.0.0:* LISTEN 1008 141704 18614/ndb_mgmd
tcp 0 0 0.0.0.0:2200 0.0.0.0:* LISTEN 1008 141703 18614/ndb_mgmd
tcp 0 0 127.0.0.1:32795 127.0.0.1:2200 TIME_WAIT 0 0 -
tcp 0 0 127.0.0.1:28003 127.0.0.1:32794 ESTABLISHED 1008 141725 18614/ndb_mgmd
tcp 0 0 127.0.0.1:32794 127.0.0.1:28003 ESTABLISHED 1008 141723 18627/ndbd

On computer 2 there are no open sockets for ndbd, except for a very short time when contacting ndb_mgmd on port 2200.
[19 Aug 2004 16:48]
benoit plessis
Problem solved for me: I changed "HostName: localhost" to "HostName: <external-ipaddress>" and everything works now.
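The fix above suggests a simple pre-flight check: scan the cluster config for HostName entries that point at loopback. Nodes on other machines cannot reach those, and the netstat output in the previous comment shows the NDB listeners on 127.0.0.1. This is a rough sketch only. The helper names are hypothetical, the parser is a plain line scan rather than the real NDB config parser, and host names other than "localhost" are not resolved.

```python
import ipaddress


def is_loopback(value):
    """True for 'localhost' or a literal loopback IP such as 127.0.0.1."""
    if value.lower() in ("localhost", "localhost.localdomain"):
        return True
    try:
        return ipaddress.ip_address(value).is_loopback
    except ValueError:
        return False  # some other host name; not resolved in this sketch


def loopback_hostnames(config_text):
    """Return (line_number, value) for each HostName entry set to loopback."""
    hits = []
    for lineno, raw in enumerate(config_text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        for sep in (":", "="):  # accept 'HostName: x' and 'HostName=x'
            key, found, value = line.partition(sep)
            if found and key.strip().lower() == "hostname":
                if is_loopback(value.strip()):
                    hits.append((lineno, value.strip()))
                break
    return hits
```

Calling loopback_hostnames(open("config.ini").read()) and getting a non-empty list on a multi-machine cluster would point to the same misconfiguration described here.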
[20 Aug 2004 22:16]
Jonas Oreland
Hi, I just submitted a patch that performs *much* more validation before starting. If you could try it and see whether it gives you a clever error message :-)
[22 Aug 2004 2:06]
Justin Swanhart
I will do a bk pull on monday and let you know what I find.
[24 Aug 2004 3:44]
Justin Swanhart
ndb_mgmd now segfaults when I try to run it with the newest bk clone...

Process 28382 resumed (parent 28381 ready)
child_stack=0x4001db08, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID|CLONE_DETACHED, parent_tidptr=0x4001dbf8, {entry_number:0, base_addr:0x4001dbb0, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0x4001dbf8) = 28382
[pid 28382] --- SIGSTOP (Stopped (signal)) @ 0 (0) ---
[pid 28381] mmap2(NULL, 32768, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0 <unfinished ...>
[pid 28382] --- SIGSEGV (Segmentation fault) @ 0 (0) ---
Process 28381 detached
Process 28382 detached