Description:
It seems that the 'address binding' in ndbd has a flaw. Initially, when the mgm and data nodes are started, everything seems fine:
Cluster Configuration
---------------------
[ndbd(NDB)] 2 node(s)
id=4 @10.0.12.168 (Version: 7.0.13, starting, Nodegroup: 0)
id=5 @10.0.12.169 (Version: 7.0.13, Nodegroup: 0, Master)
[ndb_mgmd(MGM)] 1 node(s)
id=1 @10.0.12.146 (Version: 7.0.13)
[mysqld(API)] 5 node(s)
id=2 (not connected, accepting connect from jessie)
id=3 (not connected, accepting connect from woody)
id=6 (not connected, accepting connect from nemo)
id=7 (not connected, accepting connect from ndb-api-01)
id=8 (not connected, accepting connect from ndb-api-02)
Here you can see both data nodes connected to the mgmd, using the IP addresses assigned to them (10.0.12.168 and 10.0.12.169). Both machines have multiple addresses; this is from node 4:
frozone:/var/lib/mysql-cluster-ws# ip addr sh
1: eth0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop qlen 1000
link/ether 00:1b:78:72:0a:0e brd ff:ff:ff:ff:ff:ff
2: eth1: <BROADCAST,MULTICAST,UP,10000> mtu 1500 qdisc pfifo_fast qlen 1000
link/ether 00:1b:78:72:f9:fa brd ff:ff:ff:ff:ff:ff
inet 10.0.12.225/25 brd 10.0.12.255 scope global eth1
inet 10.0.12.168/25 brd 10.0.12.255 scope global secondary eth1:168
3: lo: <LOOPBACK,UP,10000> mtu 16436 qdisc noqueue
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
frozone:/var/lib/mysql-cluster-ws# ps x | grep ndb
30383 ? Ss 0:00 /opt/mysql/bin/ndbd -c nemo --bind-address 10.0.12.168
30384 ? Sl 0:02 /opt/mysql/bin/ndbd -c nemo --bind-address 10.0.12.168
30429 pts/0 S+ 0:00 grep ndb
Things go wrong when ndb_mgmd is restarted on node 1: both data nodes suddenly ignore the address they are bound to and re-establish the connection to the mgm node from their primary IP (10.0.12.225 for node 4, 10.0.12.224 for node 5):
ndb_mgm> quit
geoffrey@nemo:~$ sudo /etc/init.d/mysql-ndb-mgm restart
Stopping MySQL NDB cluster management server: ndb_mgmd.
Starting MySQL NDB cluster management server: ndb_mgmd2010-03-25 15:47:54 [/var/log/mysql/mysql-cluster.log] INFO -- NDB Cluster Management Server. mysql-5.1.41 ndb-7.0.13
2010-03-25 15:47:54 [/var/log/mysql/mysql-cluster.log] INFO -- Loaded config from '/var/lib/mysql-mgmd/config/ndb_1_config.bin.4'
.
geoffrey@nemo:~$ ndb_mgm
-- NDB Cluster -- Management Client --
ndb_mgm> SHOW
Connected to Management Server at: localhost:1186
Cluster Configuration
---------------------
[ndbd(NDB)] 2 node(s)
id=4 @10.0.12.225 (Version: 7.0.13, Nodegroup: 0)
id=5 @10.0.12.224 (Version: 7.0.13, Nodegroup: 0, Master)
[ndb_mgmd(MGM)] 1 node(s)
id=1 @10.0.12.146 (Version: 7.0.13)
[mysqld(API)] 5 node(s)
id=2 (not connected, accepting connect from jessie)
id=3 (not connected, accepting connect from woody)
id=6 (not connected, accepting connect from nemo)
id=7 (not connected, accepting connect from ndb-api-01)
id=8 (not connected, accepting connect from ndb-api-02)
ndb_mgm>
This is data node 4 again:
frozone:/var/lib/mysql-cluster-ws# ps x | grep ndb
30383 ? Ss 0:00 /opt/mysql/bin/ndbd -c nemo --bind-address 10.0.12.168
30384 ? Sl 0:02 /opt/mysql/bin/ndbd -c nemo --bind-address 10.0.12.168
30435 pts/0 S+ 0:00 grep ndb
frozone:/var/lib/mysql-cluster-ws# lsof -p 30384 -n | grep IPv4
ndbd 30384 root 4u IPv4 662188027 TCP 10.0.12.225:59453->10.0.12.146:1186 (ESTABLISHED)
ndbd 30384 root 8u IPv4 662187858 TCP 10.0.12.168:43369 (LISTEN)
ndbd 30384 root 9u IPv4 662187860 TCP 10.0.12.168:55196 (LISTEN)
ndbd 30384 root 10u IPv4 662187862 TCP 10.0.12.168:50450 (LISTEN)
ndbd 30384 root 11u IPv4 662187864 TCP 10.0.12.168:39031 (LISTEN)
ndbd 30384 root 12u IPv4 662187866 TCP 10.0.12.168:55418 (LISTEN)
ndbd 30384 root 13u IPv4 662187868 TCP 10.0.12.168:46324 (LISTEN)
ndbd 30384 root 16u IPv4 662187871 TCP 10.0.12.168:43369->10.0.12.169:40889 (ESTABLISHED)
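The first line of the lsof output shows that the connection from node 4 (data node) to node 1 (management node) now uses 10.0.12.225 as its source address instead of the address ndbd was bound to (10.0.12.168). The listening sockets and the connection to the other data node (10.0.12.169) still use 10.0.12.168.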
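The same check can be automated. The following is a minimal Python sketch, not part of ndbd or its tooling: it scans lsof output for established TCP connections whose local address differs from the expected bind address. The function name misbound_connections and the regular expression are my own assumptions, written against the lsof line format shown above.

```python
import re

# Matches the tail of an lsof TCP entry for an established connection, e.g.
# "TCP 10.0.12.225:59453->10.0.12.146:1186 (ESTABLISHED)".
# LISTEN entries have no "->" and are therefore skipped.
CONN_RE = re.compile(
    r"TCP (?P<local>[\d.]+):\d+->(?P<remote>[\d.]+):(?P<rport>\d+) \(ESTABLISHED\)"
)


def misbound_connections(lsof_output, bind_address):
    """Return (local_ip, remote_ip, remote_port) for every established
    connection whose local address is not bind_address."""
    offenders = []
    for line in lsof_output.splitlines():
        match = CONN_RE.search(line)
        if match and match.group("local") != bind_address:
            offenders.append(
                (match.group("local"), match.group("remote"), int(match.group("rport")))
            )
    return offenders


# Three lines copied from the lsof output of data node 4 above.
SAMPLE = """\
ndbd 30384 root 4u IPv4 662188027 TCP 10.0.12.225:59453->10.0.12.146:1186 (ESTABLISHED)
ndbd 30384 root 8u IPv4 662187858 TCP 10.0.12.168:43369 (LISTEN)
ndbd 30384 root 16u IPv4 662187871 TCP 10.0.12.168:43369->10.0.12.169:40889 (ESTABLISHED)
"""

if __name__ == "__main__":
    for local, remote, port in misbound_connections(SAMPLE, "10.0.12.168"):
        print(f"{local} -> {remote}:{port} does not use the bind address")
```

With the sample above, only the connection to the management server on port 1186 is reported; the connection to the other data node is correctly bound.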
This is the config.ini for the cluster (I've replaced hostnames with IP addresses for your convenience):
geoffrey@nemo:/var/lib/mysql-mgmd$ egrep -v '^#' /etc/mysql/ndb_mgmd.cnf
[NDBD DEFAULT]
NoOfReplicas= 2
[NDB_MGMD]
HostName= 10.0.12.146
DataDir= /var/lib/mysql-mgmd
Id= 1
[MYSQLD]
Id= 6
HostName= 10.0.12.146
[MYSQLD]
Id= 2
HostName= 10.0.12.152
[MYSQLD]
Id= 3
HostName= 10.0.12.151
[NDBD]
Id= 4
HostName= 10.0.12.168
DataMemory=1024M
IndexMemory=1700M
DataDir= /var/lib/mysql-cluster-ws
BackupDataDir= /vol/backup/ws-cluster
[NDBD]
Id= 5
HostName= 10.0.12.169
DataMemory=1024M
IndexMemory=1700M
DataDir= /var/lib/mysql-cluster-ws
BackupDataDir= /vol/backup/ws-cluster
[MYSQLD]
Id= 7
HostName= 10.0.12.168
[MYSQLD]
Id= 8
HostName= 10.0.12.169
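To compare this configuration against what the management client reports, something like the following Python sketch could be used. It is only an illustration under assumptions: the helper names (parse_sections, ndbd_hosts, address_mismatches) are hypothetical, the parser is hand-written because the file repeats section names such as [NDBD], and the comparison only works when HostName holds an IP address, as in the listing above.

```python
import re


def parse_sections(config_text):
    """Split a config.ini into a list of (SECTION_NAME, {key: value}) pairs.
    Repeated section names such as [NDBD] are kept as separate entries."""
    sections = []
    for raw in config_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("[") and line.endswith("]"):
            sections.append((line[1:-1].upper(), {}))
        elif "=" in line and sections:
            key, value = line.split("=", 1)
            sections[-1][1][key.strip().lower()] = value.strip()
    return sections


def ndbd_hosts(config_text):
    """Map node id -> HostName for every [NDBD] section."""
    return {
        int(params["id"]): params["hostname"]
        for name, params in parse_sections(config_text)
        if name == "NDBD" and "id" in params and "hostname" in params
    }


# Picks "id=4 @10.0.12.225" style entries out of the SHOW output.
SHOW_RE = re.compile(r"id=(\d+)\s+@([\d.]+)")


def address_mismatches(config_text, show_output):
    """Return {node_id: (configured, reported)} for data nodes whose
    address in the SHOW output differs from the configured HostName."""
    expected = ndbd_hosts(config_text)
    reported = {int(node): addr for node, addr in SHOW_RE.findall(show_output)}
    return {
        node: (expected[node], reported[node])
        for node in expected
        if node in reported and reported[node] != expected[node]
    }


# Excerpts copied from the config and from the SHOW output after the restart.
CONFIG = """\
[NDBD DEFAULT]
NoOfReplicas= 2
[NDBD]
Id= 4
HostName= 10.0.12.168
[NDBD]
Id= 5
HostName= 10.0.12.169
"""

SHOW_AFTER_RESTART = """\
id=4 @10.0.12.225 (Version: 7.0.13, Nodegroup: 0)
id=5 @10.0.12.224 (Version: 7.0.13, Nodegroup: 0, Master)
id=1 @10.0.12.146 (Version: 7.0.13)
"""

if __name__ == "__main__":
    for node, (want, got) in sorted(address_mismatches(CONFIG, SHOW_AFTER_RESTART).items()):
        print(f"node {node}: configured {want}, connected from {got}")
```

Run against the SHOW output taken before the restart, the same function returns an empty result, since both data nodes then report their configured addresses.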
How to repeat:
- Start ndb_mgmd
- Start ndbd, bound to any IP alias
- Verify that ndbd is using the assigned IP address
- Restart ndb_mgmd
- All ndbd processes have re-established communication with the MGM node using an IP address other than the one they are bound to
Suggested fix:
Make sure ndbd always respects the bind-address parameter when it re-connects to ndb_mgmd.
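For illustration only, this is the general technique the fix amounts to: bind the client socket to the desired local address before every connect, including reconnects. The sketch below is Python, not ndbd's actual code, and connect_from and demo are hypothetical names. The demo runs entirely on loopback so it can be tried without a cluster; in the scenario above the bind address would be 10.0.12.168 and the remote would be 10.0.12.146:1186.

```python
import socket


def connect_from(bind_address, remote_host, remote_port, timeout=5.0):
    """Open a TCP connection whose local end is pinned to bind_address.

    Port 0 lets the kernel choose an ephemeral source port while the
    source IP stays fixed. A reconnect must call this again rather than
    opening an unbound socket, otherwise the kernel picks the source
    address itself (typically the primary address of the interface).
    """
    return socket.create_connection(
        (remote_host, remote_port),
        timeout=timeout,
        source_address=(bind_address, 0),
    )


def demo():
    """Connect to a throwaway loopback listener and return the source IP used."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # stand-in for the management server
    listener.listen(1)
    port = listener.getsockname()[1]
    try:
        conn = connect_from("127.0.0.1", "127.0.0.1", port)
        try:
            return conn.getsockname()[0]
        finally:
            conn.close()
    finally:
        listener.close()


if __name__ == "__main__":
    print("source address used:", demo())
```

The point of the sketch is that the source address is passed on every call; the behaviour reported here looks as if the reconnect path skips that step, although I have not confirmed this in the ndbd source.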