Bug #52366 ndbd bind-address fails after mgmd restart
Submitted: 25 Mar 2010 15:00 Modified: 5 Oct 2016 22:58
Reporter: Geoffrey de Kleijn
Status: Can't repeat
Category: MySQL Cluster: Cluster (NDB) storage engine Severity: S3 (Non-critical)
Version: mysql-5.1-telco-7.0 OS: Linux (Debian 5.0)
Assigned to: MySQL Verification Team CPU Architecture:Any
Tags: cluster mysql-5.1.41 ndb-7.0.13

[25 Mar 2010 15:00] Geoffrey de Kleijn
Description:
It seems that the 'address binding' in ndbd has a flaw. Initially, when the mgm and data nodes are started, everything seems fine:

Cluster Configuration
---------------------
[ndbd(NDB)]     2 node(s)
id=4    @10.0.12.168  (Version: 7.0.13, starting, Nodegroup: 0)
id=5    @10.0.12.169  (Version: 7.0.13, Nodegroup: 0, Master)

[ndb_mgmd(MGM)] 1 node(s)
id=1    @10.0.12.146  (Version: 7.0.13)

[mysqld(API)]   5 node(s)
id=2 (not connected, accepting connect from jessie)
id=3 (not connected, accepting connect from woody)
id=6 (not connected, accepting connect from nemo)
id=7 (not connected, accepting connect from ndb-api-01)
id=8 (not connected, accepting connect from ndb-api-02)

Here you can see both data nodes connected to the mgmd, using the IP addresses assigned to them (10.0.12.168 and 10.0.12.169). Both machines have multiple addresses; this is from node 4:

frozone:/var/lib/mysql-cluster-ws# ip addr sh
1: eth0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop qlen 1000
    link/ether 00:1b:78:72:0a:0e brd ff:ff:ff:ff:ff:ff
2: eth1: <BROADCAST,MULTICAST,UP,10000> mtu 1500 qdisc pfifo_fast qlen 1000
    link/ether 00:1b:78:72:f9:fa brd ff:ff:ff:ff:ff:ff
    inet 10.0.12.225/25 brd 10.0.12.255 scope global eth1
    inet 10.0.12.168/25 brd 10.0.12.255 scope global secondary eth1:168
3: lo: <LOOPBACK,UP,10000> mtu 16436 qdisc noqueue
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo

frozone:/var/lib/mysql-cluster-ws# ps x | grep ndb
30383 ?        Ss     0:00 /opt/mysql/bin/ndbd -c nemo --bind-address 10.0.12.168
30384 ?        Sl     0:02 /opt/mysql/bin/ndbd -c nemo --bind-address 10.0.12.168
30429 pts/0    S+     0:00 grep ndb

Things go wrong when ndb_mgmd is restarted on node 1: both data nodes suddenly ignore the address they're bound to and re-establish their connection to the mgm node from the primary IP (10.0.12.225 for node 4, 10.0.12.224 for node 5):

ndb_mgm> quit
geoffrey@nemo:~$ sudo /etc/init.d/mysql-ndb-mgm restart
Stopping MySQL NDB cluster management server: ndb_mgmd.
Starting MySQL NDB cluster management server: ndb_mgmd2010-03-25 15:47:54 [/var/log/mysql/mysql-cluster.log] INFO     -- NDB Cluster Management Server. mysql-5.1.41 ndb-7.0.13
2010-03-25 15:47:54 [/var/log/mysql/mysql-cluster.log] INFO     -- Loaded config from '/var/lib/mysql-mgmd/config/ndb_1_config.bin.4'

.
geoffrey@nemo:~$ ndb_mgm
-- NDB Cluster -- Management Client --
ndb_mgm> SHOW
Connected to Management Server at: localhost:1186
Cluster Configuration
---------------------
[ndbd(NDB)]     2 node(s)
id=4    @10.0.12.225  (Version: 7.0.13, Nodegroup: 0)
id=5    @10.0.12.224  (Version: 7.0.13, Nodegroup: 0, Master)

[ndb_mgmd(MGM)] 1 node(s)
id=1    @10.0.12.146  (Version: 7.0.13)

[mysqld(API)]   5 node(s)
id=2 (not connected, accepting connect from jessie)
id=3 (not connected, accepting connect from woody)
id=6 (not connected, accepting connect from nemo)
id=7 (not connected, accepting connect from ndb-api-01)
id=8 (not connected, accepting connect from ndb-api-02)

ndb_mgm>

This is data node 4 again:

frozone:/var/lib/mysql-cluster-ws# ps x | grep ndb
30383 ?        Ss     0:00 /opt/mysql/bin/ndbd -c nemo --bind-address 10.0.12.168
30384 ?        Sl     0:02 /opt/mysql/bin/ndbd -c nemo --bind-address 10.0.12.168
30435 pts/0    S+     0:00 grep ndb
frozone:/var/lib/mysql-cluster-ws# lsof -p 30384 -n | grep IPv4
ndbd    30384 root    4u  IPv4  662188027                TCP 10.0.12.225:59453->10.0.12.146:1186 (ESTABLISHED)
ndbd    30384 root    8u  IPv4  662187858                TCP 10.0.12.168:43369 (LISTEN)
ndbd    30384 root    9u  IPv4  662187860                TCP 10.0.12.168:55196 (LISTEN)
ndbd    30384 root   10u  IPv4  662187862                TCP 10.0.12.168:50450 (LISTEN)
ndbd    30384 root   11u  IPv4  662187864                TCP 10.0.12.168:39031 (LISTEN)
ndbd    30384 root   12u  IPv4  662187866                TCP 10.0.12.168:55418 (LISTEN)
ndbd    30384 root   13u  IPv4  662187868                TCP 10.0.12.168:46324 (LISTEN)
ndbd    30384 root   16u  IPv4  662187871                TCP 10.0.12.168:43369->10.0.12.169:40889 (ESTABLISHED)

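The first line of the lsof output shows that the connection from node 4 (data node) to node 1 (mgm node) now originates from 10.0.12.225 instead of the address the ndbd was bound to (10.0.12.168). The listening sockets and the connection to the other data node still use 10.0.12.168.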
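
The check described above can be automated. The following is a minimal sketch, assuming lsof output in the same column layout as shown in this report; the function name `misbound_connections` and the `SAMPLE` text are only illustrative and not part of any MySQL Cluster tooling.

```python
import re

# Matches the NAME column of an established TCP line in lsof output, e.g.
# "TCP 10.0.12.225:59453->10.0.12.146:1186 (ESTABLISHED)"
_ESTABLISHED = re.compile(
    r"TCP\s+(?P<local>[\d.]+):\d+->(?P<remote>[\d.]+):(?P<rport>\d+)"
    r"\s+\(ESTABLISHED\)"
)


def misbound_connections(lsof_output, bind_address):
    """Return (local, remote, remote_port) for every established TCP
    connection whose local address differs from bind_address."""
    found = []
    for line in lsof_output.splitlines():
        m = _ESTABLISHED.search(line)
        if m and m.group("local") != bind_address:
            found.append(
                (m.group("local"), m.group("remote"), int(m.group("rport")))
            )
    return found


# Three lines copied from the lsof output in this report.
SAMPLE = """\
ndbd    30384 root    4u  IPv4  662188027                TCP 10.0.12.225:59453->10.0.12.146:1186 (ESTABLISHED)
ndbd    30384 root    8u  IPv4  662187858                TCP 10.0.12.168:43369 (LISTEN)
ndbd    30384 root   16u  IPv4  662187871                TCP 10.0.12.168:43369->10.0.12.169:40889 (ESTABLISHED)
"""

if __name__ == "__main__":
    for local, remote, port in misbound_connections(SAMPLE, "10.0.12.168"):
        print(f"{local} -> {remote}:{port} does not use the bound address")
```

With the sample lines, only the connection to 10.0.12.146:1186 (the mgm node) is reported; LISTEN sockets are ignored because they have no remote end.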

This is the config.ini for the cluster (I've replaced hostnames with IP addresses for your convenience):

geoffrey@nemo:/var/lib/mysql-mgmd$ egrep -v '^#' /etc/mysql/ndb_mgmd.cnf

[NDBD DEFAULT]
NoOfReplicas= 2

[NDB_MGMD]
HostName= 10.0.12.146
DataDir= /var/lib/mysql-mgmd
Id= 1

[MYSQLD]
Id= 6
HostName= 10.0.12.146

[MYSQLD]
Id= 2
HostName= 10.0.12.152

[MYSQLD]
Id= 3
HostName= 10.0.12.151

[NDBD]
Id= 4
HostName= 10.0.12.168
DataMemory=1024M
IndexMemory=1700M
DataDir= /var/lib/mysql-cluster-ws
BackupDataDir= /vol/backup/ws-cluster

[NDBD]
Id= 5
HostName= 10.0.12.169
DataMemory=1024M
IndexMemory=1700M
DataDir= /var/lib/mysql-cluster-ws
BackupDataDir= /vol/backup/ws-cluster

[MYSQLD]
Id= 7
HostName= 10.0.12.168

[MYSQLD]
Id= 8
HostName= 10.0.12.169

How to repeat:
- Start ndb_mgmd
- Start ndbd, bound to any IP alias
- Verify the ndbd is using the assigned IP address
- Restart ndb_mgmd
- The ndbd processes have now re-established communication with the MGM node using a different IP address than the one they are bound to
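
The verification steps above can be scripted against the SHOW output of ndb_mgm. This is a rough sketch, assuming the output format shown in this report; the helper names and the `expected` mapping (taken from the HostName entries in the config above) are illustrative only.

```python
import re

# A connected node line in the SHOW output looks like:
# "id=4    @10.0.12.168  (Version: 7.0.13, Nodegroup: 0)"
# Nodes that are not connected have no "@address" and are skipped.
_NODE = re.compile(r"^id=(\d+)\s+@(\S+)")


def connected_addresses(show_output):
    """Map node id -> address for every node SHOW lists as connected."""
    nodes = {}
    for line in show_output.splitlines():
        m = _NODE.match(line.strip())
        if m:
            nodes[int(m.group(1))] = m.group(2)
    return nodes


def wrong_addresses(show_output, expected):
    """Return {node_id: (expected, actual)} for connected nodes whose
    address differs from the HostName configured for them."""
    actual = connected_addresses(show_output)
    return {
        node_id: (addr, actual[node_id])
        for node_id, addr in expected.items()
        if node_id in actual and actual[node_id] != addr
    }


# Data node lines from the SHOW output after the ndb_mgmd restart.
AFTER_RESTART = """\
[ndbd(NDB)]     2 node(s)
id=4    @10.0.12.225  (Version: 7.0.13, Nodegroup: 0)
id=5    @10.0.12.224  (Version: 7.0.13, Nodegroup: 0, Master)
[mysqld(API)]   5 node(s)
id=2 (not connected, accepting connect from jessie)
"""

if __name__ == "__main__":
    expected = {4: "10.0.12.168", 5: "10.0.12.169"}
    for node_id, (want, got) in sorted(
        wrong_addresses(AFTER_RESTART, expected).items()
    ):
        print(f"node {node_id}: expected {want}, connected from {got}")
```

Run before the restart, the same check should report nothing; after the restart it reports both data nodes.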

Suggested fix:
Make sure ndbd always respects the bind-address parameter when it reconnects to the ndb_mgmd.
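
To illustrate the general technique only (this is not the ndbd source, which is C++): an outgoing TCP socket uses a specific source address when it is bound to that address before connecting, and this has to be repeated for every new socket, including the one created on reconnect. A minimal sketch in Python, where `connect_from` and `demo` are hypothetical names and a loopback listener stands in for ndb_mgmd:

```python
import socket


def connect_from(bind_address, host, port, timeout=5.0):
    """Open a TCP connection to (host, port) whose local end is
    bind_address. Port 0 lets the kernel pick the source port."""
    return socket.create_connection(
        (host, port), timeout=timeout, source_address=(bind_address, 0)
    )


def demo():
    """Connect twice (initial connect, then a 'reconnect') and return
    the local address used by each connection."""
    # Loopback listener standing in for the management server.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    port = listener.getsockname()[1]
    sources = []
    try:
        for _ in range(2):
            # The source address is passed on every connect, so the
            # second connection is bound the same way as the first.
            with connect_from("127.0.0.1", "127.0.0.1", port) as conn:
                sources.append(conn.getsockname()[0])
                peer, _peer_addr = listener.accept()
                peer.close()
    finally:
        listener.close()
    return sources


if __name__ == "__main__":
    print(demo())
```

On a host with an IP alias, passing the alias as `bind_address` would make both connections originate from it; the behaviour reported here looks as if that step is skipped when the connection to the management server is re-established.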
[25 Mar 2010 15:03] Geoffrey de Kleijn
Something similar is reported in http://bugs.mysql.com/bug.php?id=22195, but it was never resolved. The comment by Marc - A. Dahlhaus on 8 Dec 2006 15:51 describes what looks like the same issue.
[5 Oct 2016 22:58] MySQL Verification Team
Can't reproduce on any of the modern mccge releases (7.2.25, 7.4.12).