MySQL Bugs: #55203: NDB node keep starting state for long time after changes on config.ini

Bug #55203	NDB node keep starting state for long time after changes on config.ini
Submitted:	13 Jul 2010 6:46	Modified:	16 Sep 2010 11:13
Reporter:	Jabba Jabba	Email Updates:
Status:	No Feedback	Impact on me:	None
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S1 (Critical)
Version:	mysql-5.1-telco-7.1	OS:	Linux (Red Hat Enterprise Linux Server release 5.4)
Assigned to:		CPU Architecture:	Any
Tags:	7.1.4b

Description:
We are using mysql cluster 7.1.4b having following configuration under Red Hat Enterprise Linux Server release 5.4: 

Management Node: 

[ndbd default] 
NoOfReplicas=2 # Number of replicas 
DataMemory=1536M 
IndexMemory=100M 
RedoBuffer=256M 
[tcp default] 

[ndb_mgmd] 
hostname=10.0.104.33 
datadir=/var/lib/mysql-cluster 

[ndbd] 
hostname=10.0.104.22 
datadir=/usr/local/mysql/data 
[ndbd] 
hostname=10.0.104.23 
datadir=/usr/local/mysql/data 
[mysqld] 
[mysqld] 

My.cnf: 
[mysqld] 
ndbcluster 
ndb-connectstring=10.0.104.33 
[mysql_cluster] 
ndb-connectstring=10.0.104.33 

------------------------------------------------------ 
When we start the ndbd, it keeps on the starting status:[ndbd(NDB)] 2 node(s) 
id=2 @10.0.104.22 (mysql-5.1.44 ndb-7.1.4, starting, Nodegroup: 0) 
id=3 (not connected, accepting connect from 10.0.104.23) 

[ndb_mgmd(MGM)] 1 node(s) 
id=1 @10.0.104.33 (mysql-5.1.44 ndb-7.1.4) 

[mysqld(API)] 2 node(s) 
id=4 (not connected, accepting connect from any host) 
id=5 (not connected, accepting connect from any host) 

Checking the cluster log, found the following message: 
2010-07-12 22:19:11 [MgmtSrvr] INFO -- Node 2: Initial start, waiting for 3 to connect, nodes [ all: 2 and 3 connected: 2 no-wait: ] 
2010-07-12 22:19:14 [MgmtSrvr] INFO -- Node 2: Initial start, waiting for 3 to connect, nodes [ all: 2 and 3 connected: 2 no-wait: ] 
2010-07-12 22:19:17 [MgmtSrvr] INFO -- Node 2: Initial start, waiting for 3 to connect, nodes [ all: 2 and 3 connected: 2 no-wait: ] 
2010-07-12 22:19:20 [MgmtSrvr] INFO -- Node 2: Initial start, waiting for 3 to connect, nodes [ all: 2 and 3 connected: 2 no-wait: ] 
2010-07-12 22:19:23 [MgmtSrvr] INFO -- Node 2: Initial start, waiting for 3 to connect, nodes [ all: 2 and 3 connected: 2 no-wait: ] 
2010-07-12 22:19:26 [MgmtSrvr] INFO -- Node 2: Initial start, waiting for 3 to connect, nodes [ all: 2 and 3 connected: 2 no-wait: ] 
2010-07-12 22:19:29 [MgmtSrvr] INFO -- Node 2: Initial start, waiting for 3 to connect, nodes [ all: 2 and 3 connected: 2 no-wait: ] 
2010-07-12 22:19:32 [MgmtSrvr] INFO -- Node 2: Initial start, waiting for 3 to connect, nodes [ all: 2 and 3 connected: 2 no-wait: ] 
2010-07-12 22:19:35 [MgmtSrvr] INFO -- Node 2: Initial start, waiting for 3 to connect, nodes [ all: 2 and 3 connected: 2 no-wait: ] 
2010-07-12 22:19:38 [MgmtSrvr] INFO -- Node 2: Initial start, waiting for 3 to connect, nodes [ all: 2 and 3 connected: 2 no-wait: ] 

[gcdldapro@agcdtldbr04 mysql-cluster]$ vi ndb_1_cluster.log 
2010-07-12 21:49:59 [MgmtSrvr] INFO -- Node 2: Initial start, waiting for 3 to connect, nodes [ all: 2 and 3 connected: 2 no-wait: ] 
2010-07-12 21:50:00 [MgmtSrvr] INFO -- Node 3: Initial start, waiting for 2 to connect, nodes [ all: 2 and 3 connected: 3 no-wait: ] 
2010-07-12 21:50:02 [MgmtSrvr] INFO -- Node 2: Initial start, waiting for 3 to connect, nodes [ all: 2 and 3 connected: 2 no-wait: ] 
2010-07-12 21:50:03 [MgmtSrvr] INFO -- Node 3: Initial start, waiting for 2 to connect, nodes [ all: 2 and 3 connected: 3 no-wait: ] 
2010-07-12 21:50:05 [MgmtSrvr] INFO -- Node 2: Initial start, waiting for 3 to connect, nodes [ all: 2 and 3 connected: 2 no-wait: ] 
2010-07-12 21:50:07 [MgmtSrvr] INFO -- Node 3: Initial start, waiting for 2 to connect, nodes [ all: 2 and 3 connected: 3 no-wait: ] 
2010-07-12 21:50:08 [MgmtSrvr] INFO -- Node 2: Initial start, waiting for 3 to connect, nodes [ all: 2 and 3 connected: 2 no-wait: ] 
2010-07-12 21:50:10 [MgmtSrvr] INFO -- Node 3: Initial start, waiting for 2 to connect, nodes [ all: 2 and 3 connected: 3 no-wait: ] 

For NDB log: 
2010-07-12 21:54:22 [ndbd] INFO -- NDB Cluster -- DB node 2 
2010-07-12 21:54:22 [ndbd] INFO -- mysql-5.1.44 ndb-7.1.4b -- 
2010-07-12 21:54:22 [ndbd] INFO -- Ndbd_mem_manager::init(1) min: 1896Mb initial: 1916Mb 
Adding 1917Mb to ZONE_LO (1,61319) 
2010-07-12 21:54:24 [ndbd] INFO -- Start initiated (mysql-5.1.44 ndb-7.1.4) 
WARNING: timerHandlingLab now: 38480895 sent: 38480810 diff: 85 
NDBFS/AsyncFile: Allocating 310256 for In/Deflate buffer 
NDBFS/AsyncFile: Allocating 310256 for In/Deflate buffer 
NDBFS/AsyncFile: Allocating 310256 for In/Deflate buffer 
NDBFS/AsyncFile: Allocating 310256 for In/Deflate buffer 
NDBFS/AsyncFile: Allocating 310256 for In/Deflate buffer 
NDBFS/AsyncFile: Allocating 310256 for In/Deflate buffer 
NDBFS/AsyncFile: Allocating 310256 for In/Deflate buffer 
NDBFS/AsyncFile: Allocating 310256 for In/Deflate buffer 
NDBFS/AsyncFile: Allocating 310256 for In/Deflate buffer 
NDBFS/AsyncFile: Allocating 310256 for In/Deflate buffer 
NDBFS/AsyncFile: Allocating 310256 for In/Deflate buffer 
NDBFS/AsyncFile: Allocating 310256 for In/Deflate buffer 
NDBFS/AsyncFile: Allocating 310256 for In/Deflate buffer 
NDBFS/AsyncFile: Allocating 310256 for In/Deflate buffer 
NDBFS/AsyncFile: Allocating 310256 for In/Deflate buffer 
NDBFS/AsyncFile: Allocating 310256 for In/Deflate buffer 
NDBFS/AsyncFile: Allocating 310256 for In/Deflate buffer 
NDBFS/AsyncFile: Allocating 310256 for In/Deflate buffer 
NDBFS/AsyncFile: Allocating 310256 for In/Deflate buffer 
NDBFS/AsyncFile: Allocating 310256 for In/Deflate buffer 
WOPool::init(61, 9) 
RWPool::init(22, 13) 
RWPool::init(42, 18) 
RWPool::init(62, 13) 
Using 1 fragments per node 
WARNING: timerHandlingLab now: 38481065 sent: 38481015 diff: 50 
RWPool::init(c2, 18) 
RWPool::init(e2, 14) 
WOPool::init(41, 7) 
RWPool::init(82, 12) 
RWPool::init(a2, 53) 
WOPool::init(21, 6) 
2010-07-12 21:54:24 [ndbd] INFO -- Start phase 0 completed 

P.S. There is no ERROR message found from the log file 

Our NDB environment cannot work right now

How to repeat:
Modify the management node's config.ini and add the following two parameters:
StartPartialTimeout=30000
StartPartitionedTimeout=60000

Restart management node with new config and restart ndbd node will hit this problem. Even if we remove the two parameters and restart again, problem still occur.

We have performed further testing, no matter we have change any parameter in the config.ini and then execute ndb_mgmd -f /var/lib/mysql-cluster/config.ini --reload

The problem does occur. If we remove the ndb_1_config.bin.2 and use the ndb_1_config.bin.1 for starting the ndbd. It back to normal state. 

Actually what is the root cause for the ndbd node keep waiting for the other node?
2010-07-13 16:42:21 [MgmtSrvr] INFO     -- Node 2: Initial start, waiting for 3 to connect,  nodes [ all: 2 and 3 connected: 2 no-wait:  ]
2010-07-13 16:42:24 [MgmtSrvr] INFO     -- Node 2: Initial start, waiting for 3 to connect,  nodes [ all: 2 and 3 connected: 2 no-wait:  ]
2010-07-13 16:42:27 [MgmtSrvr] INFO     -- Node 2: Initial start, waiting for 3 to connect,  nodes [ all: 2 and 3 connected: 2 no-wait:  ]
2010-07-13 16:42:30 [MgmtSrvr] INFO     -- Node 2: Initial start, waiting for 3 to connect,  nodes [ all: 2 and 3 connected: 2 no-wait:  ]
2010-07-13 16:42:33 [MgmtSrvr] INFO     -- Node 2: Initial start, waiting for 3 to connect,  nodes [ all: 2 and 3 connected: 2 no-wait:  ]

For normal startup, it will shows:
2010-07-11 16:00:52 [MgmtSrvr] INFO     -- Node 2: Waiting 30 sec for nodes 3 to connect, nodes [ all: 2 and 3 connected: 2 no-wait:  ]
2010-07-11 16:00:55 [MgmtSrvr] INFO     -- Node 2: Waiting 27 sec for nodes 3 to connect, nodes [ all: 2 and 3 connected: 2 no-wait:  ]
2010-07-11 16:00:58 [MgmtSrvr] INFO     -- Node 2: Waiting 24 sec for nodes 3 to connect, nodes [ all: 2 and 3 connected: 2 no-wait:  ]
2010-07-11 16:01:01 [MgmtSrvr] INFO     -- Node 2: Waiting 21 sec for nodes 3 to connect, nodes [ all: 2 and 3 connected: 2 no-wait:  ]
2010-07-11 16:01:04 [MgmtSrvr] INFO     -- Node 2: Waiting 18 sec for nodes 3 to connect, nodes [ all: 2 and 3 connected: 2 no-wait:  ]
2010-07-11 16:01:07 [MgmtSrvr] INFO     -- Node 2: Waiting 15 sec for nodes 3 to connect, nodes [ all: 2 and 3 connected: 2 no-wait:  ]
2010-07-11 16:01:10 [MgmtSrvr] INFO     -- Node 2: Waiting 12 sec for nodes 3 to connect, nodes [ all: 2 and 3 connected: 2 no-wait:  ]
2010-07-11 16:01:13 [MgmtSrvr] INFO     -- Node 2: Waiting 9 sec for nodes 3 to connect, nodes [ all: 2 and 3 connected: 2 no-wait:  ]

How exactly do you start the data node(s)?

The log snippets you provided indicate that you are
using "ndbd --initial", and with --initial neither 
StartPartialTimeout nor StartPartitionedTimeout
take any effect, they are treated as if they were
set to zero and nodes will not start after timeout
but will wait for all other nodes to show up
indefinitely.

No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".

I have the same experience. Does anyone knows how to fix this? Thanks.

Version: mysql-5.7.18 ndb-7.5.6
OS: CentOS Linux release 7.3.1611

MGM Node:
[ndb_mgm display]
ndb_mgm> show
Cluster Configuration
---------------------
[ndbd(NDB)]     2 node(s)
id=2    @192.168.50.53  (mysql-5.7.18 ndb-7.5.6, starting, Nodegroup: 0)
id=3    @192.168.50.55  (mysql-5.7.18 ndb-7.5.6, Nodegroup: 0)

[ndb_mgmd(MGM)] 1 node(s)
id=1    @192.168.50.51  (mysql-5.7.18 ndb-7.5.6)

[mysqld(API)]   2 node(s)
id=4    @192.168.50.52  (mysql-5.7.18 ndb-7.5.6)
id=5    @192.168.50.54  (mysql-5.7.18 ndb-7.5.6)

[logfile: /var/lib/mysql-cluster/ndb_1_cluster.log]
2017-05-09 20:07:50 [MgmtSrvr] ALERT    -- Node 2: Forced node shutdown completed. Occured during startphase 0. Initiated by signal 9.
2017-05-09 20:07:50 [MgmtSrvr] ALERT    -- Node 1: Node 2 Disconnected
2017-05-09 20:07:52 [MgmtSrvr] INFO     -- Node 3: NR Status: node=2,OLD=Allocated node id,NEW=Allocated node id
2017-05-09 20:07:52 [MgmtSrvr] INFO     -- Alloc node id 2 succeeded
2017-05-09 20:07:52 [MgmtSrvr] INFO     -- Nodeid 2 allocated for NDB at 192.168.50.53
2017-05-09 20:07:52 [MgmtSrvr] INFO     -- Node 1: Node 2 Connected
2017-05-09 20:07:56 [MgmtSrvr] INFO     -- Node 2: Initial start, waiting for 3 to connect,  nodes [ all: 2 and 3 connected: 2 no-wait:  ]
2017-05-09 20:07:59 [MgmtSrvr] INFO     -- Node 2: Initial start, waiting for 3 to connect,  nodes [ all: 2 and 3 connected: 2 no-wait:  ]
2017-05-09 20:08:02 [MgmtSrvr] INFO     -- Node 2: Initial start, waiting for 3 to connect,  nodes [ all: 2 and 3 connected: 2 no-wait:  ]
2017-05-09 20:08:05 [MgmtSrvr] INFO     -- Node 2: Initial start, waiting for 3 to connect,  nodes [ all: 2 and 3 connected: 2 no-wait:  ]
2017-05-09 20:08:08 [MgmtSrvr] INFO     -- Node 2: Initial start, waiting for 3 to connect,  nodes [ all: 2 and 3 connected: 2 no-wait:  ]
2017-05-09 20:08:11 [MgmtSrvr] INFO     -- Node 2: Initial start, waiting for 3 to connect,  nodes [ all: 2 and 3 connected: 2 no-wait:  ]