Bug #81506 ndbd stays in "starting" state after I run it with "--initial"
Submitted: 19 May 2016 12:08    Modified: 24 May 2016 19:34
Reporter: Jingyu Hu
Status: Not a Bug
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S1 (Critical)
Version: mysql-cluster-gpl-7.4.11-linux-glibc2.5-
OS: CentOS (7.0)
Assigned to: MySQL Verification Team
CPU Architecture: Any
Tags: ndbd, starting

[19 May 2016 12:08] Jingyu Hu
Description:
I want to update "/var/lib/mysql-cluster/config.ini" to change the following options:
FragmentLogFileSize=256M
RedoBuffer=128M
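
For reference, a minimal sketch of where these options would sit in config.ini, assuming they are applied cluster-wide under the [ndbd default] section (only the two values above come from this report; everything else is illustrative):

   # /var/lib/mysql-cluster/config.ini (excerpt, illustrative)
   [ndbd default]
   NoOfReplicas=2            # assumed; matches the two data nodes in one nodegroup
   FragmentLogFileSize=256M  # changing this parameter requires an initial node restart
   RedoBuffer=128M           # a plain node restart is enough for this one

FragmentLogFileSize is one of the parameters whose change only takes effect after an initial node restart, which is presumably why --initial is used in the steps below.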

So I performed the following steps:
1. stop one ndbd node, whose node id is 2
2. stop the management node, whose node id is 1
3. run "/usr/local/mysql/bin/ndb_mgmd --config-file /var/lib/mysql-cluster/config.ini --reload --initial"
4. log in to the server hosting the ndbd node with node id 2
5. run "/usr/local/mysql/bin/ndbd --initial"
6. log in to the management node's server and run "/usr/local/mysql/bin/ndb_mgm"
7. check the node status with:
   ndb_mgm> show
8. the ndbd node stays in the "starting" state (a non-interactive way to check its start phase is sketched after the output below):
----------------------------------------------
-- NDB Cluster -- Management Client --
ndb_mgm> show
Connected to Management Server at: localhost:1186
Cluster Configuration
---------------------
[ndbd(NDB)]     2 node(s)
id=2    @10.174.9.212  (mysql-5.6.29 ndb-7.4.11, starting, Nodegroup: 0)
id=3    @10.45.149.44  (mysql-5.6.29 ndb-7.4.11, Nodegroup: 0, *)

[ndb_mgmd(MGM)] 1 node(s)
id=1    @10.174.8.175  (mysql-5.6.29 ndb-7.4.11)

[mysqld(API)]   2 node(s)
id=4    @10.174.9.248  (mysql-5.6.29 ndb-7.4.11)
id=5    @10.174.9.185  (mysql-5.6.29 ndb-7.4.11)
---------------------
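
For reference, the same start-phase information can be read non-interactively from the shell, assuming the standard ndb_mgm client options (node id 2 matches the report):

   # ask the management server which start phase node 2 has reached
   /usr/local/mysql/bin/ndb_mgm -e "2 STATUS"
   # or report on all nodes at once
   /usr/local/mysql/bin/ndb_mgm -e "ALL STATUS"

While a node is starting, these commands print the last completed start phase, which makes it easier to tell whether the restart is stuck or just progressing slowly.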

BTW:
1. this issue also happened with ndb-7.4.8
2. I could not resolve it on ndb-7.4.8, so I upgraded to ndb-7.4.11
3. the issue still happens with ndb-7.4.11

I have attached an ndb_error_reporter archive to this bug; please help me resolve this issue.
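
For reference, a sketch of how such an archive is typically produced, assuming ndb_error_reporter is run on the management host against the same config file (the exact invocation used here is not stated):

   # gathers cluster logs, node logs and trace files into one tar.bz2 archive
   /usr/local/mysql/bin/ndb_error_reporter /var/lib/mysql-cluster/config.ini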

Thank you very much.

How to repeat:
Same steps as in the Description above: change FragmentLogFileSize and RedoBuffer in config.ini, restart the management node with --reload --initial, restart data node 2 with --initial, and observe in ndb_mgm "show" that node 2 stays in the "starting" state.
[19 May 2016 12:11] Jingyu Hu
The files collected by ndb_error_reporter

Attachment: ndb_error_report_20160519200515.tar.bz2 (application/octet-stream, text), 437.12 KiB.

[24 May 2016 17:13] MySQL Verification Team
Hi,
how big is your database?
what are the IO, network and CPU usage on this host?
is there traffic on the cluster while you are doing this?
do you have long-running transactions? A long-running transaction can prevent this node from coming up. (A sketch of commands for collecting this information follows below.)

These are just some basic questions before I go through the uploaded logs. I will get back to you after I look at them.
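
For reference, a minimal sketch of commands that could collect the information asked for above, assuming standard Linux tools and the ndbinfo schema shipped with this NDB version (paths follow the report; the ndbinfo column names should be checked against the installed version):

   # data node memory usage, a rough proxy for database size
   /usr/local/mysql/bin/ndb_mgm -e "ALL REPORT MEMORYUSAGE"
   # io and cpu usage on the data node host
   iostat -x 5
   vmstat 5
   # long-running transactions, queried from any connected SQL node
   mysql -e "SELECT node_id, state, inactive_seconds FROM ndbinfo.cluster_transactions ORDER BY inactive_seconds DESC LIMIT 10;"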

kind regards
Bogdan Kecman
[24 May 2016 19:34] MySQL Verification Team
Hi,
I don't see anything wrong in the logs. It looks like your database is simply large, so it takes time to copy the data from the other node and then rebuild all the indexes:

2016-05-19 19:42:35 [MgmtSrvr] INFO     -- Node 2: Start phase 2 completed (initial node restart)
2016-05-19 19:42:36 [MgmtSrvr] INFO     -- Node 2: Start phase 3 completed (initial node restart)
2016-05-19 19:42:36 [MgmtSrvr] INFO     -- Node 2: Start phase 4 completed (initial node restart)
2016-05-19 19:42:38 [MgmtSrvr] INFO     -- Node 2: Starting to restore schema
2016-05-19 19:42:40 [MgmtSrvr] INFO     -- Node 2: Restore of schema complete
2016-05-19 19:42:40 [MgmtSrvr] INFO     -- Node 2: DICT: activate index 6 done (sys/def/5/ndb_index_stat_sample_x1)
...
2016-05-19 19:42:43 [MgmtSrvr] INFO     -- Node 2: DICT: activate index 219 done (sys/def/218/PRIMARY)
2016-05-19 19:42:43 [MgmtSrvr] INFO     -- Node 3: Node restart completed copy of dictionary information to Node 2
2016-05-19 19:42:43 [MgmtSrvr] INFO     -- Node 3: NR Status: node=2,OLD=Copy meta data to start node,NEW=Include node in LCP/GCP protocols
2016-05-19 19:42:43 [MgmtSrvr] INFO     -- Node 3: NR Status: node=2,OLD=Include node in LCP/GCP protocols,NEW=Restore fragments ongoing
2016-05-19 19:42:43 [MgmtSrvr] INFO     -- Node 2: Node restart starting to copy the fragments to Node 2
2016-05-19 19:42:43 [MgmtSrvr] INFO     -- Node 2: Restore Database Off-line Starting on node 2
2016-05-19 19:42:43 [MgmtSrvr] INFO     -- Node 2: Node: 2 StartLog: [GCI Keep: 5247602 LastCompleted: 5247602 NewestRestorable: 5300604]
2016-05-19 19:42:43 [MgmtSrvr] INFO     -- Node 2: LDM instance 0: Restored LCP : 0 fragments, 0 rows, 0 millis, 0 rows/s
2016-05-19 19:42:43 [MgmtSrvr] INFO     -- Node 3: NR Status: node=2,OLD=Restore fragments ongoing,NEW=Undo Disk data ongoing
2016-05-19 19:42:43 [MgmtSrvr] INFO     -- Node 2: Restore Database Off-line Completed on node 2
2016-05-19 19:42:43 [MgmtSrvr] INFO     -- Node 2: Bring Database On-line Starting on node 2

So according to the logs everything is fine; it just takes a while for the node to come back up.
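
For reference, a sketch of one way to watch such a node restart from an SQL node, assuming the ndbinfo.restart_info table introduced in NDB 7.4 is available:

   # progress of the ongoing (or most recent) node restart, one row per restarting data node
   mysql -e "SELECT node_id, node_restart_status FROM ndbinfo.restart_info;"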

all best
Bogdan Kecman