Bug #81952 cluster hangs in start phase 2
Submitted: 21 Jun 2016 11:45 Modified: 21 Jun 2016 21:14
Reporter: Christian Tölle Email Updates:
Status: Not a Bug Impact on me:
None 
Category:MySQL Cluster: NDB API Severity:S1 (Critical)
Version: OS:Ubuntu (Ubuntu 14.04.4 LTS)
Assigned to: MySQL Verification Team CPU Architecture:Any

[21 Jun 2016 11:45] Christian Tölle
Description:
tested with
mysql-cluster-gpl-7.3.13-debian7-x86_64.deb  mysql-cluster-gpl-7.4.11-debian6.0-x86_64.deb  mysql-cluster-gpl-7.4.11-debian7-x86_64.deb

- Maschine has 4 cores and is hyperthreaded to 8 cores.

In a simple Testsetup in one host the cluster does't comes out of phase 2 for around 20 hours. (started with --initial)

After that, the ndb_x_fs folders are each 210 GB big.

The ndbd process writes all the time with about 8MB/s to disc. 

the event log is filled with entry's like this.

2016-06-20 22:26:06 Node 3: Operations=0
2016-06-20 22:26:06 Node 2: Operations=0
2016-06-20 22:26:11 Node 3: Operations=0
2016-06-20 22:26:11 Node 2: Operations=0
2016-06-20 22:26:16 Node 3: Operations=0
2016-06-20 22:26:16 Node 2: Operations=0
2016-06-20 22:26:21 Node 3: Operations=0
2016-06-20 22:26:21 Node 2: Operations=0
2016-06-20 22:26:26 Node 3: Operations=0
2016-06-20 22:26:26 Node 2: Operations=0
2016-06-20 22:26:31 Node 3: Operations=0
2016-06-20 22:26:31 Node 2: Operations=0
2016-06-20 22:26:32 Node 2: Mean loop Counter in doJob last 8192 times = 6
2016-06-20 22:26:35 Node 2: 8192 loops,tot 48634297 usec,exec 11 extra:loops = 573,time 7,const 50
2016-06-20 22:26:36 Node 3: Operations=0

ndb_mgmd log

2016-06-20 13:57:38 [MgmtSrvr] INFO -- Node 2: Node 3: API mysql-5.6.29 ndb-7.4.11
2016-06-20 13:57:38 [MgmtSrvr] INFO -- Node 3: Node 2: API mysql-5.6.29 ndb-7.4.11
2016-06-20 13:57:38 [MgmtSrvr] INFO -- Node 3: Start phase 1 completed
2016-06-20 13:57:38 [MgmtSrvr] INFO -- Node 2: Start phase 1 completed
2016-06-20 13:57:38 [MgmtSrvr] INFO -- Node 2: System Restart: master node: 2, num starting: 2, gci: 0
2016-06-20 13:57:38 [MgmtSrvr] INFO -- Node 2: CNTR_START_CONF: started: 0000000000000000
2016-06-20 13:57:38 [MgmtSrvr] INFO -- Node 2: CNTR_START_CONF: starting: 000000000000000c
2016-06-20 13:57:38 [MgmtSrvr] INFO -- Node 2: Local redo log file initialization status:
#Total files: 400000, Completed: 0
#Total MBytes: 6400000, Completed: 0
2016-06-20 13:57:38 [MgmtSrvr] INFO -- Node 3: Local redo log file initialization status:
#Total files: 400000, Completed: 0
#Total MBytes: 6400000, Completed: 0

ndb2 log

2016-06-20 13:57:32 [ndbd] INFO -- Clearing filesystem in initial start
2016-06-20 13:57:32 [ndbd] INFO -- Start phase 0 completed
2016-06-20 13:57:32 [ndbd] INFO -- Phase 0 has made some file system initialisations
2016-06-20 13:57:32 [ndbd] INFO -- Starting QMGR phase 1
2016-06-20 13:57:32 [ndbd] INFO -- DIH reported initial start, now starting the Node Inclusion Protocol
2016-06-20 13:57:38 [ndbd] INFO -- findNeighbours from: 2248 old (left: 65535 right: 65535) new (3 3)
2016-06-20 13:57:38 [ndbd] INFO -- Include node protocol completed, phase 1 in QMGR completed
2016-06-20 13:57:38 [ndbd] INFO -- Start phase 1 completed
2016-06-20 13:57:38 [ndbd] INFO -- Phase 1 initialised some variables and included node in cluster, locked memory if configured to do so
2016-06-20 13:57:38 [ndbd] INFO -- Asking master node to accept our start (we are master, GCI = 0)
2016-06-20 13:57:38 [ndbd] INFO -- NDBCNTR master accepted us into cluster, start NDB start phase 1
2016-06-20 13:57:38 [ndbd] INFO -- We are performing initial start of cluster
2016-06-20 13:57:38 [ndbd] INFO -- LDM(0): Starting REDO log initialisation

ndb3 log

2016-06-20 13:57:35 [ndbd] INFO -- Clearing filesystem in initial start
2016-06-20 13:57:35 [ndbd] INFO -- Start phase 0 completed
2016-06-20 13:57:35 [ndbd] INFO -- Phase 0 has made some file system initialisations
2016-06-20 13:57:35 [ndbd] INFO -- Starting QMGR phase 1
2016-06-20 13:57:35 [ndbd] INFO -- DIH reported initial start, now starting the Node Inclusion Protocol
2016-06-20 13:57:38 [ndbd] INFO -- findNeighbours from: 2336 old (left: 65535 right: 65535) new (2 2)
2016-06-20 13:57:38 [ndbd] INFO -- Include node protocol completed, phase 1 in QMGR completed
2016-06-20 13:57:38 [ndbd] INFO -- Start phase 1 completed
2016-06-20 13:57:38 [ndbd] INFO -- Phase 1 initialised some variables and included node in cluster, locked memory if configured to do so
2016-06-20 13:57:38 [ndbd] INFO -- Asking master node to accept our start (nodeId = 2 is master), GCI = 0
2016-06-20 13:57:38 [ndbd] INFO -- NDBCNTR master accepted us into cluster, start NDB start phase 1
2016-06-20 13:57:38 [ndbd] INFO -- We are performing initial start of cluster
2016-06-20 13:57:38 [ndbd] INFO -- LDM(0): Starting REDO log initialisation

config:

[NDBD DEFAULT]

NoOfReplicas = 2
MaxNoOfAttributes = 40000
MaxNoOfTables = 1000
DataMemory = 8GB # 80M is default
IndexMemory = 512M # 18M is default
DataDir = /var/mysql-cluster
NoOfFragmentLogFiles = 100000
#LockPagesInMainMemory = 1 # Make sure not to use swap

[NDB_MGMD DEFAULT]
DataDir = /var/mysql-cluster

[NDB_MGMD]
NodeId=1
HostName = localhost

[NDBD]
NodeId = 2
HostName = localhost
MaxNoOfOrderedIndexes =512
MaxNoOfConcurrentOperations=2000000

[NDBD]
NodeId = 3
HostName = localhost
MaxNoOfOrderedIndexes =512
MaxNoOfConcurrentOperations=2000000

[MYSQLD]
NodeId = 4
[MYSQLD]
NodeId = 5
[API]

How to repeat:
install 

./ndb_mgmd --reload -f /etc/mysql/cluster.ini
./ndbd --initial
./ndbd --initial

Suggested fix:
no idea
[21 Jun 2016 11:55] Christian Tölle
i also try'd to completly remove the ndb_X_fs folders. also with no sucesss..
[21 Jun 2016 20:53] Christian Tölle
can be closed! it's not a bug.. 

NoOfFragmentLogFiles = 100000 
was set a way much to high..