MySQL Bugs: #47136: ndb_mgm> X RESTART does not work on clusters with lots of tables

Bug #47136	ndb_mgm> X RESTART does not work on clusters with lots of tables
Submitted:	4 Sep 2009 13:08	Modified:	19 Jan 2016 13:55
Reporter:	Johan Andersson
Status:	Verified
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S4 (Feature request)
Version:	mysql-5.1-telco-7.0	OS:	Any
Assigned to:		CPU Architecture:	Any
Tags:	7.0.7, NDB_MGM, restart

Description:
(I guess this is really the underlying problem in these bug reports:
http://bugs.mysql.com/bug.php?id=26955
and
http://bugs.mysql.com/bug.php?id=46047 )

* Two data nodes in the cluster (id=3 and 4)
* Two management servers (id=1 and 2)

* Created 2048 tables in the cluster (I wanted to create more, but forgot to raise MaxNoOfOrderedIndexes above 2048):
* MaxNoOfTables=20320
* MaxNoOfOrderedIndexes=2048

Then I wanted to restart node 3:
ndb_mgm> 3 restart

Failure handling of node 3 takes a very long time (4-5 minutes):

2009-09-04 14:54:20 [MgmSrvr] INFO     -- Going to stop node 3
2009-09-04 14:54:20 [MgmSrvr] INFO     -- Node 3: Node shutdown initiated
2009-09-04 14:54:24 [MgmSrvr] INFO     -- Node 3: Data usage is 0%(72 32K pages of total 65536)
2009-09-04 14:54:25 [MgmSrvr] INFO     -- Node 4: Data usage is 0%(72 32K pages of total 65536)
2009-09-04 14:54:31 [MgmSrvr] INFO     -- Node 3: Node shutdown completed, restarting, no start.
2009-09-04 14:54:31 [MgmSrvr] ALERT    -- Node 1: Node 3 Disconnected
2009-09-04 14:54:32 [MgmSrvr] ALERT    -- Node 4: Node 3 Disconnected
2009-09-04 14:54:32 [MgmSrvr] ALERT    -- Node 4: Network partitioning - arbitration required
2009-09-04 14:54:32 [MgmSrvr] INFO     -- Node 4: President restarts arbitration thread [state=7]
2009-09-04 14:54:32 [MgmSrvr] INFO     -- Node 4: Communication to Node 3 closed
2009-09-04 14:54:32 [MgmSrvr] ALERT    -- Node 4: Arbitration won - positive reply from node 1
2009-09-04 14:54:32 [MgmSrvr] INFO     -- Node 4: GCP Take over started
2009-09-04 14:54:32 [MgmSrvr] INFO     -- Node 4: Node 4 taking over as DICT master
2009-09-04 14:54:32 [MgmSrvr] INFO     -- Node 4: GCP Take over completed
2009-09-04 14:54:32 [MgmSrvr] INFO     -- Node 4: kk: 1802/9 0 0
2009-09-04 14:54:32 [MgmSrvr] INFO     -- Node 4: LCP Take over started
2009-09-04 14:54:32 [MgmSrvr] INFO     -- Node 4: ParticipatingDIH = 0000000000000010
2009-09-04 14:54:32 [MgmSrvr] INFO     -- Node 4: ParticipatingLQH = 0000000000000010
2009-09-04 14:54:32 [MgmSrvr] INFO     -- Node 4: m_LCP_COMPLETE_REP_Counter_DIH = [SignalCounter: m_count=0 0000000000000000]
2009-09-04 14:54:32 [MgmSrvr] INFO     -- Node 4: m_LCP_COMPLETE_REP_Counter_LQH = [SignalCounter: m_count=1 0000000000000010]
2009-09-04 14:54:32 [MgmSrvr] INFO     -- Node 4: m_LAST_LCP_FRAG_ORD = [SignalCounter: m_count=0 0000000000000000]
2009-09-04 14:54:32 [MgmSrvr] INFO     -- Node 4: m_LCP_COMPLETE_REP_From_Master_Received = 0
2009-09-04 14:54:32 [MgmSrvr] INFO     -- Node 4: LCP Take over completed (state = 5)
2009-09-04 14:54:32 [MgmSrvr] INFO     -- Node 4: ParticipatingDIH = 0000000000000010
2009-09-04 14:54:32 [MgmSrvr] INFO     -- Node 4: ParticipatingLQH = 0000000000000010
2009-09-04 14:54:32 [MgmSrvr] INFO     -- Node 4: m_LCP_COMPLETE_REP_Counter_DIH = [SignalCounter: m_count=1 0000000000000010]
2009-09-04 14:54:32 [MgmSrvr] INFO     -- Node 4: m_LCP_COMPLETE_REP_Counter_LQH = [SignalCounter: m_count=1 0000000000000010]
2009-09-04 14:54:32 [MgmSrvr] INFO     -- Node 4: m_LAST_LCP_FRAG_ORD = [SignalCounter: m_count=1 0000000000000010]
2009-09-04 14:54:32 [MgmSrvr] INFO     -- Node 4: m_LCP_COMPLETE_REP_From_Master_Received = 0
2009-09-04 14:54:32 [MgmSrvr] INFO     -- Node 4: Started arbitrator node 1 [ticket=2c550002e2239c1e]
2009-09-04 14:55:31 [MgmSrvr] WARNING  -- Node 4: Failure handling of node 3 has not completed in 1 min. - state = 3
2009-09-04 14:56:31 [MgmSrvr] WARNING  -- Node 4: Failure handling of node 3 has not completed in 2 min. - state = 3

On the host with id=3 I run ps -ef | grep ndbmtd:

root     27191     1  0 14:22 ?        00:00:00 /usr/local/mysql//mysql/bin//ndbmtd --ndb-nodeid=3 -c ps-ndb01:1186;ps-ndb02:1186 --initial
root     28947 28828  0 14:56 pts/0    00:00:00 grep ndbmtd

2009-09-04 14:57:32 [MgmSrvr] WARNING  -- Node 4: Failure handling of node 3 has not completed in 3 min. - state = 3

2009-09-04 14:58:32 [MgmSrvr] WARNING  -- Node 4: Failure handling of node 3 has not completed in 4 min. - state = 3

2009-09-04 14:58:59 [MgmSrvr] INFO     -- Mgmt server state: nodeid 3 reserved for ip 192.9.73.15, m_reserved_nodes 000000000000000000000000000000000000000000000000000000000000078a.
2009-09-04 14:59:00 [MgmSrvr] INFO     -- Mgmt server state: nodeid 3 freed, m_reserved_nodes 0000000000000000000000000000000000000000000000000000000000000782.
2009-09-04 14:59:00 [MgmSrvr] INFO     -- Node 4: Communication to Node 3 opened

Because it has taken so long, node 3, which I wanted to restart, has now completely given up:
[root@ps-ndb05 ~]# ps -ef |grep ndbmtd
root     29398 28828  0 15:02 pts/0    00:00:00 grep ndbmtd

Thus, the restart failed.
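
To put a number on the delay, the timestamps in the cluster log can be compared directly. The following is a minimal sketch in Python, not part of the original report; it assumes that the interval between the "Node shutdown completed" message and the "Communication to Node 3 opened" message is a reasonable approximation of the failure handling time. The function names are hypothetical, and the sample lines are copied from the log above.

```python
from datetime import datetime

# Sample lines copied from the cluster log in this report.
LOG = """\
2009-09-04 14:54:31 [MgmSrvr] INFO     -- Node 3: Node shutdown completed, restarting, no start.
2009-09-04 14:55:31 [MgmSrvr] WARNING  -- Node 4: Failure handling of node 3 has not completed in 1 min. - state = 3
2009-09-04 14:58:32 [MgmSrvr] WARNING  -- Node 4: Failure handling of node 3 has not completed in 4 min. - state = 3
2009-09-04 14:59:00 [MgmSrvr] INFO     -- Node 4: Communication to Node 3 opened
"""


def timestamp(line):
    """Parse the leading 'YYYY-MM-DD HH:MM:SS' timestamp of a log line."""
    return datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")


def failure_handling_seconds(log, node_id):
    """Seconds between shutdown of node_id and communication being reopened.

    Returns None if either message is missing from the log.
    """
    start = end = None
    for line in log.splitlines():
        if f"Node {node_id}: Node shutdown completed" in line:
            start = timestamp(line)
        elif f"Communication to Node {node_id} opened" in line and start is not None:
            end = timestamp(line)
            break
    if start is None or end is None:
        return None
    return (end - start).total_seconds()


print(failure_handling_seconds(LOG, 3))  # 269.0 seconds, i.e. about 4.5 minutes
```

This matches the 4-5 minutes observed above.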

How to repeat:
Create a lot of tables (~2048) so that node failure handling takes a long time to complete.

Restart one of the data nodes with ndb_mgm> x restart, where x is the node ID of the data node.
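
One way to create that many tables is to generate the statements with a small script and pipe them into the mysql client. This is only a sketch: the table names, the column layout, and the target database are assumptions for illustration and are not taken from the original setup.

```python
def create_table_statements(count, prefix="t"):
    """Yield one CREATE TABLE statement per table (hypothetical schema)."""
    for i in range(count):
        yield (
            f"CREATE TABLE {prefix}{i} "
            "(id INT NOT NULL PRIMARY KEY, val INT) ENGINE=NDBCLUSTER;"
        )


if __name__ == "__main__":
    # 2048 matches the number of tables used in this report.
    for statement in create_table_statements(2048):
        print(statement)
```

For example, saving the script as gen_tables.py (a hypothetical file name) and running python gen_tables.py | mysql test would create the tables in the test database.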

Suggested fix:
It would be great if the data node did not time out.

Hi,
Perhaps it would be possible to add to the data nodes:

--retry-time
--retry-count

That is, the data node should try to connect to the management server up to --retry-count times, waiting --retry-time between the retries.
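
The intended behaviour could look roughly like the sketch below. It is written in Python purely for illustration and is not how the data node is implemented; connect is a hypothetical callable standing in for the connection attempt to the management server, and retry_count and retry_time mirror the two proposed options.

```python
import time


def connect_with_retries(connect, retry_count, retry_time, sleep=time.sleep):
    """Call connect() until it succeeds or retry_count attempts are used up.

    connect is a hypothetical callable that raises ConnectionError on failure.
    retry_time is the pause in seconds between two attempts.
    """
    last_error = None
    for attempt in range(1, retry_count + 1):
        try:
            return connect()
        except ConnectionError as exc:
            last_error = exc
            # Do not wait after the final attempt; just give up.
            if attempt < retry_count:
                sleep(retry_time)
    raise ConnectionError(f"gave up after {retry_count} attempts") from last_error
```

With, say, a retry count of 12 and a retry time of 30 seconds, the node would keep trying for about five and a half minutes, long enough to cover the 4-5 minutes of failure handling seen in this report.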

Verified as still relevant; set as a feature request.

all best
Bogdan Kecman