Bug #32023 ndb_mgmd is slow to respond when no nodes are up
Submitted: 1 Nov 2007 9:51 Modified: 7 Feb 2008 8:32
Reporter: Magnus Blåudd
Status: Closed
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S3 (Non-critical)
Version:
OS: Any
Assigned to: Magnus Blåudd
CPU Architecture: Any

[1 Nov 2007 9:51] Magnus Blåudd
Description:
The management server is slow to respond to the SHOW command when no nodes are up yet. This is because it sends a heartbeat signal to zero nodes and then waits up to 1 second for all of them to respond! Since no nodes are up, the heartbeat has not been sent to any node, so there is no need to wait for any of them.

How to repeat:
With 2 nodes up:
msvensson@pilot:~/mysql/mysql-5.0-maint/mysql-test$ time ../ndb/src/mgmclient/ndb_mgm --connect-string=localhost:10125 -e show
Connected to Management Server at: localhost:10125
Cluster Configuration
---------------------
[ndbd(NDB)]     2 node(s)
id=1    @127.0.0.1  (Version: 5.0.52, Nodegroup: 0, Master)
id=2    @127.0.0.1  (Version: 5.0.52, Nodegroup: 0)

[ndb_mgmd(MGM)] 1 node(s)
id=3    @127.0.0.1  (Version: 5.0.52)

[mysqld(API)]   4 node(s)
id=4    @127.0.01  (Version: 5.0.52)
id=5    @127.0.0.1  (Version: 5.0.52)
id=6 (not connected, accepting connect from any host)
id=7 (not connected, accepting connect from any host)

real    0m0.218s
user    0m0.000s
sys     0m0.004s

ndb_mgm>1 stop
ndb_mgm>2 stop -A

msvensson@pilot:~/mysql/mysql-5.0-maint/mysql-test$ time ../ndb/src/mgmclient/ndb_mgm --connect-string=localhost:10125 -e show
Connected to Management Server at: localhost:10125
Cluster Configuration
---------------------
[ndbd(NDB)]     2 node(s)
id=1 (not connected, accepting connect from localhost)
id=2 (not connected, accepting connect from localhost)

[ndb_mgmd(MGM)] 1 node(s)
id=3    @localhost  (Version: 5.0.52)

[mysqld(API)]   4 node(s)
id=4 (not connected, accepting connect from any host)
id=5 (not connected, accepting connect from any host)
id=6 (not connected, accepting connect from any host)
id=7 (not connected, accepting connect from any host)

real    0m3.128s
user    0m0.000s
sys     0m0.000s

This takes three seconds!

Suggested fix:
Use the "waitForHBFromNodes" bitmap to see if it's necessary to wait for heartbeat.
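The suggested fix can be illustrated with a small Python model. This is a sketch under stated assumptions, not the NDB source: `HeartbeatWaiter`, `force_heartbeat`, `on_heartbeat_reply` and the `check_bitmap` switch are hypothetical names, and a Python set of node ids stands in for the `waitForHBFromNodes` bitmap. The pre-fix path waits, with a timeout, for a notification that all replies have arrived; when no heartbeat was sent, nobody ever sends that notification and the caller sleeps for the full timeout. The fixed path checks the set first and returns immediately when it is empty.

```python
import threading
import time


class HeartbeatWaiter:
    """Toy model of forcing a heartbeat round and waiting for the replies.

    Hypothetical names throughout; the set `wait_for_hb_from_nodes` stands
    in for the waitForHBFromNodes bitmap mentioned in the bug report.
    """

    def __init__(self, timeout=1.0):
        self.timeout = timeout  # the report: waits up to 1 second
        self.wait_for_hb_from_nodes = set()
        self._all_replied = False
        self._cond = threading.Condition()

    def force_heartbeat(self, connected_nodes, check_bitmap=True):
        with self._cond:
            # A heartbeat is sent only to nodes that are connected right now.
            self.wait_for_hb_from_nodes = set(connected_nodes)
            self._all_replied = False
            if check_bitmap and not self.wait_for_hb_from_nodes:
                # Fix: no heartbeat was sent, so there is nothing to wait for.
                return
            # Pre-fix path: block until the last reply notifies us or the
            # timeout expires. With zero nodes no reply ever arrives, so the
            # caller always sleeps for the full timeout.
            deadline = time.monotonic() + self.timeout
            while not self._all_replied:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break
                self._cond.wait(remaining)

    def on_heartbeat_reply(self, node_id):
        with self._cond:
            if node_id in self.wait_for_hb_from_nodes:
                self.wait_for_hb_from_nodes.discard(node_id)
                if not self.wait_for_hb_from_nodes:
                    # Last outstanding reply: wake the waiting caller.
                    self._all_replied = True
                    self._cond.notify_all()
```

With `HeartbeatWaiter(timeout=1.0)`, calling `force_heartbeat([])` returns at once, while `force_heartbeat([], check_bitmap=False)` reproduces the roughly one-second stall. A node that is already connected is still sent a heartbeat and waited for, so the check only removes the wait when there is provably nothing to wait for.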
[1 Nov 2007 10:08] Magnus Blåudd
The reason this takes three seconds is that the wait for zero nodes to respond actually happens three times: 'printNodeStatus' contains a call to 'updateStatus' (which forces the heartbeat), and 'printNodeStatus' is called three times in a row. It would be better to move the call to 'updateStatus' to just before the three consecutive calls to 'printNodeStatus'.
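The restructuring proposed in this comment can be sketched as a hypothetical Python model, not the NDB source. `StatusReplyModel`, `update_status`, `print_node_status` and `HB_TIMEOUT_S` are illustrative names loosely mirroring 'updateStatus' and 'printNodeStatus', and the model assumes the three calls correspond to the three node-type sections (ndbd, ndb_mgmd, mysqld) of the SHOW output. Instead of sleeping, it counts forced heartbeat rounds so the worst-case delay can be read off directly: three rounds at up to 1 second each account for the three seconds observed, while hoisting the call bounds the delay at one round even before the bitmap check is applied.

```python
class StatusReplyModel:
    """Hypothetical model of how often a SHOW request forces a heartbeat."""

    HB_TIMEOUT_S = 1.0  # worst-case wait per forced heartbeat (from the report)
    NODE_TYPES = ("ndbd", "ndb_mgmd", "mysqld")  # the three SHOW sections

    def __init__(self):
        self.update_calls = 0

    def update_status(self):
        # Stands in for 'updateStatus': in the server this forces a heartbeat
        # and may block for up to HB_TIMEOUT_S. Here we only count the calls.
        self.update_calls += 1

    def print_node_status(self, node_type, refresh):
        # Stands in for 'printNodeStatus'; before the fix it refreshed
        # the status on every call.
        if refresh:
            self.update_status()
        return f"[{node_type}]"

    def show_before_fix(self):
        # One forced heartbeat per node-type section: three in total.
        return [self.print_node_status(t, refresh=True) for t in self.NODE_TYPES]

    def show_after_fix(self):
        # One forced heartbeat, hoisted out of the per-section calls.
        self.update_status()
        return [self.print_node_status(t, refresh=False) for t in self.NODE_TYPES]

    def worst_case_wait_s(self):
        return self.update_calls * self.HB_TIMEOUT_S
```

The design choice is simply to refresh once per request rather than once per printed section; the printed output is unchanged because all three sections read the same, freshly updated status.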
[1 Nov 2007 10:33] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/36839

ChangeSet@1.2540, 2007-11-01 11:33:35+01:00, msvensson@pilot.mysql.com +2 -0
  Bug#32023 ndb_mgmd is slow to repsond when no nodes are up
[1 Nov 2007 10:34] Magnus Blåudd
msvensson@pilot:~/mysql/bug32023/my50-bug32023/mysql-test$ time ../ndb/src/mgmclient/ndb_mgm --connect-string=localhost:10105 -e show
Connected to Management Server at: localhost:10105
Cluster Configuration
---------------------
[ndbd(NDB)]     2 node(s)
id=1 (not connected, accepting connect from localhost)
id=2 (not connected, accepting connect from localhost)

[ndb_mgmd(MGM)] 1 node(s)
id=3    @localhost  (Version: 5.0.52)

[mysqld(API)]   4 node(s)
id=4 (not connected, accepting connect from any host)
id=5 (not connected, accepting connect from any host)
id=6 (not connected, accepting connect from any host)
id=7 (not connected, accepting connect from any host)

real    0m0.160s
user    0m0.000s
sys     0m0.000s
[1 Nov 2007 11:08] Stewart Smith
I'm 100% okay with the first part.

The second part, where we only wait for the timeout if there are nodes in the bitmap, is a slight change of behaviour... if you started the first ndbd and (quickly) issued 'show', it wouldn't show up. With the wait in there, it does (as the 1000ms timeout is enough time for the connection to be established and a HB to arrive), and once it detects one HB from one node, we're done and continue.

Thoughts?
[1 Nov 2007 13:47] Magnus Blåudd
Yes, but that newly started node will show up as soon as it has _connected_ to the cluster. So, if it "quickly" connects, we will send it a heartbeat and wait. Otherwise not.

Behavior is then "show me the cluster status _now_" as opposed to "show me cluster status now and 1 second into the future if no nodes are connected"? ;)
[2 Nov 2007 4:23] Stewart Smith
agree. patch ok.
[7 Dec 2007 23:08] Bugs System
Pushed into 6.0.5-alpha
[7 Dec 2007 23:09] Bugs System
Pushed into 5.1.23-rc
[7 Feb 2008 8:32] Jon Stephens
Documented in the 5.1.23 and 6.0.5 changelogs as follows:

        The management server was slow to respond when no data
        nodes were connected to the cluster. This was most noticeable
        when running SHOW in the management client.