Bug #13987 Cluster: Loss of data nodes can cause high CPU usage from ndb_mgmd
Submitted: 12 Oct 2005 22:40  Modified: 21 Nov 2007 9:49
Reporter: Jonathan Miller
Status: Closed
Category: MySQL Cluster: Cluster (NDB) storage engine  Severity: S2 (Serious)
Version: 5.0.15, 5.0.19  OS: Linux
Assigned to: Stewart Smith  CPU Architecture: Any

[12 Oct 2005 22:40] Jonathan Miller
Description:
This came out of an email from one of the consultants.

Basically you set up 6 hosts using 2 replicas:

host 1 = 1 ndbd - 1 mysqld - 1 ndb_mgmd
host 2 = 1 ndbd - 1 mysqld
host 3 = 1 ndbd - 1 mysqld 
host 4 = 1 ndbd - 1 mysqld 
host 5 = 1 ndbd - 1 mysqld 
host 6 = 1 ndbd - 1 mysqld 
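
For reference, a minimal config.ini matching this layout might look like the sketch below (hostnames and memory parameters are illustrative placeholders, not taken from the report):

  [ndbd default]
  # Two copies of each fragment, as in the setup described above
  NoOfReplicas=2
  DataMemory=512M
  IndexMemory=64M

  [ndb_mgmd]
  HostName=host1

  # One data node per host
  [ndbd]
  HostName=host1
  [ndbd]
  HostName=host2
  [ndbd]
  HostName=host3
  [ndbd]
  HostName=host4
  [ndbd]
  HostName=host5
  [ndbd]
  HostName=host6

  # One API slot per mysqld
  [mysqld]
  [mysqld]
  [mysqld]
  [mysqld]
  [mysqld]
  [mysqld]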

1) Start the cluster and pump some data into a cluster database.
2) Do "top" on host 1 and notice that ndb_mgmd is using less the < 1% of the cpu.
Kill one of the data nodes. 

ndb_mgmd is now using > 20% of the CPU and will continue to do so.

3) Kill another data node 

ndb_mgmd is now bouncing between 25% and 30% CPU usage, and the ndb_mgm client is very slow to respond to commands.

4) Restart one of the killed data nodes.

ndb_mgmd is now back to using about 20% of the CPU.

5) Restart the other killed data node.

ndb_mgmd returns to using less than 1% of the CPU.

There seems to be some looping going on inside ndb_mgmd once a data node is lost.
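
One way to watch this from host 1 is sketched below (it assumes pidof and top's batch mode are available, and that the node being killed runs on one of the other hosts):

  # On host 1: sample ndb_mgmd's CPU usage once per second
  top -b -d 1 -p $(pidof ndb_mgmd)

  # On host 2: simulate a data node failure. ndbd normally runs an
  # angel process alongside the worker, so pidof may return two PIDs;
  # killing both prevents an automatic restart.
  kill -9 $(pidof ndbd)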

With one data node killed:
PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
13772 ndbdev    15   0 75520  10m 1584 S 21.8  0.3   2:54.21 ndb_mgmd
13784 ndbdev    15   0 1373m 338m 2644 S  0.4 11.1   1:26.26 ndbd

With two data nodes killed:
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
13772 ndbdev    15   0 75520  10m 1584 S 25.5  0.3   3:08.08 ndb_mgmd
13997 ndbdev    16   0  175m  25m 3860 S  0.5  0.8   1:42.98 mysqld

Cluster Log:
2005-10-13 00:36:46 [MgmSrvr] ALERT    -- Node 2: Node 4 Disconnected
2005-10-13 00:36:46 [MgmSrvr] INFO     -- Node 2: Communication to Node 4 closed
2005-10-13 00:36:46 [MgmSrvr] ALERT    -- Node 3: Node 4 Disconnected
2005-10-13 00:36:46 [MgmSrvr] ALERT    -- Node 5: Node 4 Disconnected
2005-10-13 00:36:46 [MgmSrvr] INFO     -- Node 3: Communication to Node 4 closed
2005-10-13 00:36:46 [MgmSrvr] INFO     -- Node 5: Communication to Node 4 closed
2005-10-13 00:36:46 [MgmSrvr] ALERT    -- Node 6: Node 4 Disconnected
2005-10-13 00:36:46 [MgmSrvr] INFO     -- Node 6: Communication to Node 4 closed
2005-10-13 00:36:46 [MgmSrvr] INFO     -- Node 7: Communication to Node 4 closed
2005-10-13 00:36:46 [MgmSrvr] ALERT    -- Node 2: Arbitration check won - node group majority
2005-10-13 00:36:46 [MgmSrvr] INFO     -- Node 2: President restarts arbitration thread [state=6]
2005-10-13 00:36:46 [MgmSrvr] ALERT    -- Node 7: Node 4 Disconnected
2005-10-13 00:36:49 [MgmSrvr] INFO     -- Node 5: Communication to Node 4 opened
2005-10-13 00:36:49 [MgmSrvr] INFO     -- Node 6: Communication to Node 4 opened
2005-10-13 00:36:49 [MgmSrvr] INFO     -- Node 3: Communication to Node 4 opened
2005-10-13 00:36:49 [MgmSrvr] INFO     -- Node 2: Communication to Node 4 opened
2005-10-13 00:36:50 [MgmSrvr] INFO     -- Node 7: Communication to Node 4 opened
2005-10-13 00:38:11 [MgmSrvr] INFO     -- Node 1: Node 7 Connected
2005-10-13 00:38:11 [MgmSrvr] ALERT    -- Node 2: Node 7 Disconnected
2005-10-13 00:38:11 [MgmSrvr] INFO     -- Node 2: Communication to Node 7 closed
2005-10-13 00:38:11 [MgmSrvr] ALERT    -- Node 3: Node 7 Disconnected
2005-10-13 00:38:11 [MgmSrvr] INFO     -- Node 3: Communication to Node 7 closed
2005-10-13 00:38:11 [MgmSrvr] ALERT    -- Node 5: Node 7 Disconnected
2005-10-13 00:38:11 [MgmSrvr] INFO     -- Node 5: Communication to Node 7 closed
2005-10-13 00:38:11 [MgmSrvr] ALERT    -- Node 6: Node 7 Disconnected
2005-10-13 00:38:11 [MgmSrvr] INFO     -- Node 6: Communication to Node 7 closed
2005-10-13 00:38:11 [MgmSrvr] ALERT    -- Node 2: Arbitration check won - node group majority
2005-10-13 00:38:11 [MgmSrvr] INFO     -- Node 2: President restarts arbitration thread [state=6]
2005-10-13 00:38:14 [MgmSrvr] INFO     -- Node 2: Communication to Node 7 opened
2005-10-13 00:38:14 [MgmSrvr] INFO     -- Node 5: Communication to Node 7 opened
2005-10-13 00:38:14 [MgmSrvr] INFO     -- Node 6: Communication to Node 7 opened
2005-10-13 00:38:15 [MgmSrvr] INFO     -- Node 3: Communication to Node 7 opened

How to repeat:
See above
[13 Oct 2005 15:36] Jonathan Miller
Update version
[13 Apr 2006 10:54] Lars Torstensson
I have this problem in version 5.0.19.

We also have a 6-node, 3-replica cluster.

When the database is idle, the CPU usage on five (1, 3, 4, 5, 6) of the six data nodes is between 2% and 5%; the last node (2), however, is between 40% and 99% (it is not the master node).

The management server averages about 60% CPU load.

The cluster log doesn’t say anything unusual.

If I stop this node (2), none of the other nodes develops the same CPU problem, but ndb_mgmd stays at a high CPU load. And when node 2 is restarted, the problem is still there.

Is there any way I can give you more input?
[24 Apr 2006 17:49] Stewart Smith
Great - it's reproducible?

I'd love your config.ini and exact configuration. You can upload these as private files.

I'll try to reproduce it here. But otherwise, are you able to run a debug binary of ndb_mgmd? Is this a production cluster or a test one?

If production, then don't go and run the debug stuff :)

Running ndb_mgmd under strace could be useful, too.

Something like
strace -f -ff -r -o mgmtrace ndb_mgmd --nodaemon

may help me work out what's going on.
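
If it really is busy-looping, the -r (relative timestamp) output should make that obvious. A rough way to summarize the per-process trace files is sketched below (the mgmtrace.* names follow from the -ff -o options above):

  # Count which syscalls dominate each trace file; a busy loop shows
  # up as a huge number of calls with near-zero relative timestamps.
  for f in mgmtrace.*; do
    echo "== $f =="
    awk '{print $2}' "$f" | cut -d'(' -f1 | sort | uniq -c | sort -rn | head
  done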
[25 Apr 2006 12:51] Lars Torstensson
I am not the original poster, so I can't upload any files, but you can download the config.ini and mgmtrace files from http://surf.macallan.nu/trace.tar.gz

We have a 6-node, 3-replica cluster running on RedHat ES 4 (64-bit) and
ndb_mgmd running on RedHat ES 4 (32-bit) on our test site.

Currently we are using 5.0.20a that we have built from the source tarball.

Yes, we can run a debug binary of ndb_mgmd; is it a compiler option?
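
(For reference: in 5.0-era source builds, a debug binary was enabled at configure time rather than by a separate compiler invocation. A sketch, assuming a build from the source tarball:)

  # Configure with cluster support and debugging compiled in; this
  # also produces debug versions of the NDB binaries such as ndb_mgmd.
  ./configure --with-ndbcluster --with-debug
  make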
[25 Apr 2006 12:57] Lars Torstensson
The trace file contains an ndb_mgmd freakout lifecycle:

1. All nodes are running.
2. Node 2 is stopped and the CPU usage of ndb_mgmd increases.
3. Node 2 is started and the CPU usage of ndb_mgmd returns to normal.
[25 Apr 2006 14:54] Stewart Smith
Thank you very much, this should be very useful in tracking it down (or at least eliminating some possibilities).

I'll keep you posted on the progress through writing notes on the bug system.
[16 May 2006 8:30] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/6449
[17 May 2006 6:44] Tomas Ulin
pushed to 5.0.22 and 5.1.11

As we cannot reproduce it ourselves, we can only say that this probably fixes the issue.
[17 May 2006 7:40] Tomas Ulin
Pushed to 4.1.20 (previously to 5.0.22 and 5.1.11).
[17 May 2006 13:13] Jon Stephens
Thank you for your bug report. This issue has been committed to our
source repository of that product and will be incorporated into the
next release.

If necessary, you can access the source repository and build the latest
available version, including the bugfix, yourself. More information 
about accessing the source trees is available at
    http://www.mysql.com/doc/en/Installing_source_tree.html

Additional info:

Documented bugfix in 4.1.20/5.0.22/5.1.11 changelogs. Closed.
[18 May 2006 6:47] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/6549
[22 May 2006 12:06] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/6694
[22 May 2006 13:17] Stewart Smith
Pushed my improvement to 5.1.11 and 5.0.22.

Jonas approved the patch.
[8 Aug 2006 14:46] Lars Torstensson
It looks like it's still there in version 5.0.22
http://bugs.mysql.com/bug.php?id=21154
[9 Aug 2006 11:52] Hartmut Holzgraefe
The actual fix is in 5.0.23, not 5.0.22, which was a security-fix release only.
[10 Aug 2006 14:50] Lars Torstensson
I have verified that the bug is still there in 5.0.24, so I think you should reopen this.

I verified it on a 6-node, 3-replica cluster.
[29 Aug 2006 7:52] Hartmut Holzgraefe
I was able to reproduce this on our test machines, although at a lower level,
running ndb_mgmd on a machine of its own and 2 ndbd processes on each of three dual-CPU boxes.

action            ndb_mgmd CPU usage
-------------------------------------
all nodes up      0%
killed node 1     2.5 - 3%
killed node 3     4.5 - 5%
killed node 5     3.5 - 4%
started node 3    2 - 2.5%
started node 5    2.5 - 3%
started node 1    0%
[29 Aug 2006 12:53] Jonathan Miller
Hi,

It seems that both Lars and Hartmut are still seeing this issue. Therefore, setting it back to Verified.
[3 Oct 2006 16:45] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/13028

ChangeSet@1.2265, 2006-10-04 02:38:31+10:00, stewart@willster.(none) +12 -0
  BUG#21154 The management server consumes too much CPU
  BUG#13987 Cluster: Loss of data nodes can cause high CPU usage from ndb_mgmd
  
  fix the actual problem (getting incomplete line of data), introduced with previous
  improvement.
  
  also add list sessions, get session and get session id to mgmapi to allow
  the implementation of a test case for this exact problem.
[5 Oct 2006 8:51] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/13103

ChangeSet@1.2265, 2006-10-05 18:51:13+10:00, stewart@willster.(none) +7 -0
  BUG#13987 Cluster: Loss of data nodes can cause high CPU usage from ndb_mgmd
  
  smaller patch for 5.0.
  
  complete patch going to 5.1 due to more intrusiveness for 'list sessions' etc
[5 Oct 2006 13:43] Stewart Smith
Jonas approved the simpler patch for 5.0, the full one for 5.1.

Merged and getting ready to push (running tests).
[5 Oct 2006 14:14] Stewart Smith
pushed to 5.0-ndb and 5.1-ndb
[10 Oct 2006 9:00] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/13380

ChangeSet@1.2060, 2006-10-10 11:00:20+02:00, jonas@perch.ndb.mysql.com +7 -0
  ndb - bug#13987
    recommit in mysql-5.1-wl2325-5.0
[10 Oct 2006 18:25] Jonas Oreland
pushed into 5.1.12
[19 Oct 2006 15:20] Lars Torstensson
I have verified that the bug is still there in 5.0.27 (built from BitKeeper source, 2006-10-19) when I use a 6-node, 3-replica cluster. The management server uses about 50% CPU.

On a 2-replica cluster it looks OK.
[25 Oct 2006 14:44] Lars Torstensson
I was a little too quick earlier when I said it looks OK on a 2-replica cluster. It only looks OK if you don't have any mysqld server connected to the cluster; if you do, the CPU usage of ndb_mgmd ends up at about 35% when you stop one of the ndbd processes.
[26 Oct 2006 2:35] Stewart Smith
I think this is because the fix may not have hit the main trees yet.

You can get the -ndb trees from mysql.bkbits.net now, and they certainly have the fix in them. If this is still a problem with them, I'll go cry in the corner... err.. I mean hunt more (as that means there's *another* bug causing this).
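
(For reference, cloning such a tree with BitKeeper would have looked roughly like the sketch below; the exact repository name is an assumption:)

  # Clone the 5.0 NDB team tree from the public BitKeeper server
  bk clone bk://mysql.bkbits.net/mysql-5.0-ndb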
[30 Oct 2006 4:01] Stewart Smith
Re-opening, as Lars Torstensson is still able to reproduce this using the 5.0-ndb tree with the new fixes.

There must be another problem somewhere with the same symptoms. Working with him to try to discover what the issue may be.
[25 Mar 2007 7:42] Stewart Smith
Mikael has provided me with some information that may be helpful in trying to track this down.

He added the disclaimer "this was on beta hardware", so it's entirely possible that this is due to bugs in the pre-release hardware and not in ndb_mgmd.

ndb_mgmd was spending 100% of its CPU time sending API_REGREQ signals to the data nodes (filling their JobBuffer).
[4 Apr 2007 14:11] Geert Vanderkelen
Repeatable with versions 5.0.27, 5.0.30, 5.0.32, 5.0.33 and 5.0.34.

No longer repeatable with 5.0.36 and 5.0.37: CPU usage of ndb_mgmd is 0.0 and stays that way while nodes are being stopped.

The only change I could see in those versions is the fix for bug 22013.
[5 Apr 2007 5:56] Stewart Smith
Quite possible that that fix resolved things. I'd still like to write up a test program or something to prove it, though.
[21 Nov 2007 8:48] Tomas Ulin
A number of fixes have been made to address this. There are no open support issues, and we have not seen any reports like this for a long time. Therefore, closing the bug.
[21 Nov 2007 9:49] Jon Stephens
Thank you for your bug report. This issue has already been fixed in the latest released version of that product, which you can download at

  http://www.mysql.com/downloads/

Documented bugfix in 5.0.36 and 5.1.12 changelogs, after discussing with Tomas.