Bug #13987 | Cluster: Loss of data nodes can cause high CPU usage from ndb_mgmd | ||
---|---|---|---|
Submitted: | 12 Oct 2005 22:40 | Modified: | 21 Nov 2007 9:49 |
Reporter: | Jonathan Miller | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Cluster: Cluster (NDB) storage engine | Severity: | S2 (Serious) |
Version: | 5.0.15, 5.0.19 | OS: | Linux (Linux) |
Assigned to: | Stewart Smith | CPU Architecture: | Any |
[12 Oct 2005 22:40]
Jonathan Miller
[13 Oct 2005 15:36]
Jonathan Miller
Update version
[13 Apr 2006 10:54]
Lars Torstensson
I have this problem in version 5.0.19. We also have a 6 node 3 replicas cluster. When the database is idle the CPU on 5 (1,3,4,5,6) off the 6 data nodes is between 2-5%, the last node (2) how ever is between 40-99 (its is not the Master node). The management server is on an average of 60% CPU load. The cluster log doesn’t say anything unusual. If I stop this node (2) none of the other nodes get the same CPU problem, but the ndb_mgmd is still on a high CPU load. And when the node (2) is restarted the problem is still there. Is there any way I can give you more input?
[24 Apr 2006 17:49]
Stewart Smith
great - it's reproducable? I'd love your config.ini and exact configuration. You can upload these as private files. I'll try and reproduce here. but otherwise, are you able to run a debug binary of ndb_mgmd? is this a production cluster or a test one? if production, then don't go and run the debug stuff :) running ndb_mgmd under strace could be useful too.... something like strace -f -ff -r -o mgmtrace ndb_mgmd --nodaemon may help me work out what's going on.
[25 Apr 2006 12:51]
Lars Torstensson
I am not the original poster so I can't upload any files, but you can download the config.ini and mgmtrace file from http://surf.macallan.nu/trace.tar.gz We have a 6 node 3 replicate cluster running on RedHat ES 4 (64bit) and a ndb_mgmd running on RedHat ES 4 (32bit) on our test site. Currently we are using 5.0.20a that we have built from the source tarball. Yes we can run a debug binary of ndb_mgmd, is it a compiler option?
[25 Apr 2006 12:57]
Lars Torstensson
The trace file contains a ndb_mgmd freakout lifecycle. 1. all nodes are running. 2. node 2 is stopped and the cpu on the ndb_mgmd increases. 3. node 2 is started and the cpu on the ndb_mgmd returns to normal. .
[25 Apr 2006 14:54]
Stewart Smith
Thank you very much, this should be very useful in tracking it down (or at least eliminating some possibilities). I'll keep you posted on the progress through writing notes on the bug system.
[16 May 2006 8:30]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/6449
[17 May 2006 6:44]
Tomas Ulin
pushed to 5.0.22 and 5.1.11 as we cannot reproduce, we can only say that this probably fixes this issue
[17 May 2006 7:40]
Tomas Ulin
pushed to 4.1.20 (before in 5.0.22, 5.1.11)
[17 May 2006 13:13]
Jon Stephens
Thank you for your bug report. This issue has been committed to our source repository of that product and will be incorporated into the next release. If necessary, you can access the source repository and build the latest available version, including the bugfix, yourself. More information about accessing the source trees is available at http://www.mysql.com/doc/en/Installing_source_tree.html Additional info: Documented bugfix in 4.1.20/5.0.22/5.1.11 changelogs. Closed.
[18 May 2006 6:47]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/6549
[22 May 2006 12:06]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/6694
[22 May 2006 13:17]
Stewart Smith
Pushed my improvement to 5.1.11 and 5.0.22. Jonas approved patch.
[8 Aug 2006 14:46]
Lars Torstensson
It looks like it's still there in version 5.0.22 http://bugs.mysql.com/bug.php?id=21154
[9 Aug 2006 11:52]
Hartmut Holzgraefe
The actual fix is in 5.0.23, not 5.0.22 which was a security fix release only
[10 Aug 2006 14:50]
Lars Torstensson
I have verified that the bug is still there in 5.0.24, so I think you should reopen this. I verified it on a 6 node 3 replika cluster.
[29 Aug 2006 7:52]
Hartmut Holzgraefe
I was able to reproduce this on our test machines, although on a lower level, running ndb_mgmd on a machine of its own and 2 ndbd processes on three dual CPU boxes. action ndb_mgmd CPU useage ------------------------------------- all nodes up 0% killed node 1 2.5 - 3% killed node 3 4.5 - 5% killed node 5 3.5 - 4% started node 3 2 - 2.5% started node 5 2.5 - 3% started node 1 0%
[29 Aug 2006 12:53]
Jonathan Miller
Hi, Seems that both Lars and Hartmut are still seeing this issue. Therefore setting it back to Verified.
[3 Oct 2006 16:45]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/13028 ChangeSet@1.2265, 2006-10-04 02:38:31+10:00, stewart@willster.(none) +12 -0 BUG#21154 The management server consumes too much CPU BUG#13987 Cluster: Loss of data nodes can cause high CPU usage from ndb_mgmd fix the actual problem (getting incomplete line of data), introduced with previous improvement. also add list sessions, get session and get session id to mgmapi to allow the implementation of a test case for this exact problem.
[5 Oct 2006 8:51]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/13103 ChangeSet@1.2265, 2006-10-05 18:51:13+10:00, stewart@willster.(none) +7 -0 BUG#13987 Cluster: Loss of data nodes can cause high CPU usage from ndb_mgmd smaller patch for 5.0. complete patch going to 5.1 due to more intrusiveness for 'list sessions' etc
[5 Oct 2006 13:43]
Stewart Smith
jonas approved simpler patch for 5.0, full for 5.1 merged and getting ready to push (running tests)
[5 Oct 2006 14:14]
Stewart Smith
pushed to 5.0-ndb and 5.1-ndb
[10 Oct 2006 9:00]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/13380 ChangeSet@1.2060, 2006-10-10 11:00:20+02:00, jonas@perch.ndb.mysql.com +7 -0 ndb - bug#13987 recommit in mysql-5.1-wl2325-5.0
[10 Oct 2006 18:25]
Jonas Oreland
pushed into 5.1.12
[19 Oct 2006 15:20]
Lars Torstensson
I have verified that the bug is still there in 5.0.27 (I built it from Bitkeeper source 2006-10-19) if i use a 6 node 3 replika cluster. The mgmt server uses about 50% cpu. On a 2 replica cluster it looks ok.
[25 Oct 2006 14:44]
Lars Torstensson
I was a little bit to quick earlier when I said it looks ok on a 2 replica cluster. It only looks ok if you don't have any mysqld server connected to the cluster, if you do the cpu of the ndb_mgm ends up at about 35% when you stop one of the ndbd processes.
[26 Oct 2006 2:35]
Stewart Smith
i think this is because the fix may not have hit the main trees yet. you can get the -ndb trees from mysql.bkbits.net now and they certainly have the fix in them. if this still is a problem with them, i'll go cry in the corner... err.. i mean hunt more (as that means there's *another* bug that causes this).
[30 Oct 2006 4:01]
Stewart Smith
Re-opening due to Lars Torstensson still being able to reproduce using 5.0-ndb tree with new fixes. There must be another problem somewhere that has the same symptoms. Working with him to try and discover what may be the issue.
[25 Mar 2007 7:42]
Stewart Smith
Mikael has provided me with some information that may be helpful in trying to track this down. Although he added the disclaimer of "this was on beta hardware" - so it's totally possibly that this is due to hardware bugs in pre-release hardware and not in mgmd. 100% sending API_REGREQ to data nodes (filling their JobBuffer)
[4 Apr 2007 14:11]
Geert Vanderkelen
Repeatable with versions 5.0.27, 5.0.30, 5.0.32, 5.0.33 and 5.0.34. Not repeatable anymore with 5.0.36 and 5.0.37: CPU usage of ndb_mgmd is 0.0 and stays like that while nodes are being stopped. Only change I could see with those versions is fix for bug 22013.
[5 Apr 2007 5:56]
Stewart Smith
quite possible that this bug fixed things.... I'd still like to write up a test program or something though to prove it.
[21 Nov 2007 8:48]
Tomas Ulin
A number of fixes have been made to adress this. No opens support issues, and we have not seen any reports like this for a long time. Therefore closing the bug.
[21 Nov 2007 9:49]
Jon Stephens
Thank you for your bug report. This issue has already been fixed in the latest released version of that product, which you can download at http://www.mysql.com/downloads/ Documented bugfix in 5.0.36 and 5.1.12 changelogs, after discussing with Tomas.