Bug #21154 The management server consumes too much CPU
Submitted: 19 Jul 2006 15:20 Modified: 3 Oct 2006 16:52
Reporter: Lars Torstensson
Status: Duplicate
Category: MySQL Cluster: Cluster (NDB) storage engine  Severity: S2 (Serious)
Version: 5.0.24  OS: Linux (Redhat Linux 2.6.9-34.ELsmp #1)
Assigned to: Assigned Account  CPU Architecture: Any

[19 Jul 2006 15:20] Lars Torstensson
Description:
We have a cluster with 6 data nodes and 3 replicas.

If I stop one of the 6 data nodes, the CPU usage of the management server (ndb_mgmd) goes to 99.9%. If I restart the data node, the CPU usage of ndb_mgmd returns to normal (0.5%).

This bug is reminiscent of BUG#13987.

How to repeat:
Stop one of the 6 data nodes
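
To put numbers on the symptom rather than reading them off top, the CPU time of the ndb_mgmd process can be sampled directly before and after stopping a data node. The following is a minimal Python sketch, assuming a Linux /proc filesystem; the helper names (parse_cpu_ticks, cpu_percent, sample_cpu_percent) are illustrative and are not part of any MySQL tool.

```python
import os
import sys
import time


def parse_cpu_ticks(stat_line: str) -> int:
    """Return utime + stime (in clock ticks) from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split after the LAST closing parenthesis.
    fields = stat_line.rsplit(")", 1)[1].split()
    # fields[0] is the process state (field 3 in proc(5)), so utime
    # (field 14) and stime (field 15) are at indexes 11 and 12.
    return int(fields[11]) + int(fields[12])


def cpu_percent(ticks_before: int, ticks_after: int,
                interval_s: float, ticks_per_s: int) -> float:
    """Convert a tick delta over an interval into percent of one CPU."""
    return 100.0 * (ticks_after - ticks_before) / (ticks_per_s * interval_s)


def sample_cpu_percent(pid: int, interval_s: float = 1.0) -> float:
    """Sample /proc/<pid>/stat twice and return CPU usage in percent."""
    ticks_per_s = os.sysconf("SC_CLK_TCK")
    with open(f"/proc/{pid}/stat") as f:
        before = parse_cpu_ticks(f.read())
    time.sleep(interval_s)
    with open(f"/proc/{pid}/stat") as f:
        after = parse_cpu_ticks(f.read())
    return cpu_percent(before, after, interval_s, ticks_per_s)


if __name__ == "__main__":
    # Usage (script name is hypothetical): python3 sample_cpu.py <pid>
    print(f"{sample_cpu_percent(int(sys.argv[1])):.1f}% CPU")
```

Running this against the ndb_mgmd PID once with all data nodes up and once with one node stopped would give two figures comparable to the 0.5% and 99.9% reported above.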
[19 Jul 2006 15:33] MySQL Verification Team
Changing to Cluster Category.
[16 Aug 2006 7:56] Stewart Smith
This is notoriously hard to reproduce and seems to never want to occur when we test it.

In BUG#13987, we fixed two problems that could cause higher CPU usage in ndb_mgmd. However, there's obviously another one (as mentioned in the support issue, this was reproducible with 5.0.24, after these fixes).

So I'd like to get a really clear picture of the exact setup that you can reproduce this with, including how many ndb_mgm clients were connected, what commands were issued, etc.

Hopefully we can get this reproduced in the lab; otherwise, we may be able to arrange a remote debugging setup to help us pinpoint the problem.
[21 Aug 2006 11:20] Lars Torstensson
As I mentioned, we have a 6-node, 3-replica cluster on Red Hat Linux. It is currently configured with 2 mgm servers (only one is started). The data nodes run the 64-bit "Linux 2.6.9-34.ELsmp #1 SMP" kernel, and the started mgm server runs the 32-bit "Linux 2.6.9-34.ELsmp #1 SMP" kernel.
[26 Sep 2006 6:39] Stewart Smith
Changing to In Progress, a while after it has actually been in progress.

I believe I have a fix, as well as a method to reproduce. I'm going to write a test to double-check.

I have been able to reproduce some form of this bug locally now too.

I'm not sure how this fits with the usage returning to normal when the node rejoins the cluster, though... perhaps some TCP stack fun in there.
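
As a generic illustration of how a dropped connection can turn into a CPU spin (not a statement about what ndb_mgmd actually does, and not the confirmed cause here): once a peer closes its end, the remaining socket is reported as readable on every select() call until the application notices the end-of-stream and stops watching it. A minimal Python sketch, with a hypothetical function name and a local socket pair standing in for a real node connection:

```python
import select
import socket


def count_wakeups_after_peer_close(iterations: int = 5,
                                   timeout_s: float = 1.0) -> int:
    """Count how often select() reports a socket readable after its
    peer has been closed and nothing removes it from the watch set."""
    a, b = socket.socketpair()
    b.close()  # stands in for the remote side going away
    wakeups = 0
    try:
        for _ in range(iterations):
            # End-of-stream counts as "readable", so select() returns
            # at once instead of waiting for timeout_s.
            readable, _, _ = select.select([a], [], [], timeout_s)
            if a in readable:
                wakeups += 1
    finally:
        a.close()
    return wakeups
```

Every iteration wakes up immediately, so an event loop that keeps such a socket in its set without closing it would run flat out rather than sleeping in select().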
[3 Oct 2006 16:52] Stewart Smith
I have a fix, a test, and a patch.

Attached to BUG#13987 as this is technically a duplicate of it.