MySQL Bugs: #21154: The management server consumes too much CPU

Bug #21154	The management server consumes too much CPU
Submitted:	19 Jul 2006 15:20	Modified:	3 Oct 2006 16:52
Reporter:	Lars Torstensson	Email Updates:
Status:	Duplicate	Impact on me:	None
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S2 (Serious)
Version:	5.0.24	OS:	Linux (Redhat Linux 2.6.9-34.ELsmp #1 )
Assigned to:	Assigned Account	CPU Architecture:	Any

Description:
We have a 6-node, 3-replica cluster.

If I stop one of the 6 data nodes, the CPU usage of the management server (ndb_mgmd) goes to 99.9%. If I restart the data node, the CPU usage of ndb_mgmd returns to normal (0.5%).

This bug is reminiscent of Bug #13987.

 

How to repeat:
Stop one of the 6 data nodes
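
To put numbers on the CPU usage before and after stopping the data node, a small sampler along the following lines may help. This is only a sketch and not part of the original report: it assumes a Linux /proc filesystem, the function names are made up for illustration, and the PID of ndb_mgmd has to be supplied by hand.

```python
import os
import time


def parse_cpu_ticks(stat_line):
    """Return utime + stime (in clock ticks) from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces, so split on the last ')' before splitting the other fields.
    fields = stat_line.rsplit(")", 1)[1].split()
    # fields[0] is the process state (field 3 of the file), so utime and
    # stime (fields 14 and 15) sit at indexes 11 and 12.
    return int(fields[11]) + int(fields[12])


def cpu_percent(ticks_before, ticks_after, interval_s, ticks_per_s):
    """CPU usage over the interval, as a percentage of one core."""
    return 100.0 * (ticks_after - ticks_before) / (ticks_per_s * interval_s)


def sample_cpu_percent(pid, interval_s=5.0):
    """Sample a process twice, interval_s seconds apart (Linux only)."""
    ticks_per_s = os.sysconf("SC_CLK_TCK")
    with open(f"/proc/{pid}/stat") as f:
        before = parse_cpu_ticks(f.read())
    time.sleep(interval_s)
    with open(f"/proc/{pid}/stat") as f:
        after = parse_cpu_ticks(f.read())
    return cpu_percent(before, after, interval_s, ticks_per_s)
```

Calling sample_cpu_percent() with the ndb_mgmd PID once while all data nodes are up and once after stopping a node should show whether the jump described above occurs on a given setup.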

Changing to Cluster category.

This is notoriously hard to reproduce and seems to never want to occur when we test it.

In BUG#13987, we fixed two problems that could cause higher CPU usage in ndb_mgmd. However, there is obviously another one (as mentioned in the support issue, this was reproducible with 5.0.24, which already includes those fixes).

So, I'd like to get a really clear picture of the exact setup in which you can reproduce this, including how many ndb_mgm clients were connected, what commands were issued, etc.

Hopefully we can get this reproduced in the lab. If not, we may be able to arrange a remote debugging setup to help us pinpoint the problem.

As I mentioned, we have a 6-node, 3-replica cluster on RedHat Linux. It is currently configured with 2 mgm-servers (only one is started). The data nodes run a "Linux 2.6.9-34.ELsmp #1 SMP" 64-bit kernel, and the started mgm-server runs a "Linux 2.6.9-34.ELsmp #1 SMP" 32-bit kernel.

Changing to In Progress, a while after it has actually been "in progress".

I believe I have a fix, as well as a method to reproduce. Going to write a test to double-check.

I have been able to reproduce some form of this bug locally now too.

Although I'm not sure how this fits with the usage returning to normal when the node rejoins the cluster... perhaps some TCP stack fun in there.

I have a fix, a test, and a patch.

Attached to BUG#13987 as this is technically a duplicate of it.