Bug #53440 erratic index when bringing data node online into a working cluster
Submitted: 5 May 2010 15:41
Modified: 26 Dec 2010 15:52
Reporter: Richard McCluskey
Status: Verified
Impact on me: None
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S1 (Critical)
Version: mysql-5.1-telco-7.0
OS: Linux (x86_64 2.6.18-53.1.21.el5 #1 SMP)
Assigned to:
CPU Architecture: Any
Tags: 7.0.9b

[5 May 2010 15:41] Richard McCluskey
Description:
Cluster Configuration
---------------------
[ndbd(NDB)]	2 node(s)
id=4	@10.32.4.20  (mysql-5.1.39 ndb-7.0.9, Nodegroup: 0)
id=5	@10.32.4.30  (mysql-5.1.39 ndb-7.0.9, Nodegroup: 0, Master)

[ndb_mgmd(MGM)]	1 node(s)
id=1	@10.32.4.10  (mysql-5.1.39 ndb-7.0.9)

[mysqld(API)]	5 node(s)
id=2	@10.32.4.40  (mysql-5.1.39 ndb-7.0.9)
id=3	@10.32.4.50  (mysql-5.1.39 ndb-7.0.9)
id=6 (not connected, accepting connect from 10.32.4.60)
id=7 (not connected, accepting connect from 10.32.4.70)
id=8 (not connected, accepting connect from any host)

The two data nodes are both 4 x quad-core Intel machines with 24 GB of memory. They had run for the last 75 days without issue. Last Friday the CPUs started really pegging out and the whole cluster slowed down. We decided to drop each data node in turn and bring it back up again, flushing its memory as it were.
When we tried to bring the first one (node id 4) back online, it never synced. It has been 14 hours now; node 5's usage is at 67% while node 4's usage is still at 60%. Even worse is node 4's index usage constantly flipping from 81% to 0% and back to 81%. A small snippet from the log:

2010-05-05 09:02:39 [MgmtSrvr] INFO     -- Node 4: Index usage decreased to 0%(128 8K pages of total 72288)
2010-05-05 09:02:40 [MgmtSrvr] INFO     -- Node 4: Index usage increased to 81%(59243 8K pages of total 72288)
2010-05-05 09:02:40 [MgmtSrvr] INFO     -- Node 4: Index usage decreased to 0%(128 8K pages of total 72288)
2010-05-05 09:02:41 [MgmtSrvr] INFO     -- Node 4: Index usage increased to 81%(59243 8K pages of total 72288)

This has gone on for roughly 17 hours since we attempted to bring the node online.
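As a sanity check on the figures in the log above (each page is 8 KB), the total IndexMemory and the percentage in use can be recomputed; the variable names below are my own, only the page counts come from the log:

```shell
# Recompute the IndexMemory figures reported by the management server.
# 72288 and 59243 are the page counts from the log lines above.
total_pages=72288
used_pages=59243
total_mb=$(( total_pages * 8 / 1024 ))     # 8 KB pages -> MB
pct=$(( used_pages * 100 / total_pages ))  # integer percentage in use
echo "IndexMemory total: ${total_mb} MB, used: ${pct}%"
```

This matches the log: roughly 564 MB of IndexMemory, with the usage jumping between 81% and near-zero.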

Yes, we use tablespaces and disk data storage.

The biggest table holds about 16.5 million records of about 3-5 KB each.
The next biggest table adds about 5 to 10 million records a day, but is truncated daily (after being drawn off into a warehouse).

How to repeat:

Start with a 'pegged' pair of data nodes in a cluster with 2 data nodes, 2 SQL nodes, and 1 management node.
Stop one data node, wait 10 minutes, and start it up again. Watch the node's index usage flip back and forth between 0% and 81%.
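The steps above can be sketched with the standard ndb_mgm client; the management server address is taken from the cluster configuration in this report, and this is an operational sketch against a live cluster rather than something runnable in isolation:

```shell
# Stop data node 4 via the management client (mgmd is on 10.32.4.10
# per the cluster configuration above).
ndb_mgm --ndb-connectstring=10.32.4.10 -e "4 STOP"

# Wait ~10 minutes, then restart the data node process on 10.32.4.20
# (ndbmtd here, since the report mentions the ndb(mt)d process).
sleep 600
ndbmtd --ndb-connectstring=10.32.4.10

# Watch Data/Index memory usage while the node tries to come online;
# the oscillating index figures also appear in the mgmd cluster log.
watch -n 5 'ndb_mgm --ndb-connectstring=10.32.4.10 -e "ALL REPORT MEMORYUSAGE"'
```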
[5 May 2010 16:04] Richard McCluskey
I have uploaded the ndb_error_report data (was a large amount!) to :

ftp.mysql.com/pub/mysql/upload.

The file is called : bug-data-53440-ndb_error_report.tar.bz2
[5 May 2010 16:07] Richard McCluskey
When I say the data nodes were pegged out, I mean that:

One of the cores was constantly at 95% CPU usage, consumed by the ndb(mt)d process.

Memory consumption went far above the settings in config.ini. In normal operation the machine had a spare 750 MB of memory. When the system 'pegged', it was down to 67 MB of free memory.

Under normal operation the CPU usage sits at around 67 to 78%.