Bug #45497 ndb_mgmd open file descriptor exhaustion and 100% CPU usage
Submitted: 15 Jun 2009 10:55    Modified: 9 Dec 2009 8:22
Reporter: Emmanuel Bretelle    Status: Closed
Category: MySQL Cluster: Cluster (NDB) storage engine    Severity: S1 (Critical)
Version: mysql-5.1-telco-7.0    OS: Any (Debian Lenny)
Assigned to: Magnus Blåudd    CPU Architecture: Any
Tags: 5.1.33, 5.1.35, 5.1.37, 5.1.39

[15 Jun 2009 10:55] Emmanuel Bretelle
Description:

Host: debian lenny 32bits
kernel: 2.6.26-1-686

compiling options: ./configure --prefix=/usr/local/mysql-5.1.35 --enable-community-features --with-mysqld-user=mysql --with-mysqlmanager --with-plugins=ndbcluster

After running for some time (3-5 days), ndb_mgmd starts taking 100% CPU.
strace indicates that the process is looping on the "accept" syscall, trying to open a new file descriptor, but failing because ndb_mgmd has exhausted all of its file descriptors.

netstat and lsof show that many (1K or so) sockets are stuck in CLOSE_WAIT state.

Trying to connect to the CLI using "ndb_mgm" fails as there are no more resources available.

On the other hand, clients are running fine.

It looks like the client shuts down its side of the connection, while the server never closes its end of the socket, which is what leaves it in CLOSE_WAIT.
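To illustrate what I mean, here is a minimal sketch (not ndb_mgmd source code; the handler name is made up): a socket stays in CLOSE_WAIT when the peer has sent its FIN but the local process never calls close() on the descriptor, so each such socket keeps one file descriptor occupied.

#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical per-connection read handler. */
bool handle_readable(int fd)
{
  char buf[512];
  ssize_t n = recv(fd, buf, sizeof(buf), 0);
  if (n > 0)
    return true;   /* got request data, keep the session */
  if (n == 0)
  {
    /* Peer closed its end: the socket is now in CLOSE_WAIT until we
       close() it. Skipping this close() leaks one fd per closed client. */
    close(fd);
    return false;
  }
  close(fd);       /* real error: also release the descriptor */
  return false;
}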

The workaround so far is to watch the ndb_mgmd process and restart it when the number of CLOSE_WAIT connections gets too high :s

Setting the severity to critical as the service is no longer available.

How to repeat:
Unknown so far; the issue is triggered after the service has been running for some days.
Hopefully, time will tell.
[15 Jun 2009 10:57] Emmanuel Bretelle
netstat, strace, ndb config and config.log

Attachment: bug_logs.tar.bz2 (application/x-bzip, text), 62.52 KiB.

[25 Jun 2009 7:14] Emmanuel Bretelle
Here are graphs showing how the number of connections in CLOSE_WAIT evolves.

ndb_mgmd_watchdog.log-1.png: graph since the last time the service was restarted.
ndb_mgmd_watchdog.log-2.png: graph showing only the timeframe "2009-06-25 03:00:00" to "2009-06-25 05:00:00".
ndb_mgmd_watchdog.log: logs the number of connections in CLOSE_WAIT state every 5 minutes.
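For reference, the watchdog does nothing fancy: every 5 minutes it just counts the sockets in CLOSE_WAIT on the ndb_mgmd port (1186). A rough sketch of that check (hypothetical code, the real watchdog is only a small script):

#include <cstdio>
#include <fstream>
#include <sstream>
#include <string>

/* Count sockets on the given local port that are in CLOSE_WAIT
   (state "08" in /proc/net/tcp). */
int count_close_wait(int port)
{
  std::ifstream tcp("/proc/net/tcp");
  std::string line;
  std::getline(tcp, line);                 /* skip the header line */
  int count = 0;
  while (std::getline(tcp, line))
  {
    std::istringstream in(line);
    std::string sl, local, remote, state;
    in >> sl >> local >> remote >> state;  /* fields: "sl local rem st ..." */
    std::size_t colon = local.find(':');
    if (colon == std::string::npos)
      continue;
    int local_port = std::stoi(local.substr(colon + 1), nullptr, 16);
    if (local_port == port && state == "08")
      ++count;
  }
  return count;
}

int main()
{
  std::printf("CLOSE_WAIT on port 1186: %d\n", count_close_wait(1186));
  return 0;
}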

It took about 10 hours before the service started having connections in CLOSE_WAIT state.
The number increased slowly and sporadically.
Finally, on 25/06/2009 at 03:35, the service started accumulating a lot of stalled CLOSE_WAIT connections.
In about one hour, the number reached 550+ CLOSE_WAIT connections, and the service was restarted.
[25 Jun 2009 7:15] Emmanuel Bretelle
ndb_mgmd_watchdog.log-1.png

Attachment: ndb_mgmd_watchdog.log-1.png (image/png, text), 3.20 KiB.

[25 Jun 2009 7:15] Emmanuel Bretelle
ndb_mgmd_watchdog.log-2.png

Attachment: ndb_mgmd_watchdog.log-2.png (image/png, text), 4.31 KiB.

[25 Jun 2009 7:15] Emmanuel Bretelle
ndb_mgmd_watchdog.log

Attachment: ndb_mgmd_watchdog.log (text/x-log), 78.66 KiB.

[25 Jun 2009 7:16] Emmanuel Bretelle
If you guys need any other log... let me know.
[16 Jul 2009 16:26] Emmanuel Bretelle
new graph

Attachment: ndb_mgmd_watchdog.log-over1month.png (image/png, text), 5.28 KiB.

[16 Jul 2009 16:27] Emmanuel Bretelle
new logs

Attachment: ndb_mgmd_watchdog.log (application/octet-stream, text), 248.61 KiB.

[4 Aug 2009 10:55] Emmanuel Bretelle
Uploaded a core and exe with debug symbols to ftp://ftp.mysql.com/pub/mysql/upload/

File is called bug-data-core-45497.tar.bz2 

Hope this helps.

Meanwhile, I am giving 5.1.37 a spin in case it helps.
[13 Aug 2009 8:59] Emmanuel Bretelle
Adding 5.1.37 to the list
[13 Aug 2009 9:00] Emmanuel Bretelle
bug-data-45497-ndb_mgmd-5.1.37-core.tar.bz2 has been uploaded. It contains a core of ndb_mgmd taken when the process was running at 100% CPU and had about 998 stalled network connections.
Below is the "bt full" output from the core:

#0  0xb7f35424 in __kernel_vsyscall ()
No symbol table info available.
#1  0xb7f20352 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/i686/cmov/libpthread.so.0
No symbol table info available.
#2  0x08121ebc in NdbCondition_WaitTimeout (p_cond=0x906d6d0, p_mutex=0x904c6c0, msecs=0) at NdbCondition.c:98
	result = <value optimized out>
	abstime = {tv_sec = 1250119826, tv_nsec = 468774000}
	secs = 1481179
#3  0x080d060c in ArbitMgr::sendSignalToThread (this=0x9043908, aSignal=@0xb7392070) at ClusterMgr.cpp:684
No locals.
#4  0x080d1725 in ArbitMgr::doStart (this=0x9043908, theData=0x908aa04) at ClusterMgr.cpp:608
	value = (void *) 0xb73920a8
	aSignal = {gsn = 483, data = {sender = 0, code = 3, node = 0, ticket = {data = {0, 0}, static DataLength = 2, static TextLength = <optimized out>}, 
    mask = {<BitmaskPOD<2u>> = {rep = {data = {0, 0}}, static Size = 2, static NotFound = <optimized out>, 
        static TextLength = <optimized out>}, <No data fields>}, static SignalLength = 7}, timestamp = 1250057525765}
#5  0x080d7d6c in execute (callbackObj=0x907b600, header=0xb7392208, prio=1 '\001', theData=0x908aa04, ptr=0xb7392224) at TransporterFacade.cpp:299
	clusterMgr = <value optimized out>
	theFacade = (TransporterFacade *) 0x907b600
	tRecBlockNo = <value optimized out>
#6  0x08149be5 in TransporterRegistry::unpack (this=0x907a630, readPtr=0x908a9f4, sizeOfData=116, remoteNodeId=<value optimized out>, state=NoHalt)
    at Packer.cpp:113
	messageLenBytes = 44
	signalData = (Uint32 *) 0x908aa04
	sectionData = (Uint32 *) 0x80
	messageLen32 = <value optimized out>
	sectionPtr = (Uint32 *) 0xb7392208
	signalHeader = {theVerId_signalNumber = 477, theReceiversBlockNumber = 4002, theSendersBlockRef = 16515074, theLength = 7, 
  theSendersSignalId = 1722612918, theSignalId = 4294967295, theTrace = 0, m_noOfSections = 0 '\0', m_fragmentInfo = 0 '\0'}
	ptr = {{sz = 151562480, p = 0x10000}, {sz = 0, p = 0x0}, {sz = 5, p = 0x0}}
	usedData = 260
	loop_count = 7
#7  0x081051f9 in TransporterRegistry::performReceive (this=0x907a630) at TransporterRegistry.cpp:823
	ptr = (Uint32 *) 0x908a8f0
	sz = 376
	szUsed = <value optimized out>
	t = (class TCP_Transporter *) 0x9058300
	nodeId = 2
	socket = <value optimized out>
	i = 0
#8  0x080d763d in TransporterFacade::threadMainReceive (this=0x907b600) at TransporterFacade.cpp:513
	i = 6
#9  0x080d76bd in runReceiveResponse_C (me=0x907b600) at TransporterFacade.cpp:485
No locals.
#10 0x08121be8 in ndb_thread_wrapper (_ss=0x9051f78) at NdbThread.c:81
	ret = <value optimized out>
	_db_func_ = 0x81a56d1 "?func"
	_db_file_ = 0x81a56d7 "?file"
	_db_level_ = 1
	_db_framep_ = (char **) 0x0
#11 0xb7f1c4c0 in start_thread () from /lib/i686/cmov/libpthread.so.0
No symbol table info available.
#12 0xb7cfc6de in clone () from /lib/i686/cmov/libc.so.6
No symbol table info available.
[25 Sep 2009 14:00] Bernd Ocklin
I can verify this on Mac OS X 10.6 and reproduce it simply by starting a 2-node cluster. It is basically the same observation as in bug#47214.
[29 Sep 2009 7:19] Magnus Blåudd
Interesting to see that all the connections seem to be from the ndbd's; it feels like the main problem is to find out why they connect so rapidly.

Seeing a couple of connections in CLOSE_WAIT is normal.

Will test to see how ndb_mgmd handles "too many open files" and add a fix so it doesn't spin and take 100% CPU there.

Probably also dump all active connections to the cluster log when accepting a connection fails.
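Roughly the kind of change I have in mind, as a sketch only (not the actual patch, names are made up): when accept() fails with "too many open files" (EMFILE/ENFILE), back off briefly instead of retrying immediately, so the thread does not spin at 100% CPU and the server gets a chance to reap connections that are already closed.

#include <cerrno>
#include <cstdio>
#include <sys/socket.h>
#include <unistd.h>

void accept_loop(int listen_fd)
{
  for (;;)
  {
    int client = accept(listen_fd, NULL, NULL);
    if (client >= 0)
    {
      /* hand 'client' over to a new session thread ... */
      continue;
    }
    if (errno == EMFILE || errno == ENFILE)
    {
      /* Out of file descriptors: log and back off instead of spinning,
         giving the server time to dispose of closed connections. */
      std::fprintf(stderr, "accept: too many open files, backing off\n");
      sleep(1);
      continue;
    }
    if (errno == EINTR)
      continue;        /* interrupted by a signal, just retry */
    std::perror("accept");
    break;             /* unexpected error, leave the loop */
  }
}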
[29 Sep 2009 7:36] Magnus Blåudd
Wrote a test that connects more and more clients to ndb_mgmd; when it is "full" it takes 100% CPU and effectively hangs the connecting client for some time.

When the client finally times out, ndb_mgmd continues to loop and consume 100% CPU.
[29 Sep 2009 8:28] Emmanuel Bretelle
Hi Magnus,

Thanks for looking into this.

If it helps, I can provide more cores or test things.

Let me know
[29 Sep 2009 9:49] Magnus Blåudd
The ndb_mgmd has a "hidden" command that will print a list of all sessions.  A session is one thread that handles one incoming connection.

The command can be called by telnetting to the ndb_mgmd port and then sending the command "list sessions" followed by an empty line.

Here is an example on my machine after I've run the test that connects until there are no more file handles available. If we're lucky it could give a hint about what is going on and why the ndbd's connect so rapidly.

shell> telnet localhost 1186
Trying 127.0.0.1...
Connected to pilot.
Escape character is '^]'.
list sessions

session: 1
session.1.m_stopSelf: 0
session.1.m_stop: 0
session.1.allocated.nodeid: 0
session.1.parser.buffer.len: 0
session.1.parser.buffer: 
session.1.parser.status: 0
session: 4459
session.4459.m_stopSelf: 0
session.4459.m_stop: 0
session.4459.allocated.nodeid: 0
session.4459.parser.buffer.len: 1
session.4459.parser.buffer: \n
session.4459.parser.status: 0

bye

Connection closed by foreign host.

Another "trick" is to start ndb_mgmd with --verbose, that will make turn on the DEBUG category in it's logger output. So for example when I connect with telnet, it says.
2009-09-29 11:43:46 [MgmtSrvr] DEBUG    -- 127.0.0.1:33878: Connected!

If I send it a command it doesn't know, that will also be logged.

2009-09-29 11:44:30 [MgmtSrvr] DEBUG    -- 127.0.0.1:33878: Unknown command, 'this is a command you dont know'

If I close telnet:
2009-09-29 11:48:13 [MgmtSrvr] DEBUG    -- 127.0.0.1:40961: Disconnected!
[29 Sep 2009 10:26] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/84970
[29 Sep 2009 10:28] Magnus Blåudd
I can also see that the mgmapi function ndb_mgm_connect actually returns 0 when it is refused by the server. Will investigate more and write another patch for that.
[29 Sep 2009 12:43] Magnus Blåudd
The patch for this bug is enough (and approved). I filed a new bug, "Bug #47712 ndb_mgm_connect returns ok but connection is not working", for the problem spotted by the test case for this bug.
[29 Sep 2009 12:47] Magnus Blåudd
Hi Emmanuel,

please use the tricks described above and see if you can figure out why you are getting so many connection attempts from the ndbd(s).

With the above patch, ndb_mgmd will not spin at 100% CPU but instead take a step back and see if it can dispose of any closed connections. Maybe that helps, but if you get connection bursts from the ndbd(s) it would be interesting to figure out why they do that. Check the cluster and error logs to see if they contain anything interesting.

/ Magnus
[29 Sep 2009 12:48] Magnus Blåudd
Not related to bug#47214 so I'm removing OS X from the list
[29 Sep 2009 16:13] Emmanuel Bretelle
Hi Magnus,

Thanks for these explanations and patch.

A few hunks of the patch do not apply against 5.1.37, so I will get it ported and try it later.

I don't know if this is useful to you, but here is the output of "list sessions":

root@chef-mvno-vmdbman-01:~# lsof -i -p 14914 | grep ndb
COMMAND    PID        USER   FD   TYPE     DEVICE SIZE NODE NAME
ndb_mgmd 14914        root    4u  IPv4 1193329038       TCP *:1186 (LISTEN)
ndb_mgmd 14914        root    9u  IPv4 1193343961       TCP mvno-vmdbman-01.chef.elephant:1186->mvno-vmdb-02.chef.elephant:39344 (ESTABLISHED)
ndb_mgmd 14914        root   10u  IPv4 1193343953       TCP mvno-vmdbman-01.chef.elephant:1186->mvno-vmdb-01.chef.elephant:38768 (ESTABLISHED)
ndb_mgmd 14914        root   11u  IPv4 1193341727       TCP mvno-vmdbman-01.chef.elephant:1186->mvno-vmdb-01.chef.elephant:54006 (CLOSE_WAIT)
ndb_mgmd 14914        root   13u  IPv4 1193343962       TCP mvno-vmdbman-01.chef.elephant:1186->mvno-vmdb-01.chef.elephant:36138 (ESTABLISHED)

root@chef-mvno-vmdbman-01:~# telnet 127.0.0.1 1186 
Trying 127.0.0.1...
Connected to 127.0.0.1.
Escape character is '^]'.
list sessions

session: 91
session.91.m_stopSelf: 0
session.91.m_stop: 0
session.91.allocated.nodeid: 0
session.91.parser.buffer.len: 1
session.91.parser.buffer: \n
session.91.parser.status: 0
session: 90
session.90.m_stopSelf: 0
session.90.m_stop: 0
session.90.allocated.nodeid: 0
session.90.parser.buffer.len: 0
session.90.parser.buffer: 
session.90.parser.status: 0
session: 89
session.89.m_stopSelf: 0
session.89.m_stop: 0
session.89.allocated.nodeid: 0
session.89.parser.buffer.len: 0
session.89.parser.buffer: 
session.89.parser.status: 0

bye

Connection closed by foreign host.

For now, I am bumping all client and manager versions to 5.1.39. I will be running each of them with the --verbose option, and I will also run a tcpdump session on the manager to monitor every opening and closing of network connections on port 1186.

In the case I am witnessing, I am not sure the issue is due to clients connecting too fast; rather, I feel like the connections are not closed successfully on the server side.

I feel like the client closes the connection, but the server does not shut it down.

Well, anyway, I will come back with an update in some time.
[8 Oct 2009 12:25] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/86150
[8 Oct 2009 12:25] Magnus Blåudd
Pushed to 7.0
[8 Oct 2009 12:25] Bugs System
Pushed into 5.1.39-ndb-7.0.9 (revid:magnus.blaudd@sun.com-20091008122418-ugsfietfwq4vy3nl) (version source revid:magnus.blaudd@sun.com-20091008122418-ugsfietfwq4vy3nl) (merge vers: 5.1.39-ndb-7.0.9) (pib:11)
[8 Oct 2009 13:31] Bugs System
Pushed into 5.1.39-ndb-7.1.0 (revid:frazer@mysql.com-20091008132902-a4xc1mn6m82i4ix4) (version source revid:frazer@mysql.com-20091008132902-a4xc1mn6m82i4ix4) (merge vers: 5.1.39-ndb-7.1.0) (pib:11)
[8 Oct 2009 14:47] Jon Stephens
Documented bugfix in the NDB-7.0.9 changelog as follows:

        ndb_mgmd failed to close client connections that had timed out.
        After running for some time, a race condition could develop due
        to ndb_mgmd having exhausted all of its file descriptors.

        See also Bug #47712.

Closed.
[4 Nov 2009 8:02] Emmanuel Bretelle
New behaviour of ndb_mgmd with Magnus' patch applied

Attachment: ndb_mgmd.txt.png (image/png, text), 5.63 KiB.

[4 Nov 2009 8:03] Emmanuel Bretelle
Magnus' patch ported to 5.1.39

Attachment: bug45497_patch_port_to_5.1.39.diff (text/x-diff), 3.87 KiB.

[4 Nov 2009 8:15] Emmanuel Bretelle
I have ported the patch to run on MySQL 5.1.39 (see bug45497_patch_port_to_5.1.39.diff) and have had it running for some time with success.

As shown in ndb_mgmd.txt.png, connections still go up to 1000+ open connections, but then socket cleanup kicks in and, with time, the number of open file descriptors goes down.
[4 Nov 2009 8:20] Magnus Blåudd
Hi, have you managed to figure out what is connecting so rapidly?

/ Magnus
[4 Nov 2009 12:17] Emmanuel Bretelle
Unfortunately not; I have not had the time to investigate this.

I will try to check in logs on the client side whenever I have time.

Cheers
[4 Nov 2009 15:59] Emmanuel Bretelle
Hmm,

Maybe I jumped to my conclusion too fast...

Hung connections started between 2am and 3am on 28/10/2009. File exhaustion occurred around 10am on the same day.

On the client side, I do not see anything strange that would explain why so many connections are created. But once it reached 1000+ connections in CLOSE_WAIT state, the "list sessions" command hangs:
# telnet 127.0.0.1 1186 
Trying 127.0.0.1...
Connected to 127.0.0.1.
Escape character is '^]'.
list sessions

bye

^C

^]

ndb_mgm hangs as well....

ndb_mgm> show

Warning, event thread startup failed, degraded printouts as result, errno=115
Connected to Management Server at: localhost:1186
^C

So, I am not sure my port is correct :)

The clients show:
2009-10-28 09:39:36 [ndbd] WARNING  -- Watchdog: Warning overslept 204 ms, expected 0 ms.
2009-10-28 09:39:36 [ndbd] WARNING  -- Ndb kernel is stuck in: Polling for Receive
2009-10-28 09:39:36 [ndbd] INFO     -- Watchdog: User time: 14100  System time: 616306
2009-10-28 09:47:34 [ndbd] INFO     -- Watchdog: User time: 14101  System time: 616335
2009-10-28 09:47:34 [ndbd] WARNING  -- Watchdog: Warning overslept 212 ms, expected 0 ms.
2009-10-28 09:50:44 [ndbd] INFO     -- Watchdog: User time: 14102  System time: 616344
2009-10-28 09:50:44 [ndbd] WARNING  -- Watchdog: Warning overslept 223 ms, expected 0 ms.
2009-10-28 10:01:37 [ndbd] INFO     -- Watchdog: User time: 14102  System time: 616385
2009-10-28 10:01:37 [ndbd] WARNING  -- Watchdog: Warning overslept 214 ms, expected 0 ms.
2009-10-28 10:02:55 [ndbd] INFO     -- Watchdog: User time: 14103  System time: 616389
2009-10-28 10:02:55 [ndbd] WARNING  -- Watchdog: Warning overslept 248 ms, expected 0 ms.
2009-10-28 10:43:27 [ndbd] ERROR    -- Error: Time out talking to management server: 110
2009-10-28 10:43:27 [ndbd] ERROR    -- TransporterRegistry.cpp: 1395
2009-10-28 10:44:27 [ndbd] ERROR    -- Error: Time out talking to management server: 110
2009-10-28 10:44:27 [ndbd] ERROR    -- TransporterRegistry.cpp: 1395
2009-10-28 10:45:27 [ndbd] ERROR    -- Error: Time out talking to management server: 110
2009-10-28 10:45:27 [ndbd] ERROR    -- TransporterRegistry.cpp: 1395
2009-10-28 10:46:27 [ndbd] ERROR    -- Error: Time out talking to management server: 110
2009-10-28 10:46:27 [ndbd] ERROR    -- TransporterRegistry.cpp: 1395

On the server side, on the other hand, the logs stop when the number of connections starts to increase:

2009-10-27 22:51:41 [MgmSrvr] WARNING  -- Node 2: Node 1 missed heartbeat 2
2009-10-27 22:51:41 [MgmSrvr] WARNING  -- Node 3: Node 1 missed heartbeat 2
2009-10-27 22:51:47 [MgmSrvr] WARNING  -- Node 2: Node 1 missed heartbeat 2
2009-10-27 22:51:47 [MgmSrvr] WARNING  -- Node 3: Node 1 missed heartbeat 2
2009-10-27 22:51:55 [MgmSrvr] WARNING  -- Node 2: Node 1 missed heartbeat 2
2009-10-27 22:51:55 [MgmSrvr] WARNING  -- Node 2: Node 1 missed heartbeat 3
2009-10-27 22:51:55 [MgmSrvr] WARNING  -- Node 3: Node 1 missed heartbeat 2
2009-10-27 22:51:55 [MgmSrvr] WARNING  -- Node 3: Node 1 missed heartbeat 3
2009-10-27 22:51:55 [MgmSrvr] ALERT    -- Node 1: Node 2 Disconnected
2009-10-27 22:51:55 [MgmSrvr] ALERT    -- Node 1: Node 3 Disconnected
2009-10-27 22:51:55 [MgmSrvr] INFO     -- Node 1: Node 2 Connected
2009-10-27 22:51:57 [MgmSrvr] INFO     -- Node 2: Started arbitrator node 1 [ticket=105a01bd97fbfafb]
2009-10-27 22:52:57 [MgmSrvr] INFO     -- Node 1: Node 3 Connected
2009-10-27 23:21:56 [MgmSrvr] INFO     -- Node 2: Local checkpoint 4952 started. Keep GCI = 7944347 oldest restorable GCI = 7521450
2009-10-28 00:21:01 [MgmSrvr] INFO     -- Node 2: Local checkpoint 4953 started. Keep GCI = 7945965 oldest restorable GCI = 7521450
2009-10-28 01:19:58 [MgmSrvr] INFO     -- Node 2: Local checkpoint 4954 started. Keep GCI = 7947584 oldest restorable GCI = 7521450
2009-10-28 02:18:56 [MgmSrvr] INFO     -- Node 2: Local checkpoint 4955 started. Keep GCI = 7949202 oldest restorable GCI = 7521450
2009-10-28 02:20:58 [MgmSrvr] WARNING  -- Node 2: Node 1 missed heartbeat 2
2009-10-28 02:20:58 [MgmSrvr] WARNING  -- Node 3: Node 1 missed heartbeat 2
2009-10-28 02:21:06 [MgmSrvr] WARNING  -- Node 2: Node 1 missed heartbeat 2
2009-10-28 02:21:09 [MgmSrvr] WARNING  -- Node 3: Node 1 missed heartbeat 2
2009-10-28 02:21:24 [MgmSrvr] ALERT    -- Node 1: Node 3 Disconnected
2009-10-28 02:21:24 [MgmSrvr] ALERT    -- Node 1: Node 2 Disconnected
2009-10-28 02:21:25 [MgmSrvr] INFO     -- Node 1: Node 3 Connected
2009-10-28 02:21:25 [MgmSrvr] INFO     -- Node 1: Node 2 Connected

Magnus, could I ask you to review the ported patch and see if there is any obvious error that would cause the patch not to fix the issue?

In the meantime, I will remove the patch, re-run MySQL without it, and see if I can find out why these connections get generated like crazy.
[6 Nov 2009 13:54] Emmanuel Bretelle
I was running vanilla 5.1.39 this time and the issue arose quite fast.
I have one trace obtained with:
tcpdump -i eth0 '( tcp[tcpflags] &  (tcp-syn|tcp-fin) != 0 ) and port 1186' -s 0 -w connection_ndb.cap

and here are some timestamps with the number of connections in CLOSE_WAIT state:
04/11/2009 19:00:02 -> 1st CLOSE_WAIT
05/11/2009 19:40:02 -> 104 CLOSE_WAIT
05/11/2009 20:30:01 -> 203 CLOSE_WAIT
05/11/2009 21:20:02 -> 302 CLOSE_WAIT
05/11/2009 22:15:01 -> 408 CLOSE_WAIT
05/11/2009 23:05:01 -> 507 CLOSE_WAIT
05/11/2009 23:55:01 -> 604 CLOSE_WAIT
06/11/2009 00:45:02 -> 703 CLOSE_WAIT
06/11/2009 01:35:02 -> 803 CLOSE_WAIT
06/11/2009 02:30:02 -> 904 CLOSE_WAIT
06/11/2009 03:25:02 -> 997 CLOSE_WAIT + 100% CPU used
Killed and restarted
06/11/2009 03:45:02 -> 2 CLOSE_WAIT
...
...

In the trace, we can see that from 05/11/2009 19:40:02 we get about 100 new CLOSE_WAIT connections in 50 min.
Looking at the tcpdump trace, we see that every minute, each node is:
1) Sending a FIN/ACK packet to the node manager
2) Creating a new connection.
There are 2 nodes, each creating 1 connection every minute, which matches the increase of 100 CLOSE_WAIT/50 min.

I don't know why such connections happen from the client, but on the server there is definitely something wrong.

In ndb_1_cluster.log I get:
2009-11-05 15:43:43 [MgmSrvr] INFO     -- Node 3: Local checkpoint 5165 started. Keep GCI = 8285418 oldest restorable GCI = 8250058
2009-11-05 16:42:28 [MgmSrvr] INFO     -- Node 3: Local checkpoint 5166 started. Keep GCI = 8287031 oldest restorable GCI = 8250058
2009-11-05 17:41:15 [MgmSrvr] INFO     -- Node 3: Local checkpoint 5167 started. Keep GCI = 8288647 oldest restorable GCI = 8250058
2009-11-05 18:40:07 [MgmSrvr] INFO     -- Node 3: Local checkpoint 5168 started. Keep GCI = 8290260 oldest restorable GCI = 8250058
2009-11-05 18:47:38 [MgmSrvr] WARNING  -- Node 2: Node 1 missed heartbeat 2
2009-11-05 18:47:38 [MgmSrvr] WARNING  -- Node 2: Node 1 missed heartbeat 2
2009-11-05 18:47:38 [MgmSrvr] WARNING  -- Node 3: Node 1 missed heartbeat 2
2009-11-05 18:47:38 [MgmSrvr] WARNING  -- Node 3: Node 1 missed heartbeat 2
2009-11-05 18:47:46 [MgmSrvr] WARNING  -- Node 2: Node 1 missed heartbeat 2
2009-11-05 18:47:46 [MgmSrvr] WARNING  -- Node 2: Node 1 missed heartbeat 3
2009-11-05 18:47:46 [MgmSrvr] WARNING  -- Node 2: Node 1 missed heartbeat 2
2009-11-05 18:47:46 [MgmSrvr] WARNING  -- Node 3: Node 1 missed heartbeat 2
2009-11-05 18:47:46 [MgmSrvr] WARNING  -- Node 3: Node 1 missed heartbeat 3
2009-11-05 18:47:46 [MgmSrvr] WARNING  -- Node 3: Node 1 missed heartbeat 2
2009-11-05 18:47:52 [MgmSrvr] WARNING  -- Node 2: Node 1 missed heartbeat 2
2009-11-05 18:47:52 [MgmSrvr] WARNING  -- Node 2: Node 1 missed heartbeat 3
2009-11-05 18:47:56 [MgmSrvr] WARNING  -- Node 2: Node 1 missed heartbeat 2
2009-11-05 18:47:56 [MgmSrvr] WARNING  -- Node 3: Node 1 missed heartbeat 2
2009-11-05 18:47:56 [MgmSrvr] WARNING  -- Node 3: Node 1 missed heartbeat 2
2009-11-05 18:47:56 [MgmSrvr] WARNING  -- Node 3: Node 1 missed heartbeat 3
2009-11-05 18:48:00 [MgmSrvr] WARNING  -- Node 2: Node 1 missed heartbeat 2
2009-11-05 18:48:00 [MgmSrvr] WARNING  -- Node 3: Node 1 missed heartbeat 2
2009-11-05 18:48:00 [MgmSrvr] WARNING  -- Node 3: Node 1 missed heartbeat 3
2009-11-05 18:48:01 [MgmSrvr] WARNING  -- Node 2: Node 1 missed heartbeat 3
2009-11-05 18:48:01 [MgmSrvr] ALERT    -- Node 1: Node 3 Disconnected
2009-11-05 18:48:02 [MgmSrvr] ALERT    -- Node 1: Node 2 Disconnected
2009-11-05 18:48:04 [MgmSrvr] INFO     -- Node 1: Node 3 Connected
2009-11-05 18:48:07 [MgmSrvr] INFO     -- Node 1: Node 2 Connected
2009-11-05 18:48:13 [MgmSrvr] ALERT    -- Node 1: Node 2 Disconnected
2009-11-06 03:25:36 [MgmSrvr] INFO     -- NDB Cluster Management Server. Version 5.1.39
2009-11-06 03:25:36 [MgmSrvr] INFO     -- Id: 1, Command port: 1186
2009-11-06 03:25:36 [MgmSrvr] INFO     -- Node 1: Node 3 Connected
2009-11-06 03:25:36 [MgmSrvr] INFO     -- Node 1: Node 2 Connected
2009-11-06 03:25:36 [MgmSrvr] INFO     -- Node 3: Started arbitrator node 1 [ticket=75ce0010c5767ce8]
2009-11-06 03:30:24 [MgmSrvr] INFO     -- Node 3: Local checkpoint 5177 started. Keep GCI = 8304772 oldest restorable GCI = 8250058
2009-11-06 03:42:16 [MgmSrvr] WARNING  -- Node 2: Node 1 missed heartbeat 2
2009-11-06 03:42:16 [MgmSrvr] WARNING  -- Node 2: Node 1 missed heartbeat 3

So, from the time the server starts having an increasing number of CLOSE_WAIT connections, we don't get any logs.

Magnus, the patch you provided does fix the 100% CPU usage, but it does not solve this issue yet. At least not under 5.1.39 :s

The tcpdump trace is attached below.

Cheers,
[6 Nov 2009 13:55] Emmanuel Bretelle
connection_ndb.cap

Attachment: connection_ndb.cap (application/cap, text), 362.69 KiB.

[9 Nov 2009 16:58] Magnus Blåudd
I would recommend running MySQL Cluster 6.3 or 7.0 and not using the version from MySQL 5.1.39. Those two versions can be downloaded from http://www.mysql.com/cluster
[20 Nov 2009 13:39] Emmanuel Bretelle
Hi Magnus,

I tried the upgrade route and still have the same issue... to some extent.

Version used is mysql-cluster-gpl-7.0.8a.tar.gz

The management node is not taking 100% CPU, but it seems that the network slots are not usable anymore.

Attached is a log bundle with the management node logs (mysqllogs-nodemgmt) and the logs from the 2 nodes:
mysqllogs-node1 and mysqllogs-node2.

At the time I collected the logs, ndbd was no longer running on the nodes; only ndb_mgmd on the management node was still up.

Using telnet on the management node and typing "list sessions" hangs and does not return anything.

On the nodes, lsof -i | grep 1186 returns:
root@chef-mvno-vmdb-01:/tmp# lsof -i | grep 1186
mysqld  10227       mysql    3u  IPv4 74537234       TCP mvno-vmdb-01.chef.elephant:34424->mvno-vmdbman-01.chef.elephant:1186 (ESTABLISHED)
mysqld  10227       mysql    5u  IPv4 75610808       TCP mvno-vmdb-01.chef.elephant:43627->mvno-vmdbman-01.chef.elephant:1186 (ESTABLISHED)

and

root@chef-mvno-vmdb-02:/tmp# lsof -i | grep 1186
mysqld  27176       mysql    3u  IPv4 74847554       TCP mvno-vmdb-02.chef.elephant:56541->mvno-vmdbman-01.chef.elephant:1186 (ESTABLISHED)
mysqld  27176       mysql    5u  IPv4 76086255       TCP mvno-vmdb-02.chef.elephant:54290->mvno-vmdbman-01.chef.elephant:1186 (ESTABLISHED)

while on the node manager:
root@chef-mvno-vmdbman-01:/tmp# lsof -i | grep 1186 | egrep '34424|43627|56541|54290'
ndb_mgmd  8081        root   13u  IPv4 2265681796       TCP mvno-vmdbman-01.chef.elephant:1186->mvno-vmdb-02.chef.elephant:56541 (ESTABLISHED)
ndb_mgmd  8081        root   16u  IPv4 2265681803       TCP mvno-vmdbman-01.chef.elephant:1186->mvno-vmdb-01.chef.elephant:34424 (ESTABLISHED)
root@chef-mvno-vmdbman-01:/tmp# lsof -i | grep 1186 | wc -l
315

Do you believe it is still related?

Any hints appreciated.

regards,
chantra
[20 Nov 2009 13:44] Emmanuel Bretelle
Logs uploaded to the FTP server. The archive is called mysqllogs-45497.tar.gz
[30 Nov 2009 8:28] Emmanuel Bretelle
Hi Magnus,

Did you get a chance to look at this issue?

Shall a new bug be opened, or should I continue updating this one?

Cheers,
[9 Dec 2009 8:22] Emmanuel Bretelle
Hi Magnus,

Did you get a chance to look at this issue? Any ideas?
[14 Dec 2009 15:38] Magnus Blåudd
Hi,

Sorry about the problem.

Yes, I think you should report a new bug to make it easier to keep things apart.

/ Magnus