Bug #45497 | ndb_mgmd open file exhaustion and 100% CPU usage | ||
---|---|---|---|
Submitted: | 15 Jun 2009 10:55 | Modified: | 9 Dec 2009 8:22 |
Reporter: | Emmanuel Bretelle | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Cluster: Cluster (NDB) storage engine | Severity: | S1 (Critical) |
Version: | mysql-5.1-telco-7.0 | OS: | Any (debian lenny) |
Assigned to: | Magnus Blåudd | CPU Architecture: | Any |
Tags: | 5.1.33, 5.1.35, 5.1.37, 5.1.39 |
[15 Jun 2009 10:55]
Emmanuel Bretelle
[15 Jun 2009 10:57]
Emmanuel Bretelle
netstats, strace, ndb config and config.log
Attachment: bug_logs.tar.bz2 (application/x-bzip, text), 62.52 KiB.
[25 Jun 2009 7:14]
Emmanuel Bretelle
Here is a graph showing how the number of connections in CLOSE_WAIT evolves. ndb_mgmd_watchdog.log-1.png: graph since the last time the service was restarted. ndb_mgmd_watchdog.log-2.png: graph showing only the timeframe "2009-06-25 03:00:00" to "2009-06-25 05:00:00". ndb_mgmd_watchdog.log: logs the number of connections in the CLOSE_WAIT state every 5 minutes. It took about 10 hours before the service started accumulating connections in the CLOSE_WAIT state. The number increased slowly and sporadically. Finally, on 25/06/2009 at 03:35, the service started getting a lot of stalled CLOSE_WAIT connections. In about one hour the number reached 550+, and the service was restarted.
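[Editor's note] The watchdog that produced the attached graphs logs the number of CLOSE_WAIT connections every 5 minutes. A minimal sketch of that counting step, assuming netstat-style input (the function name and field layout are my own, not part of the watchdog script):

```python
# Hypothetical sketch of the watchdog's counting step: count connections
# involving a given local port that are stuck in CLOSE_WAIT, from
# netstat-style output ("proto recv-q send-q local remote state").
# Assumes no header lines match the 6-field pattern.

def count_close_wait(netstat_output: str, port: int = 1186) -> int:
    """Count CLOSE_WAIT connections whose local address uses `port`."""
    count = 0
    for line in netstat_output.splitlines():
        fields = line.split()
        if len(fields) < 6 or fields[5] != "CLOSE_WAIT":
            continue
        local = fields[3]
        # Local address looks like "10.0.0.1:1186"; compare the port part.
        if local.rsplit(":", 1)[-1] == str(port):
            count += 1
    return count
```

Logging this value every 5 minutes (e.g. from cron) would reproduce the counts plotted in the attached graphs.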
[25 Jun 2009 7:15]
Emmanuel Bretelle
ndb_mgmd_watchdog.log-1.png
Attachment: ndb_mgmd_watchdog.log-1.png (image/png, text), 3.20 KiB.
[25 Jun 2009 7:15]
Emmanuel Bretelle
ndb_mgmd_watchdog.log-2.png
Attachment: ndb_mgmd_watchdog.log-2.png (image/png, text), 4.31 KiB.
[25 Jun 2009 7:15]
Emmanuel Bretelle
ndb_mgmd_watchdog.log
Attachment: ndb_mgmd_watchdog.log (text/x-log), 78.66 KiB.
[25 Jun 2009 7:16]
Emmanuel Bretelle
If you guys need any other log... let me know.
[16 Jul 2009 16:26]
Emmanuel Bretelle
new graph
Attachment: ndb_mgmd_watchdog.log-over1month.png (image/png, text), 5.28 KiB.
[16 Jul 2009 16:27]
Emmanuel Bretelle
new logs
Attachment: ndb_mgmd_watchdog.log (application/octet-stream, text), 248.61 KiB.
[4 Aug 2009 10:55]
Emmanuel Bretelle
Uploaded a core and exe with debug symbols to ftp://ftp.mysql.com/pub/mysql/upload/ The file is called bug-data-core-45497.tar.bz2. Hope this helps. Meanwhile, I am giving 5.1.37 a spin in case it helps.
[13 Aug 2009 8:59]
Emmanuel Bretelle
Adding 5.1.37 to the list
[13 Aug 2009 9:00]
Emmanuel Bretelle
bug-data-45497-ndb_mgmd-5.1.37-core.tar.bz2 has been uploaded. It contains a core of ndb_mgmd taken while the process was running at 100% CPU and had about 998 stalled network connections. Below, the bt full of the core:

#0  0xb7f35424 in __kernel_vsyscall ()
No symbol table info available.
#1  0xb7f20352 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/i686/cmov/libpthread.so.0
No symbol table info available.
#2  0x08121ebc in NdbCondition_WaitTimeout (p_cond=0x906d6d0, p_mutex=0x904c6c0, msecs=0) at NdbCondition.c:98
        result = <value optimized out>
        abstime = {tv_sec = 1250119826, tv_nsec = 468774000}
        secs = 1481179
#3  0x080d060c in ArbitMgr::sendSignalToThread (this=0x9043908, aSignal=@0xb7392070) at ClusterMgr.cpp:684
No locals.
#4  0x080d1725 in ArbitMgr::doStart (this=0x9043908, theData=0x908aa04) at ClusterMgr.cpp:608
        value = (void *) 0xb73920a8
        aSignal = {gsn = 483, data = {sender = 0, code = 3, node = 0, ticket = {data = {0, 0}, static DataLength = 2, static TextLength = <optimized out>}, mask = {<BitmaskPOD<2u>> = {rep = {data = {0, 0}}, static Size = 2, static NotFound = <optimized out>, static TextLength = <optimized out>}, <No data fields>}, static SignalLength = 7}, timestamp = 1250057525765}
#5  0x080d7d6c in execute (callbackObj=0x907b600, header=0xb7392208, prio=1 '\001', theData=0x908aa04, ptr=0xb7392224) at TransporterFacade.cpp:299
        clusterMgr = <value optimized out>
        theFacade = (TransporterFacade *) 0x907b600
        tRecBlockNo = <value optimized out>
#6  0x08149be5 in TransporterRegistry::unpack (this=0x907a630, readPtr=0x908a9f4, sizeOfData=116, remoteNodeId=<value optimized out>, state=NoHalt) at Packer.cpp:113
        messageLenBytes = 44
        signalData = (Uint32 *) 0x908aa04
        sectionData = (Uint32 *) 0x80
        messageLen32 = <value optimized out>
        sectionPtr = (Uint32 *) 0xb7392208
        signalHeader = {theVerId_signalNumber = 477, theReceiversBlockNumber = 4002, theSendersBlockRef = 16515074, theLength = 7, theSendersSignalId = 1722612918, theSignalId = 4294967295, theTrace = 0, m_noOfSections = 0 '\0', m_fragmentInfo = 0 '\0'}
        ptr = {{sz = 151562480, p = 0x10000}, {sz = 0, p = 0x0}, {sz = 5, p = 0x0}}
        usedData = 260
        loop_count = 7
#7  0x081051f9 in TransporterRegistry::performReceive (this=0x907a630) at TransporterRegistry.cpp:823
        ptr = (Uint32 *) 0x908a8f0
        sz = 376
        szUsed = <value optimized out>
        t = (class TCP_Transporter *) 0x9058300
        nodeId = 2
        socket = <value optimized out>
        i = 0
#8  0x080d763d in TransporterFacade::threadMainReceive (this=0x907b600) at TransporterFacade.cpp:513
        i = 6
#9  0x080d76bd in runReceiveResponse_C (me=0x907b600) at TransporterFacade.cpp:485
No locals.
#10 0x08121be8 in ndb_thread_wrapper (_ss=0x9051f78) at NdbThread.c:81
        ret = <value optimized out>
        _db_func_ = 0x81a56d1 "?func"
        _db_file_ = 0x81a56d7 "?file"
        _db_level_ = 1
        _db_framep_ = (char **) 0x0
#11 0xb7f1c4c0 in start_thread () from /lib/i686/cmov/libpthread.so.0
No symbol table info available.
#12 0xb7cfc6de in clone () from /lib/i686/cmov/libc.so.6
No symbol table info available.
[25 Sep 2009 14:00]
Bernd Ocklin
I can verify this on Mac OS 10.6 and reproduce it simply by starting a 2-node cluster. It is basically the same observation as in bug#47214.
[29 Sep 2009 7:19]
Magnus Blåudd
Interesting to see that all the connections seem to be from the ndbd's; the major problem may be to find out why they connect so rapidly. Seeing a couple of connections in CLOSE_WAIT is normal. I will test how ndb_mgmd handles "too many open files" and add a fix so it doesn't spin and take 100% CPU there. Probably I will also dump all active connections to the cluster log when a failure to accept a connection occurs.
[29 Sep 2009 7:36]
Magnus Blåudd
Wrote a test that connects more and more clients to ndb_mgmd; when it is "full" it takes 100% CPU and effectively hangs the connecting client for some time. When the client finally times out, ndb_mgmd continues to loop and consume 100% CPU.
[29 Sep 2009 8:28]
Emmanuel Bretelle
Hi Magnus, Thanks for looking into this. If it helps, I can provide more cores or test things. Let me know.
[29 Sep 2009 9:49]
Magnus Blåudd
The ndb_mgmd has a "hidden" command that prints a list of all sessions. A session is one thread that handles one incoming connection. The command can be invoked by telnetting to the ndb_mgmd port and sending the command "list sessions" followed by an empty line. Here is an example on my machine after I have run the test that connects until no more file handles are available. If we are lucky it could give a hint about what is going on and why the ndbd's connect so rapidly.

shell> telnet localhost 1186
Trying 127.0.0.1...
Connected to pilot.
Escape character is '^]'.
list sessions

session: 1
session.1.m_stopSelf: 0
session.1.m_stop: 0
session.1.allocated.nodeid: 0
session.1.parser.buffer.len: 0
session.1.parser.buffer:
session.1.parser.status: 0
session: 4459
session.4459.m_stopSelf: 0
session.4459.m_stop: 0
session.4459.allocated.nodeid: 0
session.4459.parser.buffer.len: 1
session.4459.parser.buffer: \n
session.4459.parser.status: 0
bye
Connection closed by foreign host.

Another "trick" is to start ndb_mgmd with --verbose; that turns on the DEBUG category in its logger output. So for example when I connect with telnet, it says:
2009-09-29 11:43:46 [MgmtSrvr] DEBUG -- 127.0.0.1:33878: Connected!
If I send it a command it doesn't know, that will also be logged:
2009-09-29 11:44:30 [MgmtSrvr] DEBUG -- 127.0.0.1:33878: Unknown command, 'this is a command you dont know'
If I close telnet:
2009-09-29 11:48:13 [MgmtSrvr] DEBUG -- 127.0.0.1:40961: Disconnected!
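[Editor's note] The "list sessions" reply shown above is a flat key/value text format, so it can be parsed mechanically when tracking many sessions. A minimal sketch assuming that format; the function name and return layout are my own:

```python
# Parse the reply of ndb_mgmd's "list sessions" command (format as shown
# in the transcript above) into one dict of properties per session id.
# Illustrative sketch only; the real tools do not ship such a parser.

def parse_sessions(reply: str) -> dict:
    """Map session id -> {property: value} from a 'list sessions' reply."""
    sessions = {}
    current = None
    for line in reply.splitlines():
        line = line.strip()
        if line.startswith("session: "):
            # A new session block begins, e.g. "session: 4459".
            current = line.split(": ", 1)[1]
            sessions[current] = {}
        elif current is not None and line.startswith("session."):
            # Property lines look like "session.<id>.<property>: <value>".
            key, _, value = line.partition(":")
            parts = key.split(".", 2)
            if len(parts) == 3:
                sessions[current][parts[2]] = value.strip()
    return sessions
```

With hundreds of stalled sessions, grouping by `parser.buffer.len` or `allocated.nodeid` could quickly show which clients the stuck handlers belong to.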
[29 Sep 2009 10:26]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/84970
[29 Sep 2009 10:28]
Magnus Blåudd
I can also see that the mgmapi function ndb_mgm_connect actually returns 0 (success) when it is refused by the server. Will investigate more and write another patch for that.
[29 Sep 2009 12:43]
Magnus Blåudd
The patch for this bug is enough (and approved) - I filed a new bug, "Bug #47712 ndb_mgm_connect returns ok but connection is not working", for the problem spotted by the test case for this bug.
[29 Sep 2009 12:47]
Magnus Blåudd
Hi Emmanuel, please use the tricks described above and see if you can figure out why you are getting so many connection attempts from the ndbd(s). With the above patch, ndb_mgmd will not spin at 100% but will instead take a step back and see if it can dispose of any closed connections. Maybe that helps, but if you get connection bursts from the ndbd(s) it would be interesting to try to figure out why they do that. Check the cluster and error logs to see if they write anything interesting. / Magnus
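[Editor's note] The patched behaviour described here (back off on accept() failure instead of spinning, then reap closed connections) could be sketched roughly as below. This is my own illustration, not the actual patch; the helper name, the `reap` callback, and the delay are assumptions:

```python
import errno
import time

def handle_accept_failure(err, sessions, reap, delay=1.0):
    """Illustrative sketch (not the real patch): when accept() fails
    because file descriptors are exhausted (EMFILE/ENFILE), sleep
    briefly instead of busy-looping at 100% CPU, then dispose of any
    sessions whose connection has already been closed by the peer.

    `reap(session)` is a hypothetical callback returning True when the
    session was dead and has been freed."""
    if err in (errno.EMFILE, errno.ENFILE):
        time.sleep(delay)  # back off instead of retrying accept() in a tight loop
        sessions[:] = [s for s in sessions if not reap(s)]
```

The key point is that freeing CLOSE_WAIT sessions returns file descriptors to the process, so the next accept() can succeed rather than failing again immediately.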
[29 Sep 2009 12:48]
Magnus Blåudd
Not related to bug#47214 so I'm removing OS X from the list
[29 Sep 2009 16:13]
Emmanuel Bretelle
Hi Magnus, Thanks for these explanations and the patch. A few hunks of the patch do not apply against 5.1.37, so I will get it ported and try it later. I don't know if this is useful to you, but here is the output of "list sessions":

root@chef-mvno-vmdbman-01:~# lsof -i -p 14914 | grep ndb
COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME
ndb_mgmd 14914 root 4u IPv4 1193329038 TCP *:1186 (LISTEN)
ndb_mgmd 14914 root 9u IPv4 1193343961 TCP mvno-vmdbman-01.chef.elephant:1186->mvno-vmdb-02.chef.elephant:39344 (ESTABLISHED)
ndb_mgmd 14914 root 10u IPv4 1193343953 TCP mvno-vmdbman-01.chef.elephant:1186->mvno-vmdb-01.chef.elephant:38768 (ESTABLISHED)
ndb_mgmd 14914 root 11u IPv4 1193341727 TCP mvno-vmdbman-01.chef.elephant:1186->mvno-vmdb-01.chef.elephant:54006 (CLOSE_WAIT)
ndb_mgmd 14914 root 13u IPv4 1193343962 TCP mvno-vmdbman-01.chef.elephant:1186->mvno-vmdb-01.chef.elephant:36138 (ESTABLISHED)

root@chef-mvno-vmdbman-01:~# telnet 127.0.0.1 1186
Trying 127.0.0.1...
Connected to 127.0.0.1.
Escape character is '^]'.
list sessions

session: 91
session.91.m_stopSelf: 0
session.91.m_stop: 0
session.91.allocated.nodeid: 0
session.91.parser.buffer.len: 1
session.91.parser.buffer: \n
session.91.parser.status: 0
session: 90
session.90.m_stopSelf: 0
session.90.m_stop: 0
session.90.allocated.nodeid: 0
session.90.parser.buffer.len: 0
session.90.parser.buffer:
session.90.parser.status: 0
session: 89
session.89.m_stopSelf: 0
session.89.m_stop: 0
session.89.allocated.nodeid: 0
session.89.parser.buffer.len: 0
session.89.parser.buffer:
session.89.parser.status: 0
bye
Connection closed by foreign host.

For now, I am bumping all versions of clients + manager to 5.1.39. I will run each of them with the --verbose option, and I will also run a tcpdump session on the manager to monitor every opening and closing of network connections on port 1186. In the case I am witnessing, I am not sure the issue is due to clients connecting too fast; I rather feel that the connections are not closed successfully on the server side. I feel that the client closes the connection, but the server does not shut it down. Well, anyway, I will come back with an update in some time.
[8 Oct 2009 12:25]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/86150
[8 Oct 2009 12:25]
Magnus Blåudd
Pushed to 7.0
[8 Oct 2009 12:25]
Bugs System
Pushed into 5.1.39-ndb-7.0.9 (revid:magnus.blaudd@sun.com-20091008122418-ugsfietfwq4vy3nl) (version source revid:magnus.blaudd@sun.com-20091008122418-ugsfietfwq4vy3nl) (merge vers: 5.1.39-ndb-7.0.9) (pib:11)
[8 Oct 2009 13:31]
Bugs System
Pushed into 5.1.39-ndb-7.1.0 (revid:frazer@mysql.com-20091008132902-a4xc1mn6m82i4ix4) (version source revid:frazer@mysql.com-20091008132902-a4xc1mn6m82i4ix4) (merge vers: 5.1.39-ndb-7.1.0) (pib:11)
[8 Oct 2009 14:47]
Jon Stephens
Documented bugfix in the NDB-7.0.9 changelog as follows: ndb_mgmd failed to close client connections that had timed out. After running for some time, a race condition could develop due to ndb_mgmd having exhausted all of its file descriptors. See also Bug #47712. Closed.
[4 Nov 2009 8:02]
Emmanuel Bretelle
new behaviour of ndb_mgmd with Magnus' patch applied
Attachment: ndb_mgmd.txt.png (image/png, text), 5.63 KiB.
[4 Nov 2009 8:03]
Emmanuel Bretelle
Magnus Patch ported to 5.1.39
Attachment: bug45497_patch_port_to_5.1.39.diff (text/x-diff), 3.87 KiB.
[4 Nov 2009 8:15]
Emmanuel Bretelle
I have ported the patch to run on MySQL 5.1.39 (see bug45497_patch_port_to_5.1.39.diff) and have had it running for some time with success. As shown in ndb_mgmd.txt.png, connections still go up to 1000+ open connections, but then the socket cleanup kicks in and, over time, the number of open file descriptors goes down.
[4 Nov 2009 8:20]
Magnus Blåudd
Hi, have you managed to figure out what is connecting so rapidly? / Magnus
[4 Nov 2009 12:17]
Emmanuel Bretelle
Unfortunately not; I did not take/have the time to investigate this. I will try to check the logs on the client side whenever I have time. Cheers
[4 Nov 2009 15:59]
Emmanuel Bretelle
Hum, maybe I jumped to my conclusion too fast... Hung connections started between 2am and 3am on 28/10/2009; file exhaustion happened around 10am on the same day. On the client side, I do not see anything strange that would explain why so many connections are created. But, once it reached 1000+ connections in the CLOSE_WAIT state, the "list sessions" command hangs:

# telnet 127.0.0.1 1186
Trying 127.0.0.1...
Connected to 127.0.0.1.
Escape character is '^]'.
list sessions
bye
^C
^]

ndb_mgm hangs as well:

ndb_mgm> show
Warning, event thread startup failed, degraded printouts as result, errno=115
Connected to Management Server at: localhost:1186
^C

So, I am not sure my port is correct :) The clients show:

2009-10-28 09:39:36 [ndbd] WARNING -- Watchdog: Warning overslept 204 ms, expected 0 ms.
2009-10-28 09:39:36 [ndbd] WARNING -- Ndb kernel is stuck in: Polling for Receive
2009-10-28 09:39:36 [ndbd] INFO -- Watchdog: User time: 14100 System time: 616306
2009-10-28 09:47:34 [ndbd] INFO -- Watchdog: User time: 14101 System time: 616335
2009-10-28 09:47:34 [ndbd] WARNING -- Watchdog: Warning overslept 212 ms, expected 0 ms.
2009-10-28 09:50:44 [ndbd] INFO -- Watchdog: User time: 14102 System time: 616344
2009-10-28 09:50:44 [ndbd] WARNING -- Watchdog: Warning overslept 223 ms, expected 0 ms.
2009-10-28 10:01:37 [ndbd] INFO -- Watchdog: User time: 14102 System time: 616385
2009-10-28 10:01:37 [ndbd] WARNING -- Watchdog: Warning overslept 214 ms, expected 0 ms.
2009-10-28 10:02:55 [ndbd] INFO -- Watchdog: User time: 14103 System time: 616389
2009-10-28 10:02:55 [ndbd] WARNING -- Watchdog: Warning overslept 248 ms, expected 0 ms.
2009-10-28 10:43:27 [ndbd] ERROR -- Error: Time out talking to management server: 110
2009-10-28 10:43:27 [ndbd] ERROR -- TransporterRegistry.cpp: 1395
2009-10-28 10:44:27 [ndbd] ERROR -- Error: Time out talking to management server: 110
2009-10-28 10:44:27 [ndbd] ERROR -- TransporterRegistry.cpp: 1395
2009-10-28 10:45:27 [ndbd] ERROR -- Error: Time out talking to management server: 110
2009-10-28 10:45:27 [ndbd] ERROR -- TransporterRegistry.cpp: 1395
2009-10-28 10:46:27 [ndbd] ERROR -- Error: Time out talking to management server: 110
2009-10-28 10:46:27 [ndbd] ERROR -- TransporterRegistry.cpp: 1395

On the server side, on the other hand, the logs stop when the number of connections started to increase:

2009-10-27 22:51:41 [MgmSrvr] WARNING -- Node 2: Node 1 missed heartbeat 2
2009-10-27 22:51:41 [MgmSrvr] WARNING -- Node 3: Node 1 missed heartbeat 2
2009-10-27 22:51:47 [MgmSrvr] WARNING -- Node 2: Node 1 missed heartbeat 2
2009-10-27 22:51:47 [MgmSrvr] WARNING -- Node 3: Node 1 missed heartbeat 2
2009-10-27 22:51:55 [MgmSrvr] WARNING -- Node 2: Node 1 missed heartbeat 2
2009-10-27 22:51:55 [MgmSrvr] WARNING -- Node 2: Node 1 missed heartbeat 3
2009-10-27 22:51:55 [MgmSrvr] WARNING -- Node 3: Node 1 missed heartbeat 2
2009-10-27 22:51:55 [MgmSrvr] WARNING -- Node 3: Node 1 missed heartbeat 3
2009-10-27 22:51:55 [MgmSrvr] ALERT -- Node 1: Node 2 Disconnected
2009-10-27 22:51:55 [MgmSrvr] ALERT -- Node 1: Node 3 Disconnected
2009-10-27 22:51:55 [MgmSrvr] INFO -- Node 1: Node 2 Connected
2009-10-27 22:51:57 [MgmSrvr] INFO -- Node 2: Started arbitrator node 1 [ticket=105a01bd97fbfafb]
2009-10-27 22:52:57 [MgmSrvr] INFO -- Node 1: Node 3 Connected
2009-10-27 23:21:56 [MgmSrvr] INFO -- Node 2: Local checkpoint 4952 started. Keep GCI = 7944347 oldest restorable GCI = 7521450
2009-10-28 00:21:01 [MgmSrvr] INFO -- Node 2: Local checkpoint 4953 started. Keep GCI = 7945965 oldest restorable GCI = 7521450
2009-10-28 01:19:58 [MgmSrvr] INFO -- Node 2: Local checkpoint 4954 started. Keep GCI = 7947584 oldest restorable GCI = 7521450
2009-10-28 02:18:56 [MgmSrvr] INFO -- Node 2: Local checkpoint 4955 started. Keep GCI = 7949202 oldest restorable GCI = 7521450
2009-10-28 02:20:58 [MgmSrvr] WARNING -- Node 2: Node 1 missed heartbeat 2
2009-10-28 02:20:58 [MgmSrvr] WARNING -- Node 3: Node 1 missed heartbeat 2
2009-10-28 02:21:06 [MgmSrvr] WARNING -- Node 2: Node 1 missed heartbeat 2
2009-10-28 02:21:09 [MgmSrvr] WARNING -- Node 3: Node 1 missed heartbeat 2
2009-10-28 02:21:24 [MgmSrvr] ALERT -- Node 1: Node 3 Disconnected
2009-10-28 02:21:24 [MgmSrvr] ALERT -- Node 1: Node 2 Disconnected
2009-10-28 02:21:25 [MgmSrvr] INFO -- Node 1: Node 3 Connected
2009-10-28 02:21:25 [MgmSrvr] INFO -- Node 1: Node 2 Connected

Magnus, could I ask you to review the ported patch and see if there is any obvious error that would cause the patch not to fix the issue? In the meantime, I will remove it, re-run MySQL, and see if I can find out why such connections get generated like crazy.
[6 Nov 2009 13:54]
Emmanuel Bretelle
I was running vanilla 5.1.39 this time, and the issue arose quite fast. I have one trace obtained with:

tcpdump -i eth0 '( tcp[tcpflags] & (tcp-syn|tcp-fin) != 0 ) and port 1186' -s 0 -w connection_ndb.cap

and here are some timestamps with the number of connections in the CLOSE_WAIT state:

04/11/2009 19:00:02 -> 1st CLOSE_WAIT
05/11/2009 19:40:02 -> 104 CLOSE_WAIT
05/11/2009 20:30:01 -> 203 CLOSE_WAIT
05/11/2009 21:20:02 -> 302 CLOSE_WAIT
05/11/2009 22:15:01 -> 408 CLOSE_WAIT
05/11/2009 23:05:01 -> 507 CLOSE_WAIT
05/11/2009 23:55:01 -> 604 CLOSE_WAIT
06/11/2009 00:45:02 -> 703 CLOSE_WAIT
06/11/2009 01:35:02 -> 803 CLOSE_WAIT
06/11/2009 02:30:02 -> 904 CLOSE_WAIT
06/11/2009 03:25:02 -> 997 CLOSE_WAIT + 100% CPU used
Killed and restarted
06/11/2009 03:45:02 -> 2 CLOSE_WAIT
...

In the trace, we can see that from 05/11/2009 19:40:02 we get about 100 new CLOSE_WAIT connections in 50 minutes. Looking at the tcpdump trace, we see that every minute each node is: 1) sending a FIN/ACK packet to the node manager, 2) creating a new connection. There are 2 nodes, each creating one connection every minute; this matches the increase of 100 CLOSE_WAIT/50 min. I don't know why the client creates connections like this, but on the server there is definitely something wrong. In ndb_1_cluster.log I get:

2009-11-05 15:43:43 [MgmSrvr] INFO -- Node 3: Local checkpoint 5165 started. Keep GCI = 8285418 oldest restorable GCI = 8250058
2009-11-05 16:42:28 [MgmSrvr] INFO -- Node 3: Local checkpoint 5166 started. Keep GCI = 8287031 oldest restorable GCI = 8250058
2009-11-05 17:41:15 [MgmSrvr] INFO -- Node 3: Local checkpoint 5167 started. Keep GCI = 8288647 oldest restorable GCI = 8250058
2009-11-05 18:40:07 [MgmSrvr] INFO -- Node 3: Local checkpoint 5168 started. Keep GCI = 8290260 oldest restorable GCI = 8250058
2009-11-05 18:47:38 [MgmSrvr] WARNING -- Node 2: Node 1 missed heartbeat 2
2009-11-05 18:47:38 [MgmSrvr] WARNING -- Node 2: Node 1 missed heartbeat 2
2009-11-05 18:47:38 [MgmSrvr] WARNING -- Node 3: Node 1 missed heartbeat 2
2009-11-05 18:47:38 [MgmSrvr] WARNING -- Node 3: Node 1 missed heartbeat 2
2009-11-05 18:47:46 [MgmSrvr] WARNING -- Node 2: Node 1 missed heartbeat 2
2009-11-05 18:47:46 [MgmSrvr] WARNING -- Node 2: Node 1 missed heartbeat 3
2009-11-05 18:47:46 [MgmSrvr] WARNING -- Node 2: Node 1 missed heartbeat 2
2009-11-05 18:47:46 [MgmSrvr] WARNING -- Node 3: Node 1 missed heartbeat 2
2009-11-05 18:47:46 [MgmSrvr] WARNING -- Node 3: Node 1 missed heartbeat 3
2009-11-05 18:47:46 [MgmSrvr] WARNING -- Node 3: Node 1 missed heartbeat 2
2009-11-05 18:47:52 [MgmSrvr] WARNING -- Node 2: Node 1 missed heartbeat 2
2009-11-05 18:47:52 [MgmSrvr] WARNING -- Node 2: Node 1 missed heartbeat 3
2009-11-05 18:47:56 [MgmSrvr] WARNING -- Node 2: Node 1 missed heartbeat 2
2009-11-05 18:47:56 [MgmSrvr] WARNING -- Node 3: Node 1 missed heartbeat 2
2009-11-05 18:47:56 [MgmSrvr] WARNING -- Node 3: Node 1 missed heartbeat 2
2009-11-05 18:47:56 [MgmSrvr] WARNING -- Node 3: Node 1 missed heartbeat 3
2009-11-05 18:48:00 [MgmSrvr] WARNING -- Node 2: Node 1 missed heartbeat 2
2009-11-05 18:48:00 [MgmSrvr] WARNING -- Node 3: Node 1 missed heartbeat 2
2009-11-05 18:48:00 [MgmSrvr] WARNING -- Node 3: Node 1 missed heartbeat 3
2009-11-05 18:48:01 [MgmSrvr] WARNING -- Node 2: Node 1 missed heartbeat 3
2009-11-05 18:48:01 [MgmSrvr] ALERT -- Node 1: Node 3 Disconnected
2009-11-05 18:48:02 [MgmSrvr] ALERT -- Node 1: Node 2 Disconnected
2009-11-05 18:48:04 [MgmSrvr] INFO -- Node 1: Node 3 Connected
2009-11-05 18:48:07 [MgmSrvr] INFO -- Node 1: Node 2 Connected
2009-11-05 18:48:13 [MgmSrvr] ALERT -- Node 1: Node 2 Disconnected
2009-11-06 03:25:36 [MgmSrvr] INFO -- NDB Cluster Management Server. Version 5.1.39
2009-11-06 03:25:36 [MgmSrvr] INFO -- Id: 1, Command port: 1186
2009-11-06 03:25:36 [MgmSrvr] INFO -- Node 1: Node 3 Connected
2009-11-06 03:25:36 [MgmSrvr] INFO -- Node 1: Node 2 Connected
2009-11-06 03:25:36 [MgmSrvr] INFO -- Node 3: Started arbitrator node 1 [ticket=75ce0010c5767ce8]
2009-11-06 03:30:24 [MgmSrvr] INFO -- Node 3: Local checkpoint 5177 started. Keep GCI = 8304772 oldest restorable GCI = 8250058
2009-11-06 03:42:16 [MgmSrvr] WARNING -- Node 2: Node 1 missed heartbeat 2
2009-11-06 03:42:16 [MgmSrvr] WARNING -- Node 2: Node 1 missed heartbeat 3

So, from the time the server starts having an increasing number of CLOSE_WAIT connections, we don't get any logs. Magnus, the patch you provided does fix the 100% CPU, but it does not solve this issue yet, at least under 5.1.39 :s. Below find the tcpdump trace. Cheers,
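[Editor's note] The growth rate derived in the comment above (2 data nodes, one abandoned connection per node per minute, so roughly 100 new CLOSE_WAIT sockets every 50 minutes) checks out arithmetically:

```python
# Back-of-the-envelope check of the leak rate derived from the tcpdump
# trace: each of the 2 data nodes leaves one connection in CLOSE_WAIT
# per minute, so the pile grows by about 2/min, i.e. 100 every 50 min.

nodes = 2
reconnects_per_node_per_min = 1
leaked_per_min = nodes * reconnects_per_node_per_min

assert leaked_per_min * 50 == 100
# Observed: 104 CLOSE_WAIT at 19:40:02 -> 203 at 20:30:01,
# i.e. 99 new connections in roughly 50 minutes, matching the rate.
assert 203 - 104 == 99
```

At this rate, exhausting a default limit of ~1024 file descriptors takes under 9 hours, which fits the overnight failures reported in this thread.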
[6 Nov 2009 13:55]
Emmanuel Bretelle
connection_ndb.cap
Attachment: connection_ndb.cap (application/cap, text), 362.69 KiB.
[9 Nov 2009 16:58]
Magnus Blåudd
I would recommend running MySQL Cluster 6.3 or 7.0 rather than the version from MySQL 5.1.39. Those two versions can be downloaded from http://www.mysql.com/cluster
[20 Nov 2009 13:39]
Emmanuel Bretelle
Hi Magnus, I tried the upgrade route and still have the same issue... to some extent. The version used is mysql-cluster-gpl-7.0.8a.tar.gz. The node mgmt is not taking 100% CPU, but it seems that the network slots are not usable anymore. Attached is a log bundle with the mysqllogs-nodemgmt logs and those of the 2 nodes: mysqllogs-node1 and mysqllogs-node2. At the time I collected the logs, ndb was not running on the nodes anymore; only the node manager's ndb was still up. Using telnet on the node manager and typing "list sessions" hangs and does not return anything. On the nodes, lsof -i | grep 1186 returns:

root@chef-mvno-vmdb-01:/tmp# lsof -i | grep 1186
mysqld 10227 mysql 3u IPv4 74537234 TCP mvno-vmdb-01.chef.elephant:34424->mvno-vmdbman-01.chef.elephant:1186 (ESTABLISHED)
mysqld 10227 mysql 5u IPv4 75610808 TCP mvno-vmdb-01.chef.elephant:43627->mvno-vmdbman-01.chef.elephant:1186 (ESTABLISHED)

and

root@chef-mvno-vmdb-02:/tmp# lsof -i | grep 1186
mysqld 27176 mysql 3u IPv4 74847554 TCP mvno-vmdb-02.chef.elephant:56541->mvno-vmdbman-01.chef.elephant:1186 (ESTABLISHED)
mysqld 27176 mysql 5u IPv4 76086255 TCP mvno-vmdb-02.chef.elephant:54290->mvno-vmdbman-01.chef.elephant:1186 (ESTABLISHED)

while on the node manager:

root@chef-mvno-vmdbman-01:/tmp# lsof -i | grep 1186 | egrep '34424|43627|56541|54290'
ndb_mgmd 8081 root 13u IPv4 2265681796 TCP mvno-vmdbman-01.chef.elephant:1186->mvno-vmdb-02.chef.elephant:56541 (ESTABLISHED)
ndb_mgmd 8081 root 16u IPv4 2265681803 TCP mvno-vmdbman-01.chef.elephant:1186->mvno-vmdb-01.chef.elephant:34424 (ESTABLISHED)
root@chef-mvno-vmdbman-01:/tmp# lsof -i | grep 1186 | wc -l
315

Do you believe it is still related? Any hints appreciated. Regards, chantra
[20 Nov 2009 13:44]
Emmanuel Bretelle
Logs uploaded on the FTP server. The archive is called mysqllogs-45497.tar.gz
[30 Nov 2009 8:28]
Emmanuel Bretelle
Hi Magnus, Did you get a chance to look at this issue? Shall a new one be opened, or should I continue editing this one? Cheers,
[9 Dec 2009 8:22]
Emmanuel Bretelle
Hi Magnus, Did you get a chance to look at this issue? Any ideas?
[14 Dec 2009 15:38]
Magnus Blåudd
Hi, sorry about the problem. Yes, I think you should file a new bug report to make it easier to keep things apart. / Magnus