Bug #45899 api nodes (re-)connecting to starting data nodes too fast ...
Submitted: 2 Jul 2009 8:23 Modified: 7 Apr 2014 9:32
Reporter: Hartmut Holzgraefe Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Cluster: Cluster (NDB) storage engine Severity:S2 (Serious)
Version:mysql-5.1-telco-6.3 OS:Any
Assigned to: CPU Architecture:Any
Tags: 5.1.32-ndb-6.3.23-cluster

[2 Jul 2009 8:23] Hartmut Holzgraefe
Description:
While a data node is starting up already started mysqld nodes seem to constantly attempt to establish connections to it which the starting data node declines by closing the connection socket right after the initial dialog.

Given enough mysqlds (or a few mysqlds with high enough ndb-cluster-connection-pool setting) can eventually lead to all available TCP sockets on the data node machine being stuck in the TCP TIME_WAIT state, leading to the machine no longer being reachable from the outside.

How to repeat:
Create a cluster with sufficient data in it so that starting a node up takes a while.

Shut down a data node and restart it with strace (or truss) system call monitoring.

From the strace log you can see loads of continuous (or at least once per second per node id) connection attempts like:

  accept(71, 0, NULL)               = 174
  recvfrom(174, "ndbd\n", 5, 0, NULL, NULL) = 5
  recvfrom(174, "ndbd passwd\n", 12, 0, NULL, NULL) = 12
  sendto(174, "ok\n", 3, 0, NULL, 0) = 3
  recvfrom(174, "121 1\n", 6, 0, NULL, NULL) = 6
  close(174)             = 0

So the data node accepts a new incoming TCP connection, does the initial handshake, and after receiving an API node id immediately closes the connection.

When checking for sockets in TIME_WAIT status during the data node start you should see quite a few of them, even on a cluster with just one data node i've seen the count of

  netstat -nt | grep -c TIME_WAIT

go up to about 300+ with ndb-cluster-connection-pool=1, so with about 200 active api nodes the system is actually rather likely to eventually use up all 65K possible TCP sockets ...

Suggested fix:
Throttle the connection attempts to a starting data node somehow, if it turned down a connection attempt a few times already it is not likely to suddenly accept it the very next moment anyway.

As the TIME_WAIT state takes two times the systems packet lifetime (MSL, 2min on Solaris, 60s on Linux, 30s on *BSD afaict) i would suggest to fall back to that interval for reconnection attempts after a few continuous attempts have failed.

(no idea how to determine the actual MSL setting programmaticly though ...)
[3 Jul 2009 10:07] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/77868
[13 Jul 2009 14:08] Magnus BlÄudd
Testing has shown that patch is no good.

Will change tactic to make the server just return a good error and then leave it to the client to decide how fast it should try reconnect.
[17 Jul 2009 12:01] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/78932

2999 jack andrews	2009-07-17
      Bug #45899  	api nodes (re-)connecting to starting data nodes too fast ...
        . committed on behalf of magnus
      
      
      Delay further connection attempts after 3 disconnects
      during "handshake" phase. Increase the delay with number of
      rejected connects, but limit to max 10 seconds.
      
      3 failed connects -> 1 second "block"
      4                 -> 2
      ..                   ..
      13                -> 10
      
      
      NOTE! Remove printouts
      
       storage/ndb/src/common/transporter/Transporter.cpp         |   54 +++++++++++++
       storage/ndb/src/common/transporter/Transporter.hpp         |    8 +
       storage/ndb/src/common/transporter/TransporterRegistry.cpp |   12 ++
       3 files changed, 73 insertions(+), 1 deletion(-)
[23 Jul 2009 11:51] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/79148

3000 jack andrews	2009-07-23
      Bug #45899  	api nodes (re-)connecting to starting data nodes too fast ...
              . removed ndbout_c print out
[24 Jul 2009 11:04] jack andrews
hartmut reports the patch works well:
>
>   count accept() calls in logs:
>   ----------------------------
>   ndb-6.3.23       : 361286
>   ndb-6.3.23-patch2:   4083
> 
> So the number of connections that came in during data ndoe restart
> was reduced by almost 99% with the current patch and should solve
> Italtels problem perfectly.
> 

just need to check the ndb.ndb_partition_range failure is not related in CluB
[3 Aug 2009 16:13] Bugs System
Pushed into 5.1.35-ndb-7.0.7 (revid:jonas@mysql.com-20090803153854-un8jv9y2md8k221t) (version source revid:jonas@mysql.com-20090803153854-un8jv9y2md8k221t) (merge vers: 5.1.35-ndb-7.0.7) (pib:11)
[5 Aug 2009 13:50] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/80166

2961 Jonas Oreland	2009-08-05
      ndb - revert bug#45899
[12 Aug 2009 17:26] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/80699

2962 Jonas Oreland	2009-08-12
      ndb - bug#45899
        Fix bug in previous patch, causing platforms
        with CLOCK_MONOTONIC but not PTHREAD_CONDATTR_SETCLOCK
        to misbehave (e.g dl145a-c)
[12 Aug 2009 18:09] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/80706

3008 Jonas Oreland	2009-08-12
      ndb - bug#45899 - disable previous fix as it cause many tsts to fail in autotest
[12 Aug 2009 18:13] Bugs System
Pushed into 5.1.35-ndb-6.3.26 (revid:jonas@mysql.com-20090812180907-gg0o88k94c5h4i7b) (version source revid:jonas@mysql.com-20090812180907-gg0o88k94c5h4i7b) (merge vers: 5.1.35-ndb-6.3.26) (pib:11)
[12 Aug 2009 18:13] Bugs System
Pushed into 5.1.35-ndb-7.0.7 (revid:jonas@mysql.com-20090812181054-euhuy5ncf49fzupv) (version source revid:jonas@mysql.com-20090812181054-euhuy5ncf49fzupv) (merge vers: 5.1.35-ndb-7.0.7) (pib:11)
[12 Aug 2009 18:22] Jonas Oreland
reverted current patch is it causes many test-failures in autotest
basically connect time went from 100ms to 10s

new solution should be along the lines
1) if connect fails, retry again after 100ms
2) if server-side is not in CONNECTING
   instead of closing connection, send back BLOCKED (in text)
   and have client retry after 100ms (which might lead to a new BLOCKED)
3) This requires a new state CONNECTION_PENDING or similar on both server and
   client side.

This will not consume any extra ports
Maybe maybe a patch based on SO_LINGER (which also can remove TIME_WAIT)
  could be acceptable as a intermediate solution...