MySQL Bugs: #45899: api nodes (re-)connecting to starting data nodes too fast ...

Bug #45899	api nodes (re-)connecting to starting data nodes too fast ...
Submitted:	2 Jul 2009 8:23	Modified:	7 Apr 2014 9:32
Reporter:	Hartmut Holzgraefe	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S2 (Serious)
Version:	mysql-5.1-telco-6.3	OS:	Any
Assigned to:		CPU Architecture:	Any
Tags:	5.1.32-ndb-6.3.23-cluster

Description:
While a data node is starting up already started mysqld nodes seem to constantly attempt to establish connections to it which the starting data node declines by closing the connection socket right after the initial dialog.

Given enough mysqlds (or a few mysqlds with high enough ndb-cluster-connection-pool setting) can eventually lead to all available TCP sockets on the data node machine being stuck in the TCP TIME_WAIT state, leading to the machine no longer being reachable from the outside.

How to repeat:
Create a cluster with sufficient data in it so that starting a node up takes a while.

Shut down a data node and restart it with strace (or truss) system call monitoring.

From the strace log you can see loads of continuous (or at least once per second per node id) connection attempts like:

  accept(71, 0, NULL)               = 174
  recvfrom(174, "ndbd\n", 5, 0, NULL, NULL) = 5
  recvfrom(174, "ndbd passwd\n", 12, 0, NULL, NULL) = 12
  sendto(174, "ok\n", 3, 0, NULL, 0) = 3
  recvfrom(174, "121 1\n", 6, 0, NULL, NULL) = 6
  close(174)             = 0

So the data node accepts a new incoming TCP connection, does the initial handshake, and after receiving an API node id immediately closes the connection.

When checking for sockets in TIME_WAIT status during the data node start you should see quite a few of them, even on a cluster with just one data node i've seen the count of

  netstat -nt | grep -c TIME_WAIT

go up to about 300+ with ndb-cluster-connection-pool=1, so with about 200 active api nodes the system is actually rather likely to eventually use up all 65K possible TCP sockets ...

Suggested fix:
Throttle the connection attempts to a starting data node somehow, if it turned down a connection attempt a few times already it is not likely to suddenly accept it the very next moment anyway.

As the TIME_WAIT state takes two times the systems packet lifetime (MSL, 2min on Solaris, 60s on Linux, 30s on *BSD afaict) i would suggest to fall back to that interval for reconnection attempts after a few continuous attempts have failed.

(no idea how to determine the actual MSL setting programmaticly though ...)

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/77868

Testing has shown that patch is no good.

Will change tactic to make the server just return a good error and then leave it to the client to decide how fast it should try reconnect.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/78932

2999 jack andrews	2009-07-17
      Bug #45899  	api nodes (re-)connecting to starting data nodes too fast ...
        . committed on behalf of magnus
      
      
      Delay further connection attempts after 3 disconnects
      during "handshake" phase. Increase the delay with number of
      rejected connects, but limit to max 10 seconds.
      
      3 failed connects -> 1 second "block"
      4                 -> 2
      ..                   ..
      13                -> 10
      
      
      NOTE! Remove printouts
      
       storage/ndb/src/common/transporter/Transporter.cpp         |   54 +++++++++++++
       storage/ndb/src/common/transporter/Transporter.hpp         |    8 +
       storage/ndb/src/common/transporter/TransporterRegistry.cpp |   12 ++
       3 files changed, 73 insertions(+), 1 deletion(-)

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/79148

3000 jack andrews	2009-07-23
      Bug #45899  	api nodes (re-)connecting to starting data nodes too fast ...
              . removed ndbout_c print out

hartmut reports the patch works well:
>
>   count accept() calls in logs:
>   ----------------------------
>   ndb-6.3.23       : 361286
>   ndb-6.3.23-patch2:   4083
> 
> So the number of connections that came in during data ndoe restart
> was reduced by almost 99% with the current patch and should solve
> Italtels problem perfectly.
> 

just need to check the ndb.ndb_partition_range failure is not related in CluB

Pushed into 5.1.35-ndb-7.0.7 (revid:jonas@mysql.com-20090803153854-un8jv9y2md8k221t) (version source revid:jonas@mysql.com-20090803153854-un8jv9y2md8k221t) (merge vers: 5.1.35-ndb-7.0.7) (pib:11)

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/80166

2961 Jonas Oreland	2009-08-05
      ndb - revert bug#45899

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/80699

2962 Jonas Oreland	2009-08-12
      ndb - bug#45899
        Fix bug in previous patch, causing platforms
        with CLOCK_MONOTONIC but not PTHREAD_CONDATTR_SETCLOCK
        to misbehave (e.g dl145a-c)

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/80706

3008 Jonas Oreland	2009-08-12
      ndb - bug#45899 - disable previous fix as it cause many tsts to fail in autotest

Pushed into 5.1.35-ndb-6.3.26 (revid:jonas@mysql.com-20090812180907-gg0o88k94c5h4i7b) (version source revid:jonas@mysql.com-20090812180907-gg0o88k94c5h4i7b) (merge vers: 5.1.35-ndb-6.3.26) (pib:11)

Pushed into 5.1.35-ndb-7.0.7 (revid:jonas@mysql.com-20090812181054-euhuy5ncf49fzupv) (version source revid:jonas@mysql.com-20090812181054-euhuy5ncf49fzupv) (merge vers: 5.1.35-ndb-7.0.7) (pib:11)

reverted current patch is it causes many test-failures in autotest
basically connect time went from 100ms to 10s

new solution should be along the lines
1) if connect fails, retry again after 100ms
2) if server-side is not in CONNECTING
   instead of closing connection, send back BLOCKED (in text)
   and have client retry after 100ms (which might lead to a new BLOCKED)
3) This requires a new state CONNECTION_PENDING or similar on both server and
   client side.

This will not consume any extra ports
Maybe maybe a patch based on SO_LINGER (which also can remove TIME_WAIT)
  could be acceptable as a intermediate solution...