Bug #45899 api nodes (re-)connecting to starting data nodes too fast ...
Submitted: 2 Jul 10:23 Modified: 12 Aug 20:22
Reporter: Hartmut Holzgraefe
Status: In progress
Category:Server: Cluster Severity:S2 (Serious)
Version:mysql-5.1-telco-6.3 OS:Any
Assigned to: Magnus Blaudd Target Version:
Tags: 5.1.32-ndb-6.3.23-cluster
Triage: Triaged: D2 (Serious) / R3 (Medium) / E3 (Medium)

[2 Jul 10:23] Hartmut Holzgraefe
Description:
While a data node is starting up already started mysqld nodes seem to constantly attempt
to establish connections to it which the starting data node declines by closing the
connection socket right after the initial dialog.

Given enough mysqlds (or a few mysqlds with high enough ndb-cluster-connection-pool
setting) can eventually lead to all available TCP sockets on the data node machine being
stuck in the TCP TIME_WAIT state, leading to the machine no longer being reachable from
the outside.

How to repeat:
Create a cluster with sufficient data in it so that starting a node up takes a while.

Shut down a data node and restart it with strace (or truss) system call monitoring.

From the strace log you can see loads of continuous (or at least once per second per node
id) connection attempts like:

  accept(71, 0, NULL)               = 174
  recvfrom(174, "ndbd\n", 5, 0, NULL, NULL) = 5
  recvfrom(174, "ndbd passwd\n", 12, 0, NULL, NULL) = 12
  sendto(174, "ok\n", 3, 0, NULL, 0) = 3
  recvfrom(174, "121 1\n", 6, 0, NULL, NULL) = 6
  close(174)             = 0

So the data node accepts a new incoming TCP connection, does the initial handshake, and
after receiving an API node id immediately closes the connection.

When checking for sockets in TIME_WAIT status during the data node start you should see
quite a few of them, even on a cluster with just one data node i've seen the count of

  netstat -nt | grep -c TIME_WAIT

go up to about 300+ with ndb-cluster-connection-pool=1, so with about 200 active api
nodes the system is actually rather likely to eventually use up all 65K possible TCP
sockets ...

Suggested fix:
Throttle the connection attempts to a starting data node somehow, if it turned down a
connection attempt a few times already it is not likely to suddenly accept it the very
next moment anyway.

As the TIME_WAIT state takes two times the systems packet lifetime (MSL, 2min on Solaris,
60s on Linux, 30s on *BSD afaict) i would suggest to fall back to that interval for
reconnection attempts after a few continuous attempts have failed.

(no idea how to determine the actual MSL setting programmaticly though ...)
[3 Jul 12:07] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/77868
[13 Jul 16:08] Magnus Blaudd
Testing has shown that patch is no good.

Will change tactic to make the server just return a good error and then leave it to the
client to decide how fast it should try reconnect.
[17 Jul 14:01] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/78932

2999 jack andrews	2009-07-17
      Bug #45899  	api nodes (re-)connecting to starting data nodes too fast ...
        . committed on behalf of magnus
      
      
      Delay further connection attempts after 3 disconnects
      during "handshake" phase. Increase the delay with number of
      rejected connects, but limit to max 10 seconds.
      
      3 failed connects -> 1 second "block"
      4                 -> 2
      ..                   ..
      13                -> 10
      
      
      NOTE! Remove printouts
      
       storage/ndb/src/common/transporter/Transporter.cpp         |   54 +++++++++++++
       storage/ndb/src/common/transporter/Transporter.hpp         |    8 +
       storage/ndb/src/common/transporter/TransporterRegistry.cpp |   12 ++
       3 files changed, 73 insertions(+), 1 deletion(-)
[23 Jul 13:51] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/79148

3000 jack andrews	2009-07-23
      Bug #45899  	api nodes (re-)connecting to starting data nodes too fast ...
              . removed ndbout_c print out
[24 Jul 13:04] jack andrews
hartmut reports the patch works well:
>
>   count accept() calls in logs:
>   ----------------------------
>   ndb-6.3.23       : 361286
>   ndb-6.3.23-patch2:   4083
> 
> So the number of connections that came in during data ndoe restart
> was reduced by almost 99% with the current patch and should solve
> Italtels problem perfectly.
> 

just need to check the ndb.ndb_partition_range failure is not related in CluB
[3 Aug 18:13] Bugs System
Pushed into 5.1.35-ndb-7.0.7 (revid:jonas@mysql.com-20090803153854-un8jv9y2md8k221t)
(version source revid:jonas@mysql.com-20090803153854-un8jv9y2md8k221t) (merge vers:
5.1.35-ndb-7.0.7) (pib:11)
[5 Aug 15:50] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/80166

2961 Jonas Oreland	2009-08-05
      ndb - revert bug#45899
[12 Aug 19:26] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/80699

2962 Jonas Oreland	2009-08-12
      ndb - bug#45899
        Fix bug in previous patch, causing platforms
        with CLOCK_MONOTONIC but not PTHREAD_CONDATTR_SETCLOCK
        to misbehave (e.g dl145a-c)
[12 Aug 20:09] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/80706

3008 Jonas Oreland	2009-08-12
      ndb - bug#45899 - disable previous fix as it cause many tsts to fail in autotest
[12 Aug 20:13] Bugs System
Pushed into 5.1.35-ndb-6.3.26 (revid:jonas@mysql.com-20090812180907-gg0o88k94c5h4i7b)
(version source revid:jonas@mysql.com-20090812180907-gg0o88k94c5h4i7b) (merge vers:
5.1.35-ndb-6.3.26) (pib:11)
[12 Aug 20:13] Bugs System
Pushed into 5.1.35-ndb-7.0.7 (revid:jonas@mysql.com-20090812181054-euhuy5ncf49fzupv)
(version source revid:jonas@mysql.com-20090812181054-euhuy5ncf49fzupv) (merge vers:
5.1.35-ndb-7.0.7) (pib:11)
[12 Aug 20:22] Jonas Oreland
reverted current patch is it causes many test-failures in autotest
basically connect time went from 100ms to 10s

new solution should be along the lines
1) if connect fails, retry again after 100ms
2) if server-side is not in CONNECTING
   instead of closing connection, send back BLOCKED (in text)
   and have client retry after 100ms (which might lead to a new BLOCKED)
3) This requires a new state CONNECTION_PENDING or similar on both server and
   client side.

This will not consume any extra ports
Maybe maybe a patch based on SO_LINGER (which also can remove TIME_WAIT)
  could be acceptable as a intermediate solution...