Bug #40684 Connector/NET does not obey Connect Timeout
Submitted: 13 Nov 2008 6:08 Modified: 11 Aug 11:01
Reporter: d di (Basic Quality Contributor)
Status: Closed
Category:Connector/Net Severity:S1 (Critical)
Version:Any OS:Any (Tested on 5.0.9)
Assigned to: Vladislav Vaintroub Target Version:
Tags: qc
Triage: Needs Triage: D1 (Critical)

[13 Nov 2008 6:08] d di
Description:
There's a web server.  There's a MySQL database server.  In between them sits a firewall.
 In front of the web server are customers.  The customers complain that applications
sporadically takes several minutes to load.

Checking the trace logs, the time is spent in MySqlConnection.Open().

Debugging commences.  First, try to connect from a cold versus a warm connection pool
(press ENTER when asked):

========================================
15:10:05  Info: Connect Timeout=3
15:10:06  Fishing connection...
15:10:06  Success (id: 525).  Checking state...
15:10:06  Done (state: Open).  Releasing connection...

Ready to fish connection a second time - press ENTER.

15:10:08  Fishing connection...
15:10:08  Success (id: 525).  Checking state...
15:10:08  Done (state: Open).  Releasing connection...
========================================

Awesome.

There's a firewall in between the two hosts, and it will drop the pooled connection's TCP
circuit after some time, to avoid session hijacking or whatever.  It's a common thing. 
Simple firewalls will keep the connection alive if there's a MySQL wire protocol ping
happening inside, more advanced firewalls won't.

We'll simulate the firewall by yanking the cable to the database server.  (Alternatively,
you can open a console on the database server and null route the "web server"'s IP address
when it says "yank cable" below - same effect.)

Now to make sure the connection timeout is working.  First, from a cold pool:

a) yank the cable to the database server and
b) run the test.

========================================
15:12:18  Info: Connect Timeout=3
15:12:18  Fishing connection...
15:12:22  Exception: Unable to connect to any of the specified MySQL hosts.
========================================

Notice the time stamps.  Times out after 3 seconds.  Perfect.

Next, test with a warm cache.  First:

c) reattach cable to database server
d) run the test (don't press ENTER).

========================================
15:14:28  Info: Connect Timeout=3
15:14:28  Fishing connection...
15:14:28  Success (id: 43).  Checking state...
15:14:28  Done (state: Open).  Releasing connection...

Ready to fish connection a second time - press ENTER.
========================================

Ok, it's warm... Now:

e) yank the cable to the database server again, then
f) press ENTER to continue the test.

========================================
15:14:35  Fishing connection...
15:15:28  Exception: Unable to connect to any of the specified MySQL hosts.
========================================

Fail.  53 seconds passed before the call to Open() timed out, which is 50 seconds too
much.

Please fix :-).

How to repeat:
See description, use soon-to-be-attached test script.

Suggested fix:
How a connection pool should work:
 1) Keep a number of connections open.
 2) Connection attempts time out after "Connect Timeout" seconds.
 3) When connections are released back to the pool, run "RESET CONNECTION STATE" on
server.
 4) While connections are in pool, run a wire protocol ping every 5 seconds to ensure
they're hot.
 5) The wire protocol ping should time out according to the "Connect Timeout" variable.
 6) When a connection is fished, deliver the one that was most recently pinged.

Note 1):
  Minimum and maximum number of connections in pool must be customizable (as is now -
great).

Note 2):
  "Connect Timeout" must work, regardless if a TCP connect() is too slow, or a MySQL
handshake is too slow.  Examples.  A connect() timeout could happen because of packet
loss, a greeting timeout could happen because of reverse DNS failure on server.

Note 3):
  If the server does not support "RESET CONNECTION STATE", the connections must be torn
down and rebuilt when they're released to the pool.  An option should be supplied, let's
call it "Force Reuse=True", that causes the pool manager to ignore residual server state
(temporary tables, variables, what not) and just keep the connections around without
resetting them.  The purpose of this is to prevent starvation of TCP ports due to
TIME_WAIT.

Note 4):
  This ensures that ready-to-use connections are delivered without delay and with maximum
reliability.

  Time between pings should be customizable to allow tuning for most reliable pool versus
minimum server load.  Henceforth "Ping Interval=5".

  The initial connection counts as a ping, so at least 5 seconds (see above) must pass
before the first ping.

Note 5):
  If a connection dies in this state, it is simply rebuilt.  The ping thread must start
pings asynchronously, so (for example) a multi-thread safe callback is necessary to
handle when a ping fails.  A boolean flag on each connection provides an indication as to
whether a given in-pool connection is currently being pinged.

Note 6):
  Algorithm assures the most distance between the background ping thread and the front of
the connection pool.

  If there are no connections available in the pool, and the maximum size has been
reached, an exception is thrown.

  Otherwise, there will naturally be either an available connection, pinged less than 5
seconds ago available for consumption, or a connection currently being pinged.  Whichever
is available is grabbed.  Once grabbed, the connection is no longer eligible for automatic
pinging.  If the connection grabbed by the pool manager is currently doing a ping, the
pool manager waits for the ping to complete.  Instead of the standard ping-fail handler
(see above) rebuilding the connection if the ping fails, what happens instead when a
connection is grabbed is that any exception is forwarded to the waiting application.

Go! :-).
[13 Nov 2008 6:12] d di
test case

Attachment: issue40684.cs (text/plain), 1001 bytes.

[13 Nov 2008 12:52] d di
Suggestion note 6) continued...

An argument against prioritizing the most recently pinged connection would be that it
makes it harder to figure out if there are more connections in the pool than there need
to be.

For example, if the pool of available connections were prioritized (for client grabbing)
by most-recently-freed connection first, then there would naturally be a tail of
connections that never gets sent into active duty and therefore receive no queries when
the pool is slightly too large.  From there it's simple to add a threshold and depopulate
the pool when the tail connections linger for too long.
[15 Nov 2008 21:42] d di
Talk is cheap..

I've hacked together a working prototype of a connection pool which accomplishes some
critical goals in the most optimal way I could think of.

It focuses on:
 - simplicity,
 - maximum availability of guaranteed "hot" connections,
 - minimal use of resources, and
 - security.

Features overview:
 * Fetching a pooled driver is O(1).
 * Releasing a pooled driver is O(1).
 * Automatic tear-down of unused drivers (can be disabled).
 * Automatic tear-down of unused pools (can be disabled).
 * Connections are pinged once in a while, if they idle in the pool for too long (can be
disabled).
 * For high activity applications, connections are (as a natural consequence of the
algorithms) practically never pinged.
 * Failed in-pool connections are automatically recycled.
 * Failed out-of-pool connections expose the problem to the end application for handling
there.
 * Support for server-side connection state flush.  Must be enabled with a switch when
the server grows support.

For details, see Prototype.cs.
Should be reasonably commented for those unaccustomed to the particular coding style.

In Prototype.cs, the following classes are mostly stubs:
 1) DsnParameters
 2) MySqlDriver
 3) MySqlConnection

Specifically the parts that have no relation to the pooling mechanism are unimplemented.

Of course, this code also relies on the underlying driver's Ping() to fail after 'Connect
Timeout' seconds.  This just fixes the need to do Ping() in the context of returning a
driver to the client.  And also a few other things I noticed along the way.

Improvements compared to current connection pool logic:
 * Passwords and logins are no longer retained in RAM forever.
 * Timeouts will no longer fail if the system time changes.
 * Unused resources are freed for garbage collection.
 * Server problems never affect the pooling logic itself.
 * Connections are kept hot without the need to ping when a client fetches from the
pool.

Separate issues from this one, but I haven't bothered creating bugs in the tracker.  The
last item pretty much corresponds to issue #36194.

The improved logic might also fix issues #40036 (pooled connections do not destruct),
#31996 (resources not released), #37159 (pool fails completely after server restart), and
the Connector/NET part of issue #33540 (connection state flush).
[15 Nov 2008 21:43] d di
improved connection pool logic

Attachment: Prototype.cs (application/octet-stream, text), 22.56 KiB.

[17 Nov 2008 11:23] d di
Revision 2 (testing).

Code changes:
 a) Removed stubs and bindings and added interfaces, so that the connection pool logic is
separate from the core MySQL driver.
 b) Added a null driver + connection string.
 c) Added a test program, uses the connection pool with null driver to blast off a bunch
of connections.
 d) Fixed a couple of bugs found during testing.

To test the new connection pool,
 1) Run the test program.
 2) Copy/paste the debug output from the 'Output' window into a text file.
 3) Use the Unix 'sort' command to sort by connection id and event nr.

Output from the null driver has 4 columns:
	"connection id", "connection event number", "current time" and "event description".

To implement the new connection pool,
 1) Drop the three relevant source code files into Connector project.
 2) Implement IPoolConfiguration and IPoolableDriver on ConnectionString and Driver
classes.
 3) Create a static variable to hold a reference to the connection pool manager and
initialize it statically.
 4) Make sure the Connection class uses ConnectionPool.{Catch,Release} to "create" and
"destroy" drivers.

That's basically it.
[17 Nov 2008 11:24] d di
improved connection pool logic, revision 2

Attachment: Prototype2.zip (application/zip, text), 16.14 KiB.

[19 Nov 2008 11:25] Tonci Grgin
Hi David and thanks for another excellent report.

Would you mind testing 5.2.4 against your code as it's GA. I do not know what to do with
5.0 any more.
[19 Nov 2008 15:02] d di
¿5.2 is GA?  Last I heard from MySQL, the Connectors follow the Server version numbers. 
Server 5.0 is currently GA, so Connector/NET 5.0 must be the GA version.

(Just re-checked, nothing says otherwise at the download pages for either of
Connector/NET {1.0, 5.0, 5.1, 5.2}.)

Re-tested, results are the same for 5.2.4:

========================
Ready to fish connection a second time - press ENTER.

13:03:24  Fishing connection...
13:04:50  Exception: Unable to connect to any of the specified MySQL hosts.
========================

Well, they're worse actually, this time it was 83 seconds over the timeout threshold for
Open().

It also seems like the new driver causes the test application to hang when it's
re-throwing the MySqlException, for some reason...
[19 Nov 2008 15:32] Tonci Grgin
David, connectors do not have to follow server version. As a matter of fact, they usually
don't do so. c/J and c/ODBC GA is 5.1 for example.

Checking your test case.
[19 Nov 2008 15:35] d di
Okay - it would be nice if you could update the download pages to say so.  Or just to say
anything at all about what the difference between the versions are.

Btw (forgot the formalities), Hi Tonci, and no problem with the report ;-)
[10 Dec 2008 12:13] Tony Bedford
The main Connectors documentation has been updated with a note to the effect that:

"Connector version numbers do not correlate with MySQL Server version numbers."

An initial version of a version compatibility matrix has also been added to the
Connectors documentation landing page:

http://dev.mysql.com/doc/refman/6.0/en/connectors-apis.html

These changes will take effect within the next 24 hours.
[15 Jan 13:19] Kim Christensen
Has this been fixed in any version of the Connector/NET?
[10 Mar 14:40] Christos Pavlides
David,
I would be nice if you could provide a proper svn diff file for these changes in order
for the rest of us, and MySql to be able to test this.

Thanks
[10 Mar 14:50] d di
The source package didn't compile after downloading.  Can't remember what the error was,
but clearly if it was supposed to be compileable by a third party, someone hadn't done
their job properly and tested if you could actually follow build instructions on a blank
installation of Windows.  So early on I gave up on using the established driver as a
starting point.

However, the connection pool manager is completely self-contained, as you can see by the
included NullDriver that uses it.  Handy for testing the code, and also means that a diff
is largely unnecessary.  You should just need to add IPoolConfiguration and IPooledDriver
to existing config string and driver classes and click 'Compile', and you should get a
short list of glue methods that need to be implemented.
[10 Mar 15:57] Christos Pavlides
David, thanks for the comments. Just another question though, if you could not compile the
code, are you not using the MySql .Net Connector?, or you just did not implement your
prototype?

Thanks
[10 Mar 16:08] d di
We're using the Connector, and we're still having the problem described in this bug.

The connection pooling code in prototype2.zip works, but was never integrated into the
driver, as the driver did not compile (and the problem causing this was non-trivial to
fix).

Hope that clarifies?
[13 Jul 20:45] d di
driver with alternative connection pool logic integrated

Attachment: connector_net_526_with_alternate_pooling.rar (application/x-rar-compressed, text), 482.09 KiB.

[13 Jul 20:56] d di
Hi Christos (+ rest!)

I've uploaded a .rar (because of size limits in Eventum) of a 5.2.6 driver with the
alternative connection pool logic on top.

You can get a svn diff by unpacking the .rar and diffing against
http://svn.mysql.com/svnpublic/connector-net/tags/5.2.6, revision 1591.

(You should also be able to use svn to rebase it to v6.0.4 if need be.)

Yup :-)
[16 Jul 11:12] Tonci Grgin
David, you sure posed a lot of questions. Main one, as it appears to me, is delay in
releasing connection. I believe this is done because protocol is waiting for this or that
parameter to exceed, so nothing to do with c/NET.
As for alternative approach to pooling, we are discussing some solutions in a bug posted
by Christos (I think). Please see in BugsDB.

I will consult others regarding this.
[16 Jul 13:17] d di
> Main one, as it appears to me, is delay in releasing connection.

Actually, the subject of this bug is a 50 second delay when opening a connection, and the
connection timeout is set to 3 seconds.

> I believe this is done because protocol is waiting for this or
> that parameter to exceed, so nothing to do with c/NET.

Any more details?  Could you post a link to a bug or similar?

> I will consult others regarding this.

Sounds good.  I think that the solution I posted here works well with TCP/IP and
timeouts, but there's still lots of room for improvement.  For example,
 - support for a "Min Pool Free" parameter for enterprise applications that require a
warm connection pool on startup,
 - more advanced asynchronous primitives in the driver would allow using fewer dotnet
threads.

There's also a good deal of design flaws in the wire protocol itself, which makes it
difficult to do things the right way in the driver in a lot of places.
[17 Jul 12:10] Tonci Grgin
Wlad took first look into proposed code changes (which is quite huge). Waiting on more
input.
[17 Jul 12:12] Tonci Grgin
Pooling part proposed changes described in short in Bug#40036, [27 Mar 17:33] Tonci Grgin.
[20 Jul 11:24] Tonci Grgin
Wlad, Reggie, please see Bug#40036 and related reports too.
[30 Jul 0:03] Vladislav Vaintroub
Hi David,
I read the bug description once again. So you problem are firewalls that drop connections
due to long inactivity. I believe the simplest thing possible to prevent it is to use TCP
keepalive on underlying sockets (with some realistic timeout and interval values). I'm
aware it does not solve the problem  when the network is down, but this was not the
actual problem, is this correct?
[30 Jul 1:56] d di
Vladislav> ... use TCP keepalive ...

Back in the day, I tried using keep-alives (by lowering the
Tcpip/Parameters/KeepAliveTime value, I think), but ran into a technicality which meant
that it was unusable as a workaround.  Can't remember exactly what it was..

Vladislav> ... this was not the actual problem, is this correct?

The problem originally described in this bug is that the driver fails to time out as
instructed to in the connection string, under certain circumstances.
[30 Jul 2:08] d di
Hi Vlad (again!)

Rummaged through various old email and actually managed to find out why TCP keep-alive
wasn't useful:

The TCP/IP stack only sends keep-alives if the application itself asks for it using
setsockopt(SO_KEEPALIVE).  Since Connector/NET doesn't do that, adjusting the stack's
timeouts will do absolutely nothing.

Also, I think it's a bit of a stretch to ask everyone using Connector/NET to bend and
mend their TCP/IP stacks.  It would be a lot less strain on users if the pooling
mechanism was fixed once and for all to be sufficiently intelligent to handle network
outages (of whatever kind) (for example using the algorithm / driver posted here).
[30 Jul 17:04] Vladislav Vaintroub
d di,the reason why I was asking is that I'm going implement the keepalive within the
connector itself (not that I would expect you to do something with connetor.net sockets
;)
[30 Jul 17:27] Vladislav Vaintroub
Also I do appreciate the effort that is done with the patch (besides I'd like to note the
bug description, reproducible case are one of the best I ever read:).  I leave it up to
Reggie to decide what to do with the patch (it is big, and unfortunately bigger patches
have less probability to be accepted).

On the other hand TCP itself does provide a mechanism to deal with sort of problems you
encounter without need to "ping" on the application level. So I think it would solve
firewall problem adequately. I do plan to provide keepalive timeout and interval
parameters for .NET users (for Mono it does not seem to be possible), so that users would
not need to change anything in their registry.
[31 Jul 18:13] d di
Ah, I see.  Yeah, sounds doable albeit MS-Windows only with a SIO_KEEPALIVE_VALS ioctl
:-).

I have a feeling that the algorithm presented in this issue is cleaner than the current
official one, and may as such also be a fix for various other pooling bugs.  But I
haven't got anything to substantiate.

I'm a bit curious if there are any technical reasons why you decided to go a different
direction rather than using the code posted here.  If you have found or thought of some
problem with it, please post!

WRT size, most of the patch is ripping out the pooling logic and adding a couple of files
containing new logic.  The rest of it is related bugfixes here and there.  Let me know if
you need hints on why parts of it are there, or whatever.

Also, isn't this what a test suite is for?  Apply the patch, if it passes all tests,
nothing important broke ;)
[1 Aug 3:04] Vladislav Vaintroub
>Ah, I see.  Yeah, sounds doable albeit MS-Windows only with a >SIO_KEEPALIVE_VALS ioctl
> :-)

Near:) I thought about something like described here http://blogs.msdn.com/lcleeton/ .
the problem with Mono is that Socket.IOControl is not implemented. So we may leave  Mono
users only with keepalive on/off for the time being  and .NET users will also get the
keepalive timeout. 

>I have a feeling that the algorithm presented in this issue is cleaner than >the current
official one, and may as such also be a fix for various other >pooling bugs.  But I
haven't got anything to substantiate.

>I'm a bit curious if there are any technical reasons why you decided to go a different
direction rather than using the code posted here.  If you have found or thought of some
problem with it, please post!

Well, the keepalive path is result of my recent study how tcp works:) The general problem
with sockets is that the blocking calls like send/recv could hang almost infinitely,
unless send/recv timeouts are used or TCP finds an error when sending keepalive packet.
read/write timeouts are the most clean solution and we are currently looking to implement
that, and for shared memory and pipe connections too. Keepalive has yet another fine
property, besides being able to detect network error, and this is exactly your case, it
prevents firewalls from dropping inactive connections. This could be a biased opinion,
but I do prefer a protocol layer ping to application level ping, this certainly makes
programming easier and removes some burden from the server to handle ping request and
from client to check for the result of that requests.

So, when timeouts are fixed, we could take a look at other problems with pooling..

Ok, to the problems in the patch : using a thread per driver, and that thread ill also
sleep most of the time. Hmm... not sure it can be called the most efficient resource
usage
[3 Aug 13:58] d di
> the problem with Mono is that
> Socket.IOControl is not implemented

I guess it could always be probed for with reflection and turned on when/if the method
pops into existence.

Let me know when you have some code ready, so we can test against the original problem
examined in this bug report.

> Ok, to the problems in the patch

Yep, let's address those.

> not sure it can be called the
> most efficient resource usage

Memory-wise, threads are limited to using a few kilobytes of RAM each.  This is done by
adjusting their stack very low.  In effect they consume so little that memory pressure is
a non-issue.

Currently set to 64 KiB per thread, although I haven't found a lower limit yet so there's
probably no hinderance to decreasing it much further if you wanted to.

> using a thread per driver, and that
> thread ill also sleep most of the time.

Actually, that depends on the situation.  With a working server and low RTT, yes, threads
will be asleep most of the time, only to be awoken for short periods of activity when
they're needed.

But when half of the connections are unavailable, due to for example a firewall, 50% of
the threads will actually be running.  If a server is completely gone, it may be as much
as 100% of the threads that are active.

(Having 100% of the threads active does not mean that they're actually consuming CPU
cycles.  In the case of every thread active, they'll be sitting in a blocking socket call
waiting for a ping response, so even then they'll still consume ~0% CPU.  But none of them
will be in the Sleep state in this situation.)

The scheduler will have a few more threads to juggle, yes.  I imagine that the CPU
scheduler is pretty darn efficient when it comes to keeping sleeping and io-blocked
threads out of the scheduling loop, as far as I can imagine this is not a problem.  I
haven't seen any issues here while testing, either.

It is a necessity to have one thread per driver given how the current driver works.  I
didn't want to rewrite the whole driver, which is why I settled on this solution.

It would be possible to do a later rework of the driver and pooling such that there is
primitives in the driver for doing non-blocking BeginConnect() and BeginPing(), and just
one thread in the pooling algorithm that handles every single driver.  Bigger change,
less likely to be accepted ;-), but definitely possible.
[4 Aug 0:49] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/79964

723 Vladislav Vaintroub	2009-08-04
      Bug #40684 Connector/NET does not obey Connect Timeout.
      The actual problem in the bug report is that pooled connections are dropped  by
firewalls due to long inactivity and that results in longer connect times.
      
      The patch introduces 2 new parameters: "keepalive" (boolean, sets keepalive option
on socket) and "keepalive timeout" (integer, number of seconds before first keepalive
packet is sent).  "keepalive timeout" does not automatically switch "keepalive" on, both
options need to be specified to set the timeout.
      
      One can use this parameters to prevent inactivity drops and also to detect broken
connections faster.
      Note that  "Keepalive timeout"  is not currently available in Compact Framework nor
in Mono ( will be silently ignored there and connection will use the default timeout what
is normally 2 hours)
[4 Aug 22:45] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/80086

738 Vladislav Vaintroub	2009-08-04
      Bug #40684 Connector/NET does not obey Connect Timeout.
      The actual problem in the bug report is that pooled connections are dropped  by
      firewalls due to long inactivity and that results in longer connect times.
            
      The patch introduces new parameter: "keepaliv" (integer, number of seconds before
first keepalive packet is sent). 
      One can use this parameters to prevent inactivity drops and also to detect broken
      connections faster. Note that setting non-default timeout  is currently not
possible with  Compact Framework nor with in Mono. Here, if keepalive option is used, the
timeout would be defined by the OS default, which is normally 2 hours.
[5 Aug 0:37] Vladislav Vaintroub
Hi David,
the keepalive option is now in the repository. It is hosted in Launchpad this days, so
you need to install bzr and do 
bzr branch lp:connectornet/6.0
to get it.

(you can try  bzr branch lp:connectornet/6.1 or just bzr branch lp:connectornet
to get trunk)

The new option that is aimed to fix your problem is "keepalive=<seconds>".
with value of 1 for seconds I get the IO exception in your test case almost immediately.
Admittedly, it is not how the option should be used in your case, as you just want to
prevent firewalls to do bad things with the connection, so you might want to have
timeouts longer that 1 second to reduce network traffic.

The general fix for timeouts is a tick larger and it will probably go into 6.2. The patch
for it is on http://lists.mysql.com/commits/79914, it has not been reviewed yet and hence
not pushed. If you want to play with it, apply against current trunk (i.e bzr branch
lp:connectornet)

Regarding large number of threads- I do not think I can judge about how schedulers deal
with sleeping threads, and I noticed you used  64K stacks,  is the minimum possible on
Windows. But I do think unless there is a very good reason for it, we should follow
general recommendations and use thread pool for background tasks when possible (when not
possible, maybe own thread pools with thread = <small constant>*CPUs)
[5 Aug 12:46] Tony Bedford
An entry was added to the 6.0.5 and 6.1.1 changelogs:

When reusing a connection from the connection pool, Connect Timeout was not obeyed when
trying to reopen the connection. Although Connect Timeout was set to 3 seconds, it took
53 seconds before the call to Open() timed out.
[6 Aug 9:08] d di
> I noticed you used  64K stacks,
> is the minimum possible on Windows.

Could be.  Judging by the documentation, if the interpreter happens to run dotnet threads
as native OS threads rather than user-mode threads, and the OS granularity dictates a
minimum of 64 KiB.  The documentation tends to be imprecise, and I haven't tested how the
de-facto Microsoft implementation works on any version of MS-Windows, so I don't know for
sure what the minimum is.

> But I do think unless there is a very
> good reason for it, we should follow
> general recommendations and use thread
> pool for background tasks when possible
> (when not possible, maybe own thread pools
> with thread = <small constant>*CPUs)

My goal was a solution that's guaranteed to work.  Thread pools introduce added
complexity and overhead that makes it harder to predict how the final algorithm executes,
which is why I avoided it.

Fast async primitives for connect and ping would allow a logic change so that only one
thread is needed for handling drivers, completely eliminating the need for a thread pool,
and I think that's a better way.  If avoiding threads indeed turns out to be desirable
under actual usage.

> bzr branch lp:connectornet/6.0 to get it.

Do you have a buildbot that produces binaries somewhere?

> The new option that is aimed to fix your
> problem is "keepalive=<seconds>".  with
> value of 1 for seconds I get the IO exception
> in your test case almost immediately.

Okay, so your approach fixes this problem but causes another one: increased network
traffic.

Empty packets, but still one per MySQL connection per second, and each packet still has
to be handled by any intermediary device that handles traffic on a per-packet or higher
level.

Doesn't seem fair to our network administrators to cause such a massive overhead,
especially considering that we have a fine solution that works without it.

> Admittedly, it is not how the option should be used in
> your case, as you just want to prevent firewalls to do
> bad things with the connection, so you might want to
> have timeouts longer that 1 second to reduce network
> traffic.

I assume that by increasing the timeout, you would again cause the test case in this
issue to fail, and that's why you selected 1 second, is this correct?

> The general fix for timeouts is a tick larger
> and it will probably go into 6.2.

Funny story, I heard the exact same said about this issue and 6.0 a few months before
that version was released :-).

On a serious note, why are you closing this issue if it's not fixed yet until 6.2?
[6 Aug 10:41] Vladislav Vaintroub
> Do you have a buildbot that produces binaries somewhere?
No, unfortunately launchpad does not provide build bots (yet?). I can send you the dll if
you wish.

>Okay, so your approach fixes this problem but causes another one:
increased network traffic

Exactly as yours:) You're using driver.Ping() on inactivity, I'm using keepalive
packet... The keepalive timeout is configurable, so network traffic is controlable. The 1
second keepalive is an example that I would not recommend anyone to use when it should
only stop firewalls. Maybe the 2 hours are ok, maybe 30 minutes or maybe 10 minutes, I do
not know. People  get a parameter they can adapt for their environments.

>I assume that by increasing the timeout, you would again cause the test
>case in this issue to fail, and that's why you selected 1 second, is
>this correct?

I use 1 sec  to verify that keepalive works as advertised.I like shorter tests:)

>On a serious note, why are you closing this issue if it's not fixed yet
until 6.2?

Well, if you read the bug title and the bug description you'll see slight descrepancy.
The bug title says "connection timeout is not obeyed", the longer description says
"firewalls drop connection because of inactivity". keepalive fixes the second problem,
which I understood is the real problem in your case (and which your own patch was aimed
to fix). We normally do not solve more than a single problem at once. So I'm closing this
bug, it fixes problem in "long description".

The blueprint for general timeout handling is created here
https://blueprints.launchpad.net/connectornet/+spec/new-command-timeout. It also has the
code attached to it and you get the branch with
bzr branch lp:~vvaintroub/connectornet/timeouts  This blueprint is dedicated solely to
timeouts. It is not about idle connections/firewalls nor about pooling (just in case:)

But sure enough, you can also open another bug if the fix with keepalive parameter does
not solve the problem you _described_ (which is not exactly the title of this bug:)
[6 Aug 12:28] d di
> I can send you the dll if you wish.

Thanks, please do!

>> Okay, so your approach fixes this problem
>> but causes another one: increased network traffic
>
> Exactly as yours:)

I disagree, there's a considerable difference between the defaults suggested: a ping
every 30 seconds versus one every second.

>> I assume that by increasing the timeout, you would
>> again cause the test case in this issue to fail,
>> and that's why you selected 1 second, is this correct?
>
> I use 1 sec  to verify that keepalive works as advertised.

That didn't answer the question though: is the issue fixed or not if the keepalive is set
to a reasonable value rather than 1 second?

>> On a serious note, why are you closing
>> this issue if it's not fixed yet until 6.2?
>
> Well, if you read the bug title and the bug
> description you'll see slight descrepancy. The
> bug title says "connection timeout is not obeyed",
> the longer description says "firewalls drop
> connection because of inactivity".

That's a bit of a misunderstanding of how issue reporting is done.  It's not a "long
description" that you're quoting, it's a "test case".

A test case may not completely, accurately or in whole represent the issue at hand.  It
is merely a piece of code provided as proof that there is an actual problem.

(With simpler issues, the test case may happen to cover the entirety of the problem with
complete accuracy.  That's more of a lucky circumstance.)

> keepalive fixes the second problem, which I
> understood is the real problem in your case
> (and which your own patch was aimed to fix).

The end goal is that every single Connector/NET timeout should work as advertised.  That
is what the patch posted in this issue accomplishes.

It also represents an overall algorithm that provides a much better guarantee for warm
connections than the currently used algorithm.

> We normally do not solve more than a single
> problem at once. So I'm closing this bug, it
> fixes problem in "long description".

Okay.  While fixing the test case rather than the underlying design problem doesn't
really solve my issue, I guess it's fair enough to close an issue once there's no longer
any evidence that a problem exists.

> The blueprint for general timeout handling is created here
> https://blueprints.launchpad.net/connectornet/+spec/new-command-timeout.

I'm glad to see that you're working on it!

(I fail to understand why you'd want to re-solve a problem that is already solved by the
patch submitted in this issue, rather than working with the patch, but whatever.)

> But sure enough, you can also open another
> bug if the fix with keepalive parameter does
> not solve the problem you _described_ (which
> is not exactly the title of this bug:)

I think I've covered this further above.  In any case, here is a more accurate
description for this issue:

"Every single one of the specifiable timeouts in Connector/NET fail to work under a
myriad of various common circumstances."

"I'll refrain from posting a test case.  It should be obvious to anyone that with the
sheer amount of fail given by the description, this is not a simple bug but a completely
broken design."

I hope that the revised description can provide you with a clear perspective on what the
problem at hand is :-).
[6 Aug 16:27] Vladislav Vaintroub
David, hopefully we can continue this discussion offline withou hijacking the bug that is
in documenting state.

The dll is sent, together with source.

>>> Okay, so your approach fixes this problem
>>> but causes another one: increased network traffic
>>
>> Exactly as yours:)

>I disagree, there's a considerable difference between the defaults
>suggested: a ping every 30 seconds versus one every second.

I did not suggest that. Again, this is the value I used for quick testing, nothing more
implied. If you want 30 seconds timeout, use it, it is possible. ´

>That didn't answer the question though: is the issue fixed or not if
>the keepalive is set to a reasonable value rather than 1 second?

IF all timeouts issue would be fixed with this patch I would not need to create
blueprints. OF course, if I set the timeout to larger value, broken connection is found
later.In your proposed solution, there is neither magic that would determine a broken
connection instantl (you'd need to ping the connection instantly for this)

The firewall dropping connections should be fixed though.  I re-read the bug description.
It tells a story of firewalls dropping connections and repeats the word "firewall" 4
times, so I have hard time to believe it was not your issue. 

The patch you attached fixes firewall problem by keeping connection alive, and it is
using driver.Ping() is not it? Besides it Ping()'s to detect broken connections. So now
what TCP keepalive is designed for? Exactly for the same thing, detecting dead peers and
preventing connection drops.It does not need many threads, it does not need programming
efforts and moreover, I can rely on the fact that keepalive doesnot hang. 

I cannot rely on driver.Ping() that it does not hang. Why not? a Ping is a network send()
and a network recv(). And if connection is broken while it recv()'s , TCP stack is not
guaranteed to abort your connecton. It can hang infinitely. Please check the relevant
discussion http://forums.sun.com/thread.jspa?threadID=539297&start=0&tstart=0

So now imagine unlucky situation where all of the numerous threads in your implementation
Ping and all of them hang.

Contrast with keepalive timeout. Even if connection is half-open and hangs in recv(), TCP
will abort the connection after failing retransmissions.

So now I hope I explained why keepalive helps not only to keep the connection online but
also to resolve hangs. This option however is not a full magic, to support
ConnectionTimeout and CommandTimeout we need something better  and this is the work I
mentioned It cannot be resolved by having application level Ping's as in the patch
attached. This needs to be solved be setting read and write timeouts on sockets instead
and let the TCP stack handle those issues.
[6 Aug 21:13] d di
> hopefully we can continue this discussion
> offline withou hijacking the bug

Ah yes, definitely, this is hardly the place for discussing what a test case is and is
not.

> IF all timeouts issue would be fixed with
> this patch I would not need to create blueprints.

Unsure which one of the patches we are talking about here.
I believe all timeout issues are fixed in the patch I submitted, at least.

> I cannot rely on driver.Ping() that it does not
> hang. Why not? a Ping is a network send() and a
> network recv(). And if connection is broken while
> it recv()'s , TCP stack is not guaranteed to abort
> your connecton. It can hang infinitely.

Not if you're using my patch.  There's a network receive timeout which is used to ensure
exactly that this does not happen.

> So now imagine unlucky situation where all of the
> numerous threads in your implementation Ping and
> all of them hang.

If the network is completely gone, each thread will block for exactly [Connect Timeout]
seconds.  After that, an Exception is delivered to the application, if it has acquired
the particular driver that is failing.  If not, the driver is simply removed from the
pool.

> Contrast with keepalive timeout. Even if connection
> is half-open and hangs in recv(), TCP will abort the
> connection after failing retransmissions.

End result is the same, except that the patch submitted in this issue has one less
timeout value for the user to keep track of.

> So now I hope I explained why keepalive helps not only
> to keep the connection online but also to resolve hangs.
> This option however is not a full magic, to support
> ConnectionTimeout and CommandTimeout we need something
> better and this is the work I mentioned It cannot be
> resolved by having application level Ping's as in the
> patch attached. This needs to be solved be setting read
> and write timeouts on sockets instead and let the TCP
> stack handle those issues.

Sure it can?  Try the patch, it's already solved.
[7 Aug 0:10] Vladislav Vaintroub
Oh, I see. It is really better to send patches as unified diffs,I was stuck in
ConnectionPooling, and you got a lot of different other changes as well.

To the connection pooling: Am I right saying that the threads merely simulate in
application non-configurable 10 seconds TCP keepalive? Instead we can do
user-configurable TCP keepalive, that by default not switched at all, without  threads?
You also do clean up for connections that are expired, but that again does not justify
using a thread per connection, IMO. Oracle .NET connector documents that it cleans up
connections with expired ConnectionLifetime every 3 minutes, I bet they use an ordinary
Timer.

To the timeouts, yes, I see now you use them in
 baseStream.ReadTimeout = left
in MySqlStream.cs

The problems here is that not work for named pipes and shared memory streams, that also
need to support command timeouts. This would throw NotImplemented exception.
[7 Aug 2:06] d di
> It is really better to send patches as unified diffs,

Code based on http://svn.mysql.com/svnpublic/connector-net/tags/5.2.6, revision 1591.  A
unified diff is produced by dumping the submitted archive on top of a checkout of that
revision, and running "svn diff".

> non-configurable 10 seconds TCP keepalive?

Actually, looking at the code again, the timeout is configurable.

> Instead we can do user-configurable TCP keepalive,
> that by default not switched at all, without threads?

Definitely, sounds good.  I agree that the keepalive patch is a very positive step in the
right direction.

Only problem is it doesn't fix other timeout problems addressed by the patch submitted in
this issue, so still looking forward to a complete solution including the
new-command-timeout blueprint.

> You also do clean up for connections that are expired, but that
> again does not justify using a thread per connection, IMO.

Right, agreed, the threads are just there for satisfying the driver's blocking calls. 
Cleanup code and other driver management were put in the same place for readability, but
could surely have been relocated elsewhere.

> To the timeouts, yes, I see now you use them [...]
> The problems here is that not work for named pipes
> and shared memory streams [...]

Neither does TCP keepalives, I imagine, but yes you're right (I think - haven't tested).

> This would throw NotImplemented exception.

Okay, that has to be fixed.  I'll add a correction to disable setting the timeout for np
and shm.  Thanks!
[10 Aug 17:17] Tony Bedford
Changelog updated as requested:

In the case of long network inactivity, especially when connection pooling was used,
connections were sometimes dropped, for example, by firewalls.

Note: The bugfix introduced a new keepalive parameter, which prevents disconnects by
sending an empty TCP packet after a specified timeout.
[11 Aug 11:01] d di
> In the case of long network inactivity,
> especially when connection pooling was used,
> connections were sometimes dropped,
> for example, by firewalls.

Hi Tony,

Hopefully the committed code also fixes the 53-second timeout when there should be a
3-second timeout such as described by the test case in this issue.