MySQL Bugs: #40684: Connector/NET does not obey Connect Timeout

Bug #40684	Connector/NET does not obey Connect Timeout
Submitted:	13 Nov 2008 5:08	Modified:	11 Aug 2009 9:01
Reporter:	d di (Basic Quality Contributor)	Email Updates:
Status:	Closed	Impact on me:	None
Category:	Connector / NET	Severity:	S1 (Critical)
Version:	Any	OS:	Any (Tested on 5.0.9)
Assigned to:	Vladislav Vaintroub	CPU Architecture:	Any
Tags:	qc

Description:
There's a web server.  There's a MySQL database server.  In between them sits a firewall.  In front of the web server are customers.  The customers complain that applications sporadically takes several minutes to load.

Checking the trace logs, the time is spent in MySqlConnection.Open().

Debugging commences.  First, try to connect from a cold versus a warm connection pool (press ENTER when asked):

========================================
15:10:05  Info: Connect Timeout=3
15:10:06  Fishing connection...
15:10:06  Success (id: 525).  Checking state...
15:10:06  Done (state: Open).  Releasing connection...

Ready to fish connection a second time - press ENTER.

15:10:08  Fishing connection...
15:10:08  Success (id: 525).  Checking state...
15:10:08  Done (state: Open).  Releasing connection...
========================================

Awesome.

There's a firewall in between the two hosts, and it will drop the pooled connection's TCP circuit after some time, to avoid session hijacking or whatever.  It's a common thing.  Simple firewalls will keep the connection alive if there's a MySQL wire protocol ping happening inside, more advanced firewalls won't.

We'll simulate the firewall by yanking the cable to the database server.  (Alternatively, you can open a console on the database server and null route the "web server"'s IP address when it says "yank cable" below - same effect.)

Now to make sure the connection timeout is working.  First, from a cold pool:

a) yank the cable to the database server and
b) run the test.

========================================
15:12:18  Info: Connect Timeout=3
15:12:18  Fishing connection...
15:12:22  Exception: Unable to connect to any of the specified MySQL hosts.
========================================

Notice the time stamps.  Times out after 3 seconds.  Perfect.

Next, test with a warm cache.  First:

c) reattach cable to database server
d) run the test (don't press ENTER).

========================================
15:14:28  Info: Connect Timeout=3
15:14:28  Fishing connection...
15:14:28  Success (id: 43).  Checking state...
15:14:28  Done (state: Open).  Releasing connection...

Ready to fish connection a second time - press ENTER.
========================================

Ok, it's warm... Now:

e) yank the cable to the database server again, then
f) press ENTER to continue the test.

========================================
15:14:35  Fishing connection...
15:15:28  Exception: Unable to connect to any of the specified MySQL hosts.
========================================

Fail.  53 seconds passed before the call to Open() timed out, which is 50 seconds too much.

Please fix :-).

How to repeat:
See description, use soon-to-be-attached test script.

Suggested fix:
How a connection pool should work:
 1) Keep a number of connections open.
 2) Connection attempts time out after "Connect Timeout" seconds.
 3) When connections are released back to the pool, run "RESET CONNECTION STATE" on server.
 4) While connections are in pool, run a wire protocol ping every 5 seconds to ensure they're hot.
 5) The wire protocol ping should time out according to the "Connect Timeout" variable.
 6) When a connection is fished, deliver the one that was most recently pinged.

Note 1):
  Minimum and maximum number of connections in pool must be customizable (as is now - great).

Note 2):
  "Connect Timeout" must work, regardless if a TCP connect() is too slow, or a MySQL handshake is too slow.  Examples.  A connect() timeout could happen because of packet loss, a greeting timeout could happen because of reverse DNS failure on server.

Note 3):
  If the server does not support "RESET CONNECTION STATE", the connections must be torn down and rebuilt when they're released to the pool.  An option should be supplied, let's call it "Force Reuse=True", that causes the pool manager to ignore residual server state (temporary tables, variables, what not) and just keep the connections around without resetting them.  The purpose of this is to prevent starvation of TCP ports due to TIME_WAIT.

Note 4):
  This ensures that ready-to-use connections are delivered without delay and with maximum reliability.

  Time between pings should be customizable to allow tuning for most reliable pool versus minimum server load.  Henceforth "Ping Interval=5".

  The initial connection counts as a ping, so at least 5 seconds (see above) must pass before the first ping.

Note 5):
  If a connection dies in this state, it is simply rebuilt.  The ping thread must start pings asynchronously, so (for example) a multi-thread safe callback is necessary to handle when a ping fails.  A boolean flag on each connection provides an indication as to whether a given in-pool connection is currently being pinged.

Note 6):
  Algorithm assures the most distance between the background ping thread and the front of the connection pool.

  If there are no connections available in the pool, and the maximum size has been reached, an exception is thrown.

  Otherwise, there will naturally be either an available connection, pinged less than 5 seconds ago available for consumption, or a connection currently being pinged.  Whichever is available is grabbed.  Once grabbed, the connection is no longer eligible for automatic pinging.  If the connection grabbed by the pool manager is currently doing a ping, the pool manager waits for the ping to complete.  Instead of the standard ping-fail handler (see above) rebuilding the connection if the ping fails, what happens instead when a connection is grabbed is that any exception is forwarded to the waiting application.

Go! :-).

test case

Attachment: issue40684.cs (text/plain), 1001 bytes.

Suggestion note 6) continued...

An argument against prioritizing the most recently pinged connection would be that it makes it harder to figure out if there are more connections in the pool than there need to be.

For example, if the pool of available connections were prioritized (for client grabbing) by most-recently-freed connection first, then there would naturally be a tail of connections that never gets sent into active duty and therefore receive no queries when the pool is slightly too large.  From there it's simple to add a threshold and depopulate the pool when the tail connections linger for too long.

Talk is cheap..

I've hacked together a working prototype of a connection pool which accomplishes some critical goals in the most optimal way I could think of.

It focuses on:
- simplicity,
- maximum availability of guaranteed "hot" connections,
- minimal use of resources, and
- security.

Features overview:
* Fetching a pooled driver is O(1).
* Releasing a pooled driver is O(1).
* Automatic tear-down of unused drivers (can be disabled).
* Automatic tear-down of unused pools (can be disabled).
* Connections are pinged once in a while, if they idle in the pool for too long (can be disabled).
* For high activity applications, connections are (as a natural consequence of the algorithms) practically never pinged.
* Failed in-pool connections are automatically recycled.
* Failed out-of-pool connections expose the problem to the end application for handling there.
* Support for server-side connection state flush. Must be enabled with a switch when the server grows support.

For details, see Prototype.cs.
Should be reasonably commented for those unaccustomed to the particular coding style.

In Prototype.cs, the following classes are mostly stubs:
1) DsnParameters
2) MySqlDriver
3) MySqlConnection

Specifically the parts that have no relation to the pooling mechanism are unimplemented.

Of course, this code also relies on the underlying driver's Ping() to fail after 'Connect Timeout' seconds. This just fixes the need to do Ping() in the context of returning a driver to the client. And also a few other things I noticed along the way.

Improvements compared to current connection pool logic:
* Passwords and logins are no longer retained in RAM forever.
* Timeouts will no longer fail if the system time changes.
* Unused resources are freed for garbage collection.
* Server problems never affect the pooling logic itself.
* Connections are kept hot without the need to ping when a client fetches from the pool.

Separate issues from this one, but I haven't bothered creating bugs in the tracker. The last item pretty much corresponds to issue #36194.

The improved logic might also fix issues #40036 (pooled connections do not destruct), #31996 (resources not released), #37159 (pool fails completely after server restart), and the Connector/NET part of issue #33540 (connection state flush).

improved connection pool logic

Attachment: Prototype.cs (application/octet-stream, text), 22.56 KiB.

Revision 2 (testing).

Code changes:
 a) Removed stubs and bindings and added interfaces, so that the connection pool logic is separate from the core MySQL driver.
 b) Added a null driver + connection string.
 c) Added a test program, uses the connection pool with null driver to blast off a bunch of connections.
 d) Fixed a couple of bugs found during testing.

To test the new connection pool,
 1) Run the test program.
 2) Copy/paste the debug output from the 'Output' window into a text file.
 3) Use the Unix 'sort' command to sort by connection id and event nr.

Output from the null driver has 4 columns:
	"connection id", "connection event number", "current time" and "event description".

To implement the new connection pool,
 1) Drop the three relevant source code files into Connector project.
 2) Implement IPoolConfiguration and IPoolableDriver on ConnectionString and Driver classes.
 3) Create a static variable to hold a reference to the connection pool manager and initialize it statically.
 4) Make sure the Connection class uses ConnectionPool.{Catch,Release} to "create" and "destroy" drivers.

That's basically it.

improved connection pool logic, revision 2

Attachment: Prototype2.zip (application/zip, text), 16.14 KiB.

Hi David and thanks for another excellent report.

Would you mind testing 5.2.4 against your code as it's GA. I do not know what to do with 5.0 any more.

¿5.2 is GA?  Last I heard from MySQL, the Connectors follow the Server version numbers.  Server 5.0 is currently GA, so Connector/NET 5.0 must be the GA version.

(Just re-checked, nothing says otherwise at the download pages for either of Connector/NET {1.0, 5.0, 5.1, 5.2}.)

Re-tested, results are the same for 5.2.4:

========================
Ready to fish connection a second time - press ENTER.

13:03:24  Fishing connection...
13:04:50  Exception: Unable to connect to any of the specified MySQL hosts.
========================

Well, they're worse actually, this time it was 83 seconds over the timeout threshold for Open().

It also seems like the new driver causes the test application to hang when it's re-throwing the MySqlException, for some reason...

David, connectors do not have to follow server version. As a matter of fact, they usually don't do so. c/J and c/ODBC GA is 5.1 for example.

Checking your test case.

Okay - it would be nice if you could update the download pages to say so.  Or just to say anything at all about what the difference between the versions are.

Btw (forgot the formalities), Hi Tonci, and no problem with the report ;-)

The main Connectors documentation has been updated with a note to the effect that:

"Connector version numbers do not correlate with MySQL Server version numbers."

An initial version of a version compatibility matrix has also been added to the Connectors documentation landing page:

http://dev.mysql.com/doc/refman/6.0/en/connectors-apis.html

These changes will take effect within the next 24 hours.

Has this been fixed in any version of the Connector/NET?

David,
I would be nice if you could provide a proper svn diff file for these changes in order for the rest of us, and MySql to be able to test this.

Thanks

The source package didn't compile after downloading.  Can't remember what the error was, but clearly if it was supposed to be compileable by a third party, someone hadn't done their job properly and tested if you could actually follow build instructions on a blank installation of Windows.  So early on I gave up on using the established driver as a starting point.

However, the connection pool manager is completely self-contained, as you can see by the included NullDriver that uses it.  Handy for testing the code, and also means that a diff is largely unnecessary.  You should just need to add IPoolConfiguration and IPooledDriver to existing config string and driver classes and click 'Compile', and you should get a short list of glue methods that need to be implemented.

David, thanks for the comments. Just another question though, if you could not compile the code, are you not using the MySql .Net Connector?, or you just did not implement your prototype?

Thanks

We're using the Connector, and we're still having the problem described in this bug.

The connection pooling code in prototype2.zip works, but was never integrated into the driver, as the driver did not compile (and the problem causing this was non-trivial to fix).

Hope that clarifies?

driver with alternative connection pool logic integrated

Attachment: connector_net_526_with_alternate_pooling.rar (application/x-rar-compressed, text), 482.09 KiB.

Hi Christos (+ rest!)

I've uploaded a .rar (because of size limits in Eventum) of a 5.2.6 driver with the alternative connection pool logic on top.

You can get a svn diff by unpacking the .rar and diffing against http://svn.mysql.com/svnpublic/connector-net/tags/5.2.6, revision 1591.

(You should also be able to use svn to rebase it to v6.0.4 if need be.)

Yup :-)

David, you sure posed a lot of questions. Main one, as it appears to me, is delay in releasing connection. I believe this is done because protocol is waiting for this or that parameter to exceed, so nothing to do with c/NET.
As for alternative approach to pooling, we are discussing some solutions in a bug posted by Christos (I think). Please see in BugsDB.

I will consult others regarding this.

> Main one, as it appears to me, is delay in releasing connection.

Actually, the subject of this bug is a 50 second delay when opening a connection, and the connection timeout is set to 3 seconds.

> I believe this is done because protocol is waiting for this or
> that parameter to exceed, so nothing to do with c/NET.

Any more details?  Could you post a link to a bug or similar?

> I will consult others regarding this.

Sounds good.  I think that the solution I posted here works well with TCP/IP and timeouts, but there's still lots of room for improvement.  For example,
 - support for a "Min Pool Free" parameter for enterprise applications that require a warm connection pool on startup,
 - more advanced asynchronous primitives in the driver would allow using fewer dotnet threads.

There's also a good deal of design flaws in the wire protocol itself, which makes it difficult to do things the right way in the driver in a lot of places.

Wlad took first look into proposed code changes (which is quite huge). Waiting on more input.

Pooling part proposed changes described in short in Bug#40036, [27 Mar 17:33] Tonci Grgin.

Wlad, Reggie, please see Bug#40036 and related reports too.

Hi David,
I read the bug description once again. So you problem are firewalls that drop connections due to long inactivity. I believe the simplest thing possible to prevent it is to use TCP keepalive on underlying sockets (with some realistic timeout and interval values). I'm aware it does not solve the problem  when the network is down, but this was not the actual problem, is this correct?

Vladislav> ... use TCP keepalive ...

Back in the day, I tried using keep-alives (by lowering the Tcpip/Parameters/KeepAliveTime value, I think), but ran into a technicality which meant that it was unusable as a workaround.  Can't remember exactly what it was..

Vladislav> ... this was not the actual problem, is this correct?

The problem originally described in this bug is that the driver fails to time out as instructed to in the connection string, under certain circumstances.

Hi Vlad (again!)

Rummaged through various old email and actually managed to find out why TCP keep-alive wasn't useful:

The TCP/IP stack only sends keep-alives if the application itself asks for it using setsockopt(SO_KEEPALIVE).  Since Connector/NET doesn't do that, adjusting the stack's timeouts will do absolutely nothing.

Also, I think it's a bit of a stretch to ask everyone using Connector/NET to bend and mend their TCP/IP stacks.  It would be a lot less strain on users if the pooling mechanism was fixed once and for all to be sufficiently intelligent to handle network outages (of whatever kind) (for example using the algorithm / driver posted here).

d di,the reason why I was asking is that I'm going implement the keepalive within the connector itself (not that I would expect you to do something with connetor.net sockets ;)

Also I do appreciate the effort that is done with the patch (besides I'd like to note the bug description, reproducible case are one of the best I ever read:).  I leave it up to Reggie to decide what to do with the patch (it is big, and unfortunately bigger patches have less probability to be accepted).

On the other hand TCP itself does provide a mechanism to deal with sort of problems you encounter without need to "ping" on the application level. So I think it would solve firewall problem adequately. I do plan to provide keepalive timeout and interval parameters for .NET users (for Mono it does not seem to be possible), so that users would not need to change anything in their registry.

Ah, I see.  Yeah, sounds doable albeit MS-Windows only with a SIO_KEEPALIVE_VALS ioctl :-).

I have a feeling that the algorithm presented in this issue is cleaner than the current official one, and may as such also be a fix for various other pooling bugs.  But I haven't got anything to substantiate.

I'm a bit curious if there are any technical reasons why you decided to go a different direction rather than using the code posted here.  If you have found or thought of some problem with it, please post!

WRT size, most of the patch is ripping out the pooling logic and adding a couple of files containing new logic.  The rest of it is related bugfixes here and there.  Let me know if you need hints on why parts of it are there, or whatever.

Also, isn't this what a test suite is for?  Apply the patch, if it passes all tests, nothing important broke ;)

>Ah, I see.  Yeah, sounds doable albeit MS-Windows only with a >SIO_KEEPALIVE_VALS ioctl
> :-)

Near:) I thought about something like described here http://blogs.msdn.com/lcleeton/ . the problem with Mono is that Socket.IOControl is not implemented. So we may leave  Mono users only with keepalive on/off for the time being  and .NET users will also get the keepalive timeout. 

>I have a feeling that the algorithm presented in this issue is cleaner than >the current official one, and may as such also be a fix for various other >pooling bugs.  But I haven't got anything to substantiate.

>I'm a bit curious if there are any technical reasons why you decided to go a different direction rather than using the code posted here.  If you have found or thought of some problem with it, please post!

Well, the keepalive path is result of my recent study how tcp works:) The general problem with sockets is that the blocking calls like send/recv could hang almost infinitely, unless send/recv timeouts are used or TCP finds an error when sending keepalive packet. read/write timeouts are the most clean solution and we are currently looking to implement that, and for shared memory and pipe connections too. Keepalive has yet another fine property, besides being able to detect network error, and this is exactly your case, it prevents firewalls from dropping inactive connections. This could be a biased opinion, but I do prefer a protocol layer ping to application level ping, this certainly makes programming easier and removes some burden from the server to handle ping request and from client to check for the result of that requests.

So, when timeouts are fixed, we could take a look at other problems with pooling..

Ok, to the problems in the patch : using a thread per driver, and that thread ill also sleep most of the time. Hmm... not sure it can be called the most efficient resource usage

> the problem with Mono is that
> Socket.IOControl is not implemented

I guess it could always be probed for with reflection and turned on when/if the method pops into existence.

Let me know when you have some code ready, so we can test against the original problem examined in this bug report.

> Ok, to the problems in the patch

Yep, let's address those.

> not sure it can be called the
> most efficient resource usage

Memory-wise, threads are limited to using a few kilobytes of RAM each.  This is done by adjusting their stack very low.  In effect they consume so little that memory pressure is a non-issue.

Currently set to 64 KiB per thread, although I haven't found a lower limit yet so there's probably no hinderance to decreasing it much further if you wanted to.

> using a thread per driver, and that
> thread ill also sleep most of the time.

Actually, that depends on the situation.  With a working server and low RTT, yes, threads will be asleep most of the time, only to be awoken for short periods of activity when they're needed.

But when half of the connections are unavailable, due to for example a firewall, 50% of the threads will actually be running.  If a server is completely gone, it may be as much as 100% of the threads that are active.

(Having 100% of the threads active does not mean that they're actually consuming CPU cycles.  In the case of every thread active, they'll be sitting in a blocking socket call waiting for a ping response, so even then they'll still consume ~0% CPU.  But none of them will be in the Sleep state in this situation.)

The scheduler will have a few more threads to juggle, yes.  I imagine that the CPU scheduler is pretty darn efficient when it comes to keeping sleeping and io-blocked threads out of the scheduling loop, as far as I can imagine this is not a problem.  I haven't seen any issues here while testing, either.

It is a necessity to have one thread per driver given how the current driver works.  I didn't want to rewrite the whole driver, which is why I settled on this solution.

It would be possible to do a later rework of the driver and pooling such that there is primitives in the driver for doing non-blocking BeginConnect() and BeginPing(), and just one thread in the pooling algorithm that handles every single driver.  Bigger change, less likely to be accepted ;-), but definitely possible.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/79964

723 Vladislav Vaintroub	2009-08-04
      Bug #40684 Connector/NET does not obey Connect Timeout.
      The actual problem in the bug report is that pooled connections are dropped  by firewalls due to long inactivity and that results in longer connect times.
      
      The patch introduces 2 new parameters: "keepalive" (boolean, sets keepalive option on socket) and "keepalive timeout" (integer, number of seconds before first keepalive packet is sent).  "keepalive timeout" does not automatically switch "keepalive" on, both options need to be specified to set the timeout.
      
      One can use this parameters to prevent inactivity drops and also to detect broken connections faster.
      Note that  "Keepalive timeout"  is not currently available in Compact Framework nor in Mono ( will be silently ignored there and connection will use the default timeout what is normally 2 hours)

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/80086

738 Vladislav Vaintroub	2009-08-04
      Bug #40684 Connector/NET does not obey Connect Timeout.
      The actual problem in the bug report is that pooled connections are dropped  by
      firewalls due to long inactivity and that results in longer connect times.
            
      The patch introduces new parameter: "keepaliv" (integer, number of seconds before first keepalive packet is sent). 
      One can use this parameters to prevent inactivity drops and also to detect broken
      connections faster. Note that setting non-default timeout  is currently not possible with  Compact Framework nor with in Mono. Here, if keepalive option is used, the timeout would be defined by the OS default, which is normally 2 hours.

Hi David,
the keepalive option is now in the repository. It is hosted in Launchpad this days, so you need to install bzr and do 
bzr branch lp:connectornet/6.0
to get it.

(you can try  bzr branch lp:connectornet/6.1 or just bzr branch lp:connectornet
to get trunk)

The new option that is aimed to fix your problem is "keepalive=<seconds>".
with value of 1 for seconds I get the IO exception in your test case almost immediately. Admittedly, it is not how the option should be used in your case, as you just want to prevent firewalls to do bad things with the connection, so you might want to have timeouts longer that 1 second to reduce network traffic.

The general fix for timeouts is a tick larger and it will probably go into 6.2. The patch for it is on http://lists.mysql.com/commits/79914, it has not been reviewed yet and hence not pushed. If you want to play with it, apply against current trunk (i.e bzr branch lp:connectornet)

Regarding large number of threads- I do not think I can judge about how schedulers deal with sleeping threads, and I noticed you used  64K stacks,  is the minimum possible on Windows. But I do think unless there is a very good reason for it, we should follow general recommendations and use thread pool for background tasks when possible (when not possible, maybe own thread pools with thread = <small constant>*CPUs)

An entry was added to the 6.0.5 and 6.1.1 changelogs:

When reusing a connection from the connection pool, Connect Timeout was not obeyed when trying to reopen the connection. Although Connect Timeout was set to 3 seconds, it took 53 seconds before the call to Open() timed out.

> I noticed you used  64K stacks,
> is the minimum possible on Windows.

Could be.  Judging by the documentation, if the interpreter happens to run dotnet threads as native OS threads rather than user-mode threads, and the OS granularity dictates a minimum of 64 KiB.  The documentation tends to be imprecise, and I haven't tested how the de-facto Microsoft implementation works on any version of MS-Windows, so I don't know for sure what the minimum is.

> But I do think unless there is a very
> good reason for it, we should follow
> general recommendations and use thread
> pool for background tasks when possible
> (when not possible, maybe own thread pools
> with thread = <small constant>*CPUs)

My goal was a solution that's guaranteed to work.  Thread pools introduce added complexity and overhead that makes it harder to predict how the final algorithm executes, which is why I avoided it.

Fast async primitives for connect and ping would allow a logic change so that only one thread is needed for handling drivers, completely eliminating the need for a thread pool, and I think that's a better way.  If avoiding threads indeed turns out to be desirable under actual usage.

> bzr branch lp:connectornet/6.0 to get it.

Do you have a buildbot that produces binaries somewhere?

> The new option that is aimed to fix your
> problem is "keepalive=<seconds>".  with
> value of 1 for seconds I get the IO exception
> in your test case almost immediately.

Okay, so your approach fixes this problem but causes another one: increased network traffic.

Empty packets, but still one per MySQL connection per second, and each packet still has to be handled by any intermediary device that handles traffic on a per-packet or higher level.

Doesn't seem fair to our network administrators to cause such a massive overhead, especially considering that we have a fine solution that works without it.

> Admittedly, it is not how the option should be used in
> your case, as you just want to prevent firewalls to do
> bad things with the connection, so you might want to
> have timeouts longer that 1 second to reduce network
> traffic.

I assume that by increasing the timeout, you would again cause the test case in this issue to fail, and that's why you selected 1 second, is this correct?

> The general fix for timeouts is a tick larger
> and it will probably go into 6.2.

Funny story, I heard the exact same said about this issue and 6.0 a few months before that version was released :-).

On a serious note, why are you closing this issue if it's not fixed yet until 6.2?

> Do you have a buildbot that produces binaries somewhere?
No, unfortunately launchpad does not provide build bots (yet?). I can send you the dll if you wish.

>Okay, so your approach fixes this problem but causes another one:
increased network traffic

Exactly as yours:) You're using driver.Ping() on inactivity, I'm using keepalive packet... The keepalive timeout is configurable, so network traffic is controlable. The 1 second keepalive is an example that I would not recommend anyone to use when it should only stop firewalls. Maybe the 2 hours are ok, maybe 30 minutes or maybe 10 minutes, I do not know. People  get a parameter they can adapt for their environments.

>I assume that by increasing the timeout, you would again cause the test
>case in this issue to fail, and that's why you selected 1 second, is
>this correct?

I use 1 sec  to verify that keepalive works as advertised.I like shorter tests:)

>On a serious note, why are you closing this issue if it's not fixed yet
until 6.2?

Well, if you read the bug title and the bug description you'll see slight descrepancy. The bug title says "connection timeout is not obeyed", the longer description says "firewalls drop connection because of inactivity". keepalive fixes the second problem, which I understood is the real problem in your case (and which your own patch was aimed to fix). We normally do not solve more than a single problem at once. So I'm closing this bug, it fixes problem in "long description".

The blueprint for general timeout handling is created here
https://blueprints.launchpad.net/connectornet/+spec/new-command-timeout. It also has the code attached to it and you get the branch with
bzr branch lp:~vvaintroub/connectornet/timeouts  This blueprint is dedicated solely to timeouts. It is not about idle connections/firewalls nor about pooling (just in case:)

But sure enough, you can also open another bug if the fix with keepalive parameter does not solve the problem you _described_ (which is not exactly the title of this bug:)

> I can send you the dll if you wish.

Thanks, please do!

>> Okay, so your approach fixes this problem
>> but causes another one: increased network traffic
>
> Exactly as yours:)

I disagree, there's a considerable difference between the defaults suggested: a ping every 30 seconds versus one every second.

>> I assume that by increasing the timeout, you would
>> again cause the test case in this issue to fail,
>> and that's why you selected 1 second, is this correct?
>
> I use 1 sec  to verify that keepalive works as advertised.

That didn't answer the question though: is the issue fixed or not if the keepalive is set to a reasonable value rather than 1 second?

>> On a serious note, why are you closing
>> this issue if it's not fixed yet until 6.2?
>
> Well, if you read the bug title and the bug
> description you'll see slight descrepancy. The
> bug title says "connection timeout is not obeyed",
> the longer description says "firewalls drop
> connection because of inactivity".

That's a bit of a misunderstanding of how issue reporting is done.  It's not a "long description" that you're quoting, it's a "test case".

A test case may not completely, accurately or in whole represent the issue at hand.  It is merely a piece of code provided as proof that there is an actual problem.

(With simpler issues, the test case may happen to cover the entirety of the problem with complete accuracy.  That's more of a lucky circumstance.)

> keepalive fixes the second problem, which I
> understood is the real problem in your case
> (and which your own patch was aimed to fix).

The end goal is that every single Connector/NET timeout should work as advertised.  That is what the patch posted in this issue accomplishes.

It also represents an overall algorithm that provides a much better guarantee for warm connections than the currently used algorithm.

> We normally do not solve more than a single
> problem at once. So I'm closing this bug, it
> fixes problem in "long description".

Okay.  While fixing the test case rather than the underlying design problem doesn't really solve my issue, I guess it's fair enough to close an issue once there's no longer any evidence that a problem exists.

> The blueprint for general timeout handling is created here
> https://blueprints.launchpad.net/connectornet/+spec/new-command-timeout.

I'm glad to see that you're working on it!

(I fail to understand why you'd want to re-solve a problem that is already solved by the patch submitted in this issue, rather than working with the patch, but whatever.)

> But sure enough, you can also open another
> bug if the fix with keepalive parameter does
> not solve the problem you _described_ (which
> is not exactly the title of this bug:)

I think I've covered this further above.  In any case, here is a more accurate description for this issue:

"Every single one of the specifiable timeouts in Connector/NET fail to work under a myriad of various common circumstances."

"I'll refrain from posting a test case.  It should be obvious to anyone that with the sheer amount of fail given by the description, this is not a simple bug but a completely broken design."

I hope that the revised description can provide you with a clear perspective on what the problem at hand is :-).

David, hopefully we can continue this discussion offline withou hijacking the bug that is in documenting state.

The dll is sent, together with source.

>>> Okay, so your approach fixes this problem
>>> but causes another one: increased network traffic
>>
>> Exactly as yours:)

>I disagree, there's a considerable difference between the defaults
>suggested: a ping every 30 seconds versus one every second.

I did not suggest that. Again, this is the value I used for quick testing, nothing more implied. If you want 30 seconds timeout, use it, it is possible. ´

>That didn't answer the question though: is the issue fixed or not if
>the keepalive is set to a reasonable value rather than 1 second?

IF all timeouts issue would be fixed with this patch I would not need to create blueprints. OF course, if I set the timeout to larger value, broken connection is found later.In your proposed solution, there is neither magic that would determine a broken connection instantl (you'd need to ping the connection instantly for this)

The firewall dropping connections should be fixed though.  I re-read the bug description. It tells a story of firewalls dropping connections and repeats the word "firewall" 4 times, so I have hard time to believe it was not your issue. 

The patch you attached fixes firewall problem by keeping connection alive, and it is using driver.Ping() is not it? Besides it Ping()'s to detect broken connections. So now what TCP keepalive is designed for? Exactly for the same thing, detecting dead peers and preventing connection drops.It does not need many threads, it does not need programming efforts and moreover, I can rely on the fact that keepalive doesnot hang. 

I cannot rely on driver.Ping() that it does not hang. Why not? a Ping is a network send() and a network recv(). And if connection is broken while it recv()'s , TCP stack is not guaranteed to abort your connecton. It can hang infinitely. Please check the relevant discussion http://forums.sun.com/thread.jspa?threadID=539297&start=0&tstart=0

So now imagine unlucky situation where all of the numerous threads in your implementation Ping and all of them hang.

Contrast with keepalive timeout. Even if connection is half-open and hangs in recv(), TCP will abort the connection after failing retransmissions.

So now I hope I explained why keepalive helps not only to keep the connection online but also to resolve hangs. This option however is not a full magic, to support ConnectionTimeout and CommandTimeout we need something better  and this is the work I mentioned It cannot be resolved by having application level Ping's as in the patch attached. This needs to be solved be setting read and write timeouts on sockets instead and let the TCP stack handle those issues.

> hopefully we can continue this discussion
> offline withou hijacking the bug

Ah yes, definitely, this is hardly the place for discussing what a test case is and is not.

> IF all timeouts issue would be fixed with
> this patch I would not need to create blueprints.

Unsure which one of the patches we are talking about here.
I believe all timeout issues are fixed in the patch I submitted, at least.

> I cannot rely on driver.Ping() that it does not
> hang. Why not? a Ping is a network send() and a
> network recv(). And if connection is broken while
> it recv()'s , TCP stack is not guaranteed to abort
> your connecton. It can hang infinitely.

Not if you're using my patch.  There's a network receive timeout which is used to ensure exactly that this does not happen.

> So now imagine unlucky situation where all of the
> numerous threads in your implementation Ping and
> all of them hang.

If the network is completely gone, each thread will block for exactly [Connect Timeout] seconds.  After that, an Exception is delivered to the application, if it has acquired the particular driver that is failing.  If not, the driver is simply removed from the pool.

> Contrast with keepalive timeout. Even if connection
> is half-open and hangs in recv(), TCP will abort the
> connection after failing retransmissions.

End result is the same, except that the patch submitted in this issue has one less timeout value for the user to keep track of.

> So now I hope I explained why keepalive helps not only
> to keep the connection online but also to resolve hangs.
> This option however is not a full magic, to support
> ConnectionTimeout and CommandTimeout we need something
> better and this is the work I mentioned It cannot be
> resolved by having application level Ping's as in the
> patch attached. This needs to be solved be setting read
> and write timeouts on sockets instead and let the TCP
> stack handle those issues.

Sure it can?  Try the patch, it's already solved.

Oh, I see. It is really better to send patches as unified diffs,I was stuck in ConnectionPooling, and you got a lot of different other changes as well.

To the connection pooling: Am I right saying that the threads merely simulate in application non-configurable 10 seconds TCP keepalive? Instead we can do user-configurable TCP keepalive, that by default not switched at all, without  threads? You also do clean up for connections that are expired, but that again does not justify using a thread per connection, IMO. Oracle .NET connector documents that it cleans up connections with expired ConnectionLifetime every 3 minutes, I bet they use an ordinary Timer.

To the timeouts, yes, I see now you use them in
 baseStream.ReadTimeout = left
in MySqlStream.cs

The problems here is that not work for named pipes and shared memory streams, that also need to support command timeouts. This would throw NotImplemented exception.

> It is really better to send patches as unified diffs,

Code based on http://svn.mysql.com/svnpublic/connector-net/tags/5.2.6, revision 1591.  A unified diff is produced by dumping the submitted archive on top of a checkout of that revision, and running "svn diff".

> non-configurable 10 seconds TCP keepalive?

Actually, looking at the code again, the timeout is configurable.

> Instead we can do user-configurable TCP keepalive,
> that by default not switched at all, without threads?

Definitely, sounds good.  I agree that the keepalive patch is a very positive step in the right direction.

Only problem is it doesn't fix other timeout problems addressed by the patch submitted in this issue, so still looking forward to a complete solution including the new-command-timeout blueprint.

> You also do clean up for connections that are expired, but that
> again does not justify using a thread per connection, IMO.

Right, agreed, the threads are just there for satisfying the driver's blocking calls.  Cleanup code and other driver management were put in the same place for readability, but could surely have been relocated elsewhere.

> To the timeouts, yes, I see now you use them [...]
> The problems here is that not work for named pipes
> and shared memory streams [...]

Neither does TCP keepalives, I imagine, but yes you're right (I think - haven't tested).

> This would throw NotImplemented exception.

Okay, that has to be fixed.  I'll add a correction to disable setting the timeout for np and shm.  Thanks!

Changelog updated as requested:

In the case of long network inactivity, especially when connection pooling was used, connections were sometimes dropped, for example, by firewalls.

Note: The bugfix introduced a new keepalive parameter, which prevents disconnects by sending an empty TCP packet after a specified timeout.

> In the case of long network inactivity,
> especially when connection pooling was used,
> connections were sometimes dropped,
> for example, by firewalls.

Hi Tony,

Hopefully the committed code also fixes the 53-second timeout when there should be a 3-second timeout such as described by the test case in this issue.