MySQL Bugs: #45471: retry connecting to a lost MEM mysqld

Bug #45471	retry connecting to a lost MEM mysqld
Submitted:	12 Jun 2009 15:11	Modified:	28 Jul 2009 13:38
Reporter:	Lig Isler-Turmelle	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Enterprise Monitor: Server	Severity:	S4 (Feature request)
Version:	2.1	OS:	Any
Assigned to:	Sloan Childers	CPU Architecture:	Any
Tags:	windmill

Description:
Currently, if the MEM mysqld backend is down (and no proxy is being used), Tomcat keeps retrying the connection to mysqld for a while (50 attempts or 180 seconds, whichever comes first).

We think it should keep trying for longer. For example, suppose we have a network outage (say, a failed firewall). This breaks connectivity between the MEM server and its mysqld, which shuts Tomcat (and hence ALL front-end reporting) down.

We would like Tomcat to keep trying rather than shut down, even if, after the first "rush" to reconnect, it then only tries once every 30 seconds or so, being sure to log the problem, of course.

How to repeat:
Shut down MEM's mysqld, leaving the rest up and running.

Suggested fix:
Let MEM have more time to reconnect to the database.

The point of retrying for longer is to make the system more resilient in the event of an unexpected failure. If mysql goes away, the DBA only has to worry about getting it up again and does not have to check that merlin is still running. (The DBA would expect it to do as well as possible under the circumstances, and to recover once the database is reachable again.)

The comment about the delayed retry interval is to avoid Tomcat generating a "connect storm", which you may want to avoid if the database does not reside on the same host.
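
To make the requested behaviour concrete, here is a minimal sketch of a two-phase retry schedule: a short burst of quick attempts, then one attempt every 30 seconds or so. This is only an illustration of the request, not MEM's actual code. The class name ReconnectSchedule, its method, and the burst size and delays used below are all hypothetical.

```java
import java.time.Duration;

/**
 * Hypothetical two-phase retry schedule (not MEM's implementation):
 * the first few attempts come quickly, later ones are spaced far apart
 * so that a remote database is not hit by a "connect storm".
 */
final class ReconnectSchedule {
    private final int burstAttempts;     // attempts made during the initial "rush"
    private final Duration burstDelay;   // pause between attempts in the rush
    private final Duration steadyDelay;  // pause between attempts afterwards

    ReconnectSchedule(int burstAttempts, Duration burstDelay, Duration steadyDelay) {
        if (burstAttempts < 1) {
            throw new IllegalArgumentException("burstAttempts must be >= 1");
        }
        this.burstAttempts = burstAttempts;
        this.burstDelay = burstDelay;
        this.steadyDelay = steadyDelay;
    }

    /** Returns how long to wait before the given attempt (1-based). */
    Duration delayBefore(int attempt) {
        if (attempt < 1) {
            throw new IllegalArgumentException("attempt must be >= 1");
        }
        if (attempt == 1) {
            return Duration.ZERO;        // first attempt happens immediately
        }
        return attempt <= burstAttempts ? burstDelay : steadyDelay;
    }
}
```

For example, new ReconnectSchedule(5, Duration.ofMillis(500), Duration.ofSeconds(30)) would make five quick attempts half a second apart and then fall back to one attempt every 30 seconds. A real caller would also log each failed attempt, as the request asks.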

We'll make the timeout(s) configurable. It can't be "forever", because the application will spool data or eventually run out of resources and then die; however, different deployments will of course have different constraints.

Notice that the retry code currently runs every time a connection is pulled from the connection pool, so this isn't just a startup condition. If mysqld is gone and the retries fail, the calling thread (agent or UI) will get an exception and be able to retry. Making the timeout configurable will keep the application from keeling over with out-of-memory or thread errors, while remaining flexible for those who can trade memory for longer timeouts.

Could make it user-configurable, but that would require some tricky testing to see how long the DB could be unreachable before hitting some tipping point (it would not be safe to retry indefinitely). Will consider implementing & documenting as a "user beware!" setting.

BUG#45471 retry connecting to a lost MEM mysqld
- make the db connect retries and timeout configurable via config.properties
- mysql.max_connect_retries
- mysql.max_connect_timeout_msec
- the default retry count is currently 50 (unchanged)
- the default retry timeout is currently 180 seconds (unchanged)
- the way this works is whichever runs out first: number of retries OR number of msecs attempting retries
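
The "whichever runs out first" rule can be sketched as a loop bounded by both an attempt count and an elapsed-time budget. This is a minimal illustration under assumptions, not the patched MEM code: the class BoundedRetry, its method signature, and the injectable clock are hypothetical, and the sketch does not sleep between attempts.

```java
import java.util.concurrent.Callable;
import java.util.function.LongSupplier;

/** Hypothetical helper: retry until success, the attempt limit, or the time limit. */
final class BoundedRetry {
    /**
     * Runs the attempt until it succeeds, maxAttempts have been made, or
     * maxTimeoutMsec has elapsed, whichever comes first.
     * clockMsec is passed in so the time limit can be tested without waiting.
     */
    static <T> T call(Callable<T> attempt, int maxAttempts, long maxTimeoutMsec,
                      LongSupplier clockMsec) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long deadline = clockMsec.getAsLong() + maxTimeoutMsec;
        Exception last = null;
        int made = 0;
        do {
            made++;
            try {
                return attempt.call();   // success: hand the result to the caller
            } catch (Exception e) {
                last = e;                // remember the failure and maybe try again
            }
        } while (made < maxAttempts && clockMsec.getAsLong() < deadline);
        // Both limits are checked after every failure; the caller gets an
        // exception and can decide to retry later.
        throw new Exception("gave up after " + made + " attempt(s)", last);
    }
}
```

With the defaults listed above, a caller might invoke BoundedRetry.call(openConnection, 50, 180_000L, System::currentTimeMillis), where openConnection stands for whatever code actually obtains the connection.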

Patch installed in versions >= 2.1.0.1063.

An entry has been added to the 2.1.0 changelog:

If the Service Manager lost connection to the repository server, it would shut down after 50 attempts to reconnect or if it was unable to reconnect within 180 seconds. This behavior has now been made configurable through parameters in the config.properties file. The parameters are:

* mysql.max_connect_retries - default is 50.

* mysql.max_connect_timeout_msec - default is 180 seconds (180000 msec).
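
As an illustration of how the two parameters might be consumed, the sketch below reads them from a java.util.Properties object and falls back to the documented defaults when a key is absent. The class ConnectRetrySettings is hypothetical and is not taken from MEM. The fallback of 180000 assumes the value is expressed in milliseconds, as the _msec suffix suggests.

```java
import java.util.Properties;

/** Hypothetical holder for the two retry settings from config.properties. */
final class ConnectRetrySettings {
    static final String RETRIES_KEY = "mysql.max_connect_retries";
    static final String TIMEOUT_KEY = "mysql.max_connect_timeout_msec";

    final int maxConnectRetries;
    final long maxConnectTimeoutMsec;

    private ConnectRetrySettings(int maxConnectRetries, long maxConnectTimeoutMsec) {
        this.maxConnectRetries = maxConnectRetries;
        this.maxConnectTimeoutMsec = maxConnectTimeoutMsec;
    }

    /** Reads both settings, using 50 retries and 180000 msec when a key is missing. */
    static ConnectRetrySettings from(Properties props) {
        int retries = Integer.parseInt(props.getProperty(RETRIES_KEY, "50").trim());
        long timeout = Long.parseLong(props.getProperty(TIMEOUT_KEY, "180000").trim());
        return new ConnectRetrySettings(retries, timeout);
    }
}
```

A deployment that can tolerate a longer outage might, for instance, set mysql.max_connect_retries=500 and mysql.max_connect_timeout_msec=1800000 in config.properties. These values are examples only; as noted in the comments above, retrying for too long risks exhausting memory or threads.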