Bug #45667: cry for help no longer works for mysqld errors < 1026
Submitted: 23 Jun 2009 0:40    Modified: 21 Jul 2009 14:01
Reporter: Sloan Childers
Status: Closed
Category: MySQL Enterprise Monitor: Server
Severity: S2 (Serious)
Version:    OS: Any
Assigned to: Sloan Childers
CPU Architecture: Any

[23 Jun 2009 0:40] Sloan Childers
Cry for help emails are no longer being sent for things like mysqld running out of connections or disk space. We used to alarm on any mysqld error code < 1026.

How to repeat:
Kill the mysqld internal repository while the application is running and agents are connected. Warning: don't hold your breath waiting for notification.

Suggested fix:
In testing simply killing the internal repository it looks like a few things are affected:

1) The UI (after waiting 50 retries * 1 second to reconnect) finally displays a stack dump to the screen. That seems like an excessive time for the UI to tell me it can't possibly work; an error message with a "please retry in 30 seconds" would be more appropriate.

2) The data purge thread fails and logs the exception, but does not currently possess a cry for help notification object. This code and the agent heartbeat code path are probably the best places for the hooks, since this code runs unattended.

3) When multiple connections are being attempted, I think we may be excessively logging all failures (50 retries each), based on how much is showing up in stdout in the debugger.

4) The SQLRuntimeException class that the cry for help filtering code uses is no longer thrown as of the switch to Hibernate (which points to a historical lack of a good way to test this code).

5) Since the UI now blocks for up to 50 retries * 1 second, we need to check what agent connections are doing. I'm not sure blocking a lot of agents for 50 seconds is going to leave us in a recoverable state.
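The bounded-retry behavior described in (1) and (5) can be sketched as follows. This is a minimal illustration, not the actual Service Manager code; the class and interface names are hypothetical. The idea is to keep the retry loop bounded and return a clear failure promptly instead of dumping a stack trace after a long silent wait.

```java
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical sketch of a bounded reconnect loop: retry up to a fixed
 * number of times with a fixed delay, then fail fast so the caller can
 * show a friendly "please retry" message instead of a stack dump.
 */
public class ReconnectPolicy {
    private final int maxRetries;
    private final long delayMillis;

    public ReconnectPolicy(int maxRetries, long delayMillis) {
        this.maxRetries = maxRetries;
        this.delayMillis = delayMillis;
    }

    /** Functional interface standing in for a repository connection attempt. */
    public interface ConnectionAttempt {
        boolean tryConnect();
    }

    /** Returns true on success, false once retries are exhausted or interrupted. */
    public boolean connectWithRetry(ConnectionAttempt attempt) {
        for (int i = 0; i < maxRetries; i++) {
            if (attempt.tryConnect()) {
                return true;
            }
            try {
                TimeUnit.MILLISECONDS.sleep(delayMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false; // interrupted: give up immediately
            }
        }
        return false; // exhausted: caller should surface a clear error now
    }
}
```

With a small `maxRetries`, a blocked agent or UI thread is released in a couple of seconds rather than the 50-second window described above.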

After a quick glance at the code, it looks like all HibernateException, SQLException, and SQLRuntimeException exceptions may now qualify for cry for help filtering, but I'm sure there are others.

It would be nice to find/author a tool to help us track down places in our code that try to catch exceptions that are no longer thrown.
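A sketch of the kind of exception classification discussed here, assuming the filter walks the cause chain and matches the three exception types named above. The helper class itself is illustrative (the class names come from the bug discussion, and the Hibernate types are matched by name so the sketch stands alone without Hibernate on the classpath):

```java
import java.sql.SQLException;

/**
 * Illustrative "cry for help" filter: an exception qualifies for
 * notification if it, or any cause in its chain, is one of the
 * database-related types mentioned in this bug.
 */
public class CryForHelpFilter {
    public static boolean qualifies(Throwable t) {
        for (Throwable cur = t; cur != null; cur = cur.getCause()) {
            String name = cur.getClass().getName();
            if (cur instanceof SQLException
                    || name.equals("org.hibernate.HibernateException")
                    || name.endsWith("SQLRuntimeException")) {
                return true; // database trouble anywhere in the chain
            }
        }
        return false;
    }
}
```

Walking the cause chain matters here: after the Hibernate switch, the raw SQLException is typically wrapped, so a filter that inspects only the top-level exception type silently stops matching, which is exactly the failure mode described in item (4).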
[23 Jun 2009 16:49] Enterprise Tools JIRA Robot
Gary Whizin writes: 
Per Leith: P2 for anyone who would normally hit these errors
[25 Jun 2009 6:08] Enterprise Tools JIRA Robot
Hudson Integration Agent User writes: 
Integrated in [ServiceManagerTrunk #799|https://repoman.mysql.com/hudson/job/ServiceManagerTrunk/799/]
     BUG#45667 JIRA  'cry for help no longer works for mysqld errors < 1026'
  In particular, add HibernateException type and consider 'critical'
[25 Jun 2009 6:14] Eric Herman
Thoughts on items listed above:

(4) "cry for help" now includes "HibernateException"; per conversation with Sloan, if it turns out there are some HibernateException which we should not notify on, we can narrow the field.

(1) The long delay in the UI may be an aspect of our retry policy; it is not immediately obvious to me how we might surface to the browser that we're in a retry state without an invasive change.

(2) Have not yet added a notify object to data purge - the main benefit would be to cry for help if no agents are pinging in. Let's revisit for 2.2.

(3) Excessive logging: cry for help is not (today) designed to throttle the number of times it notifies. If there are 100 agents and the DB goes down, then each of those 100 agents pinging in will likely send an email. Again, I propose we revisit this for 2.2.
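The throttling proposed for 2.2 could look roughly like this sketch. It is not the actual implementation; the class and method names are hypothetical. The idea is to collapse duplicate notifications (e.g. 100 agents reporting the same outage) into at most one email per event key per time window:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical notification throttle: allows at most one notification
 * per key (e.g. "repository-down") within a configurable window.
 */
public class NotificationThrottle {
    private final long windowMillis;
    private final Map<String, Long> lastSent = new ConcurrentHashMap<>();

    public NotificationThrottle(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    /** Returns true if a notification for this key should be sent now. */
    public boolean shouldSend(String key, long nowMillis) {
        Long prev = lastSent.get(key);
        if (prev != null && nowMillis - prev < windowMillis) {
            return false; // already notified within the window; suppress
        }
        lastSent.put(key, nowMillis);
        return true;
    }
}
```

Keying on the event type rather than the reporting agent is what collapses the hundred-agent flood down to a single email per window.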

(5) I have not looked at this yet, but I think "UI retries block agents for too long" should perhaps be split off into a separate issue.
[1 Jul 2009 16:07] Sloan Childers
Eric checked a patch into the 2.1.x tree that watches for Hibernate exceptions.
[2 Jul 2009 21:33] Marcos Palacios
Verified fixed in service manager build
[21 Jul 2009 14:01] Tony Bedford
An entry was added to the 2.1.0 changelog:

Cry for help emails were not sent for events such as the monitored server running out of connections or disk space. In the past these had been sent for any error code < 1026.