Description:
On a development Enterprise Monitor server the setup ran out of disk space. This is obviously the fault of the person who installed it (me). I resolved the problem by adding more disk space and restarting MEM. As might be expected, on startup the different agents are shown as being down, so there's a lot of "red" on the dashboard indicating issues. On startup the load on the dashboard is also very high, and on the server concerned it takes quite a while to get up and running again.
However, as the agents start to report in, it is very hard to follow the recovery process, that is, to see state changes: for example, that an agent which was reported down is now up, and the same for the "database" connections each agent manages.
This makes it hard to determine if MEM is really recovering or not. All you can see are a lot of errors and you have to wait for the screen to refresh and hope that the number of errors goes down.
How to repeat:
see above.
Suggested fix:
There are log files (visible from the MEM dashboard) but they cannot be easily queried to see this.
It would be nice to see some sort of time-based logging where, at a very high level, you can see if an agent's reachability status changes, or if the reachability of a monitored server changes. These 2 things are the most high-level indications of things working correctly, and it is comforting on a system that was not working to see it "recover", something like (very high overview):
<timestamp> server1 agent ..... : agent is now connected
<timestamp> server1 <mysql_instance_x> : connection reestablished.
<timestamp> server2 agent ..... : agent is now connected
"connected" / "connection [re]established" etc. may not be the correct way to express this, but what this allows you to see is whether the monitoring of these servers/MySQL instances is working correctly (or not).
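
To make the idea a bit more concrete, here is a minimal Python sketch of what I mean by logging only state changes. This is not MEM code and says nothing about how MEM works internally: the class name TransitionLog, its observe method, the message wording and the sample timestamp are all made up for illustration. The point is simply that repeated "still down" or "still up" checks produce no output, so the log shows the recovery and nothing else.

```python
from datetime import datetime, timezone


class TransitionLog:
    """Hypothetical helper: keep one line per reachability change,
    not one line per check."""

    def __init__(self):
        self._last = {}   # (host, component) -> last observed state
        self.lines = []   # formatted transition lines, oldest first

    def observe(self, when, host, component, reachable):
        """Record an observation; return the new line if the state changed."""
        key = (host, component)
        if self._last.get(key) == reachable:
            return None   # same state as before, nothing worth logging
        self._last[key] = reachable
        state = "now reachable" if reachable else "now unreachable"
        line = f"{when.isoformat()} {host} {component} : {state}"
        self.lines.append(line)
        return line


log = TransitionLog()
t = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)  # sample timestamp
log.observe(t, "server1", "agent", False)   # first sighting: logged
log.observe(t, "server1", "agent", False)   # unchanged: ignored
log.observe(t, "server1", "agent", True)    # recovered: logged
print("\n".join(log.lines))
```

The first observation of each agent or instance is treated as a change, so after a restart the log starts with whatever is down and then fills in with the recoveries as they happen.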