Bug #23943 Heat Chart too slow to show "down" server
Submitted: 3 Nov 2006 14:31 Modified: 14 Dec 2006 4:20
Reporter: Carsten Segieth
Status: Closed
Category:MySQL Enterprise Monitor: Server Severity:S2 (Serious)
Version:1.0.0 OS:Any (All)
Assigned to: Darren Oldag CPU Architecture:Any
Tags: heat chart, mer100 readme, up/down status

[3 Nov 2006 14:31] Carsten Segieth
Description:
It takes too long for the dashboard to show the red bullet for a 'dead' server.

In my tests I had an average time of 2:15 min from stopping the server until the red bullet is shown.

The agent is running with heartbeat of 10 sec.
The dashboard screen refresh rate is 15 sec.
I have 3 agents connected, so there is not much load on the server.

It took a similar (slightly longer) time for the dashboard to notice that the server was up again.

How to repeat:
- start an agent
- let dashboard run in auto-refresh mode with 15 sec
- stop the monitored MySQL service
- check the time until the server is marked 'dead'

Suggested fix:
A dead server should be marked 'dead' sooner (Jan said today on IRC that 1 min is the limit).
[3 Nov 2006 14:45] Carsten Segieth
agent log (log-level = message)

Attachment: mysql-service-agent.3307.zip (application/zip, text), 23.63 KiB.

[3 Nov 2006 14:46] Carsten Segieth
Here are the times; the corresponding agent log is already attached:

14:30:00 stopped the agent (by mistake, but also a good test)
14:31:56 dashboard shows dead agent
   -----
    1:56 min

14:32:15 start agent
14:32:32 dashboard shows running agent
   -----
    0:17 min --> excellent time
    
14:33:00 stopped the MySQL server
14:35:15 dashboard shows dead server
   -----
    2:15 min
    
14:35:50 started the MySQL server
14:38:19 dashboard shows running server
   -----
    2:29 min
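
The intervals above can be re-derived with a short script. This is only a sketch for checking the arithmetic; the helper name `elapsed` is made up for illustration and is not part of the product.

```python
from datetime import datetime

def elapsed(start: str, end: str) -> str:
    """Return the time between two same-day HH:MM:SS stamps as M:SS."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    minutes, seconds = divmod(int(delta.total_seconds()), 60)
    return f"{minutes}:{seconds:02d}"

# (event, action time, time the dashboard reflected it), copied from the list above
observations = [
    ("agent stopped", "14:30:00", "14:31:56"),
    ("agent started", "14:32:15", "14:32:32"),
    ("server stopped", "14:33:00", "14:35:15"),
    ("server started", "14:35:50", "14:38:19"),
]

for event, acted, shown in observations:
    print(f"{event}: {elapsed(acted, shown)} min")
```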

14:39:59 apache, tomcat, mysql log files and dumps saved, see https://intranet.mysql.com/~csegieth/merlin/VMXP2_*_2006-11-03-14.39.59,59.zip
[11 Dec 2006 16:45] Darren Oldag
the architecture of the advisor evaluation still had some cruft from when it was non-"datum callback" based.

fixes:

1) always update when a new datum is received.  then, check frequency and consistency to evaluate on THIS callback.
2) base the eval time on the Datum timestamp that caused the evaluation.  There were 'missing' evaluations due to the uncertainty of thread scheduling and using the system time as the evaluation time.

putting in these fixes means the FIRST time the server gets a down indication from the agent, it will evaluate as such and show the server down.  likewise, it will do the same for server up.  there is the potential to speed up the server.* status data collections, too, but right now it is still one minute.
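
To illustrate the two fixes, here is a minimal sketch of a datum-callback evaluator. It is not the actual advisor code: the names `Datum`, `Evaluator`, `on_datum` and `frequency` are hypothetical, and the consistency check is left out, so only the frequency check is modeled.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Datum:
    timestamp: float   # collection time carried by the datum, in seconds
    server_up: bool    # what the agent reported

@dataclass
class Evaluator:
    frequency: float                        # minimum seconds between evaluations
    latest: Optional[Datum] = None
    last_eval_time: Optional[float] = None
    status: Optional[str] = None

    def on_datum(self, datum: Datum) -> bool:
        """Handle one datum callback; return True if an evaluation ran."""
        # Fix 1: always store the new datum before deciding anything.
        self.latest = datum
        # Fix 2: the evaluation time is the datum's timestamp, not the
        # system clock, so thread scheduling jitter cannot skip a slot.
        eval_time = datum.timestamp
        due = (self.last_eval_time is None
               or eval_time - self.last_eval_time >= self.frequency)
        if not due:
            return False
        self.last_eval_time = eval_time
        self.status = "up" if datum.server_up else "down"
        return True
```

With a 60 second frequency, a "down" datum arriving exactly one interval after the previous evaluation is evaluated on that same callback, regardless of when the handling thread actually got scheduled.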

NOTE:  additional fix (for grins)

honor the 'active' column from AgentMonitoring so a properly shut down AGENT will register down instantaneously.  it falls back to checking heartbeat interval just in case the agent disappeared in a non-normal fashion.  i had implemented this fix before in another patch, but that patch was never approved because it was mixed with something else.  but, this seems like the proper place and time to do it.
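
The agent check described here could look roughly like the sketch below. Only the 'active' flag and the heartbeat fallback come from the description above; the function name, its parameters, and the `grace` multiplier are assumptions made for illustration.

```python
def agent_is_up(active: bool, last_heartbeat: float, now: float,
                heartbeat_interval: float, grace: float = 2.0) -> bool:
    """Decide whether an agent should be shown as up.

    active             -- value of the 'active' flag stored for the agent
    last_heartbeat     -- time of the last heartbeat received, in seconds
    now                -- current time, in seconds
    heartbeat_interval -- configured heartbeat interval, in seconds
    grace              -- hypothetical tolerance factor for late heartbeats
    """
    # A cleanly shut down agent clears its flag, so it is down immediately.
    if not active:
        return False
    # Fallback: the agent vanished without clearing the flag, so judge it
    # by how long ago the last heartbeat arrived.
    return now - last_heartbeat <= heartbeat_interval * grace
```

With the 10 second heartbeat from the report, a cleanly stopped agent is reported down at once, while a crashed agent is reported down once no heartbeat has arrived within the tolerated window.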
[14 Dec 2006 4:20] Bill Weber
Verified fixed in build 1.0.1.4391.