Bug #52726 After Agent start, dashboard shows mysqld as down on large deployments for hours
Submitted: 9 Apr 2010 16:52    Modified: 28 May 2010 15:54
Reporter: Diego Medina         Email Updates:
Status: Duplicate              Impact on me: None
Category: MySQL Enterprise Monitor: Server    Severity: S3 (Non-critical)
Version: 2.2.0.1687            OS: Any
Assigned to: Darren Oldag      CPU Architecture: Any

[9 Apr 2010 16:52] Diego Medina
Description:
I am running a deployment that monitors 250 servers. If I start 10 agents at a time, everything goes well, but if I start 50 agents at once, some of them show up on the dashboard with a line across them, meaning the mysqld instances are down (which is not true).

The agents do connect to the mysqld servers.

How to repeat:
1- Monitor 50 mysqld servers
2- Start 50 more agents all at once (see the sketch below)
3- Boom: at least 20 will show as offline for hours (maybe days; too soon to tell)
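
For reference, a minimal sketch of what "start them all at once" looks like on my side (the sandbox layout and start script name are assumptions; adjust to whatever your deployment uses):

{noformat}
#!/bin/sh
# Hypothetical sketch: start agents 51-100 in parallel from per-agent sandboxes.
# Each sandbox is assumed to have its own start script; path and name will vary.
for i in $(seq 51 100); do
    ( cd sandboxes/Simple/agent$i && ./start ) &
done
wait   # wait for all the background starts to return
{noformat}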
[9 Apr 2010 17:38] Diego Medina
If you wonder why anyone would start 50 agents at once, consider customers who deploy all their agents using tools like Puppet, deploying and starting them all in parallel.
[9 Apr 2010 21:26] Enterprise Tools JIRA Robot
Josh Sled writes: 
agents: asator01
server: tyr56.norway
[12 Apr 2010 18:49] Enterprise Tools JIRA Robot
Diego Medina writes: 
Back at me so that I can do two things:

1- Try to reduce the number of agents it takes to reproduce this bug
2- Once 1 is completed, get the agent log at debug level, as well as the service manager's log, and keep everything running so that the developers can look at it.
[15 Apr 2010 18:49] Enterprise Tools JIRA Robot
Diego Medina writes: 
More things I found out:

One of the servers that showed as down had only these entries:

{noformat}
$ cat sandboxes/Simple/agent100/chassis.log | ggrep reach
     <attribName>server.reachable</attribName>
2010-04-15 20:32:32: (critical) network-io.c:268: curl_easy_perform('https://agent:mysql@tyr56.norway.sun.com:28443/heartbeat';) failed: Operation timed out after 120000 milliseconds with 0 bytes received (curl-error = 'Timeout was reached' (28))
{noformat}

And after I clicked "Refresh Inventory" in the UI -> Settings -> Manage Servers -> server name popup, I saw new entries in the log:

{noformat}
     <attribName>server.reachable</attribName>
            <attribute><![CDATA[server.reachable]]></attribute>
            <attribute><![CDATA[server.reachable]]></attribute>
2010-04-15 20:44:50: (debug) scheduler.c.524: scheduling collect_mysql for mysql::server->server.reachable
      <attribute>server.reachable</attribute>
      <attribute>server.reachable</attribute>
      <attribute>server.reachable</attribute>
{noformat}

and this server now shows as online.
[15 Apr 2010 18:53] Enterprise Tools JIRA Robot
Diego Medina writes: 
Agent log in debug
[15 Apr 2010 18:53] Enterprise Tools JIRA Robot


Attachment: 10341_chassis.log.gz (application/x-gzip, text), 172.27 KiB.

[20 Apr 2010 2:45] Enterprise Tools JIRA Robot
Diego Medina writes: 
Agent 2.1.2.1160 and dashboard 2.2 show the same problem
[20 Apr 2010 3:56] Enterprise Tools JIRA Robot
Diego Medina writes: 
Agent 2.1.2.1160 and dashboard 2.1.2.1158 do *not* show this problem

All 74 agents fully check in within 3-4 minutes, and there is no "page load time of 2 minutes" while all the agents are checking in (which means no timeout for the agents, which means everyone is happy :)
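
A rough way to see whether the dashboard is the bottleneck is to time the same heartbeat URL the agent uses while the agents are checking in (host, port, and credentials below are the ones from my log; -k is only needed if the service manager uses a self-signed certificate). The agent POSTs to this URL, so a plain GET may just come back with an error status, but the response time still gives an idea of how loaded the dashboard is; the agent gives up after 120000 ms, so anything close to that is trouble:

{noformat}
curl -k -o /dev/null -s -w 'HTTP %{http_code} in %{time_total}s\n' \
    'https://agent:mysql@tyr56.norway.sun.com:28443/heartbeat'
{noformat}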
[23 Apr 2010 19:26] Enterprise Tools JIRA Robot
Diego Medina writes: 
Disregard what I wrote about the 2.1.2 tests. It turns out I need more than just 74 servers checking in to reproduce this bug. I am working on getting the exact steps.
I do have a 2.2 dashboard that reproduces the problem; out of those 74 servers, 4 are under heavy quan (Query Analyzer) load: service manager quan on one, SugarCRM on another server, and a test app sending 500 canonical queries per minute (a rough sketch of such a load generator follows).
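
The "test app" is nothing fancy; a throwaway loop along these lines generates distinct canonical queries at roughly that rate (host, port, credentials, and table are placeholders, and the traffic has to go through the agent's proxy port for Query Analyzer to see it):

{noformat}
#!/bin/sh
# Hypothetical load generator: roughly 500 canonical queries per minute.
i=0
while true; do
    i=$((i + 1))
    # Vary the alias, not just a literal, so each statement normalizes to a
    # different canonical query instead of collapsing into a single one.
    mysql -h 127.0.0.1 -P 6446 -u test -ptest test \
        -e "SELECT COUNT(*) AS sample_$i FROM t" >/dev/null 2>&1
    sleep 0.12   # about 500 statements per minute, ignoring client overhead
done
{noformat}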