Bug #52726 After Agent start, dashboard shows mysqld as down on large deployments for hours
Submitted: 9 Apr 2010 16:52    Modified: 28 May 2010 15:54
Reporter: Diego Medina         Email Updates:
Status: Duplicate              Impact on me: None
Category: MySQL Enterprise Monitor: Server    Severity: S3 (Non-critical)
Version: 2.2.0.1687            OS: Any
Assigned to: Darren Oldag      CPU Architecture: Any

[9 Apr 2010 16:52] Diego Medina
Description:
I am running a deployment that monitors 250 servers. If I start 10 agents at a time, everything goes well, but if I start 50 agents at once, some of them show up on the dashboard with a line across them, meaning the mysqld instances are down (which is not true).

The agents do connect to the mysqld servers.

How to repeat:
1- Monitor 50 mysqld servers
2- Start 50 more agents all at once (see the sketch below)
3- Boom: at least 20 will show as offline for hours (maybe days; too soon to tell)
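
For reference, a minimal sketch of what "start them all at once" looks like on my side (the sandbox layout and start script name are assumptions; adjust to whatever your deployment uses):

{noformat}
#!/bin/sh
# Hypothetical sketch: start agents 51-100 in parallel from per-agent sandboxes.
# Each sandbox is assumed to have its own start script; path and name will vary.
for i in $(seq 51 100); do
    ( cd sandboxes/Simple/agent$i && ./start ) &
done
wait   # wait for all the background starts to return
{noformat}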
[9 Apr 2010 17:38] Diego Medina
If you wonder why anyone would start 50 agents at once, consider customers who deploy all their agents using tools like Puppet, deploying and starting them all in parallel.
[9 Apr 2010 21:26] Enterprise Tools JIRA Robot
Josh Sled writes: 
agents: asator01
server: tyr56.norway
[12 Apr 2010 18:49] Enterprise Tools JIRA Robot
Diego Medina writes: 
Back at me so that I can do two things:

1- Try to reduce the number of agents it takes to reproduce this bug
2- Once 1 is completed, get the agent log at debug level, as well as the service manager's log, and keep everything running so that the developers can look at it.
[15 Apr 2010 18:49] Enterprise Tools JIRA Robot
Diego Medina writes: 
More things I found out:

One of the servers that showed as down had only these entries:

{noformat}
$ cat sandboxes/Simple/agent100/chassis.log | ggrep reach
     <attribName>server.reachable</attribName>
2010-04-15 20:32:32: (critical) network-io.c:268: curl_easy_perform('https://agent:mysql@tyr56.norway.sun.com:28443/heartbeat';) failed: Operation timed out after 120000 milliseconds with 0 bytes received (curl-error = 'Timeout was reached' (28))
{noformat}

And after I clicked "Refresh Inventory" in the UI -> Settings -> Manage Servers -> server name popup, I saw new entries in the log:

{noformat}
     <attribName>server.reachable</attribName>
            <attribute><![CDATA[server.reachable]]></attribute>
            <attribute><![CDATA[server.reachable]]></attribute>
2010-04-15 20:44:50: (debug) scheduler.c.524: scheduling collect_mysql for mysql::server->server.reachable
      <attribute>server.reachable</attribute>
      <attribute>server.reachable</attribute>
      <attribute>server.reachable</attribute>
{noformat}

and this server now shows as online.
[15 Apr 2010 18:53] Enterprise Tools JIRA Robot
Diego Medina writes: 
Agent log in debug
[15 Apr 2010 18:53] Enterprise Tools JIRA Robot


Attachment: 10341_chassis.log.gz (application/x-gzip, text), 172.27 KiB.

[20 Apr 2010 2:45] Enterprise Tools JIRA Robot
Diego Medina writes: 
Agent 2.1.2.1160 and dashboard 2.2 show the same problem
[20 Apr 2010 3:56] Enterprise Tools JIRA Robot
Diego Medina writes: 
Agent 2.1.2.1160 and dashboard 2.1.2.1158 do *not* show this problem

All 74 agents fully check in within 3-4 minutes, and there is no "page load time of 2 minutes" while all the agents are checking in (which means no timeout for the agents, which means everyone is happy :)
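
A rough way to see whether the dashboard is the bottleneck is to time the same heartbeat URL the agent uses while the agents are checking in (host, port, and credentials below are the ones from my log; -k is only needed if the service manager uses a self-signed certificate). The agent POSTs to this URL, so a plain GET may just come back with an error status, but the response time still gives an idea of how loaded the dashboard is; the agent gives up after 120000 ms, so anything close to that is trouble:

{noformat}
curl -k -o /dev/null -s -w 'HTTP %{http_code} in %{time_total}s\n' \
    'https://agent:mysql@tyr56.norway.sun.com:28443/heartbeat'
{noformat}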
[23 Apr 2010 19:26] Enterprise Tools JIRA Robot
Diego Medina writes: 
Disregard what I wrote about the 2.1.2 tests. It turns out I need more than just 74 servers checking in to reproduce this bug. I am working on getting the exact steps.
I do have a 2.2 dashboard that reproduces the problem; out of those 74 servers, 4 are under heavy quan (Query Analyzer) load: service manager quan on one, SugarCRM on another server, and a test app sending 500 canonical queries per minute (a rough sketch of such a load generator follows).
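
The "test app" is nothing fancy; a throwaway loop along these lines generates distinct canonical queries at roughly that rate (host, port, credentials, and table are placeholders, and the traffic has to go through the agent's proxy port for Query Analyzer to see it):

{noformat}
#!/bin/sh
# Hypothetical load generator: roughly 500 canonical queries per minute.
i=0
while true; do
    i=$((i + 1))
    # Vary the alias, not just a literal, so each statement normalizes to a
    # different canonical query instead of collapsing into a single one.
    mysql -h 127.0.0.1 -P 6446 -u test -ptest test \
        -e "SELECT COUNT(*) AS sample_$i FROM t" >/dev/null 2>&1
    sleep 0.12   # about 500 statements per minute, ignoring client overhead
done
{noformat}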