Bug #52952 If Agent gets a timeout on initial checkin, it will not retry
Submitted: 19 Apr 2010 18:45
Modified: 17 Aug 2010 10:42
Reporter: Diego Medina
Status: Closed
Category: MySQL Enterprise Monitor: Agent
Severity: S2 (Serious)
Version: 2.2.0.1695
OS: Any
Assigned to: Darren Oldag
CPU Architecture: Any

[19 Apr 2010 18:45] Diego Medina
Description:
If the service manager is busy (or the network between the agent and the service manager is busy) while the agent does the initial checkin, the agent could get a timeout, and it will not retry sending that packet.

What's the problem with that? One of the many side effects is that even though the mysqld the agent is monitoring is up, the dashboard will show that instance as down, because server.reachable is not being sent by the agent.
This happens because the service manager took longer than the agent timeout to process the "heartbeat?", so the agent and the service manager end up in different states.

For more information, see bug #52726, "After Agent start, dashboard shows mysqld as down on large deployments for hours":

http://bugs.mysql.com/bug.php?id=52726

How to repeat:
It is a little hard to reproduce, but basically do this:

1- Install and start the service manager
2- Start many agents one after the other (on my large server, 74 agents are enough; you may need more or fewer agents)
3- Notice that some instances will show as down for at least 12 hours (until the next re-inventory)

Suggested fix:
Do a retry?
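
One way to read the suggestion, as a minimal sketch only: retry the check-in a bounded number of times, with a growing delay between attempts. Everything named here is hypothetical and not taken from the agent source: `checkin_with_retry`, the `checkin_fn` and `sleep_fn` callback types, and the `flaky_server` stub that stands in for a busy service manager.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback: returns true if the check-in was acknowledged,
 * false if it timed out. */
typedef bool (*checkin_fn)(void *ctx);

/* Hypothetical sleep hook, so the wait can be skipped (NULL) in tests. */
typedef void (*sleep_fn)(unsigned seconds);

/* Try the check-in up to max_attempts times, doubling the delay between
 * attempts and capping it at max_delay. Returns the number of the attempt
 * that succeeded, or 0 if every attempt timed out. */
static unsigned checkin_with_retry(checkin_fn send, void *ctx, sleep_fn wait,
                                   unsigned max_attempts,
                                   unsigned first_delay, unsigned max_delay) {
    assert(send != NULL && max_attempts > 0);
    unsigned delay = first_delay < max_delay ? first_delay : max_delay;
    for (unsigned attempt = 1; attempt <= max_attempts; attempt++) {
        if (send(ctx)) {
            return attempt;
        }
        if (attempt == max_attempts) {
            break; /* no point in sleeping after the last attempt */
        }
        if (wait != NULL) {
            wait(delay);
        }
        /* Double the delay without overflowing past max_delay. */
        delay = (delay > max_delay / 2) ? max_delay : delay * 2;
    }
    return 0;
}

/* Stub for illustration: a service manager that times out a fixed number
 * of times before it answers. */
struct flaky_server {
    unsigned failures_left;
    unsigned calls;
};

static bool flaky_send(void *ctx) {
    struct flaky_server *s = ctx;
    s->calls++;
    if (s->failures_left > 0) {
        s->failures_left--;
        return false;
    }
    return true;
}
```

With a stub that times out twice, `checkin_with_retry(flaky_send, &s, NULL, 5, 1, 8)` succeeds on the third attempt. A retry alone does not say what the service manager should do with a check-in it has already half processed, so this only covers the agent side.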
[19 Apr 2010 18:49] Enterprise Tools JIRA Robot
Diego Medina writes: 
The server.reachable started to be sent only after I went to the UI and forced a re-inventory.
[19 Apr 2010 18:49] Enterprise Tools JIRA Robot


Attachment: 10350_chassis.log.gz (application/x-gzip, text), 172.27 KiB.

[27 May 2010 23:19] Enterprise Tools JIRA Robot
Darren Oldag writes: 
EM-4279 is pretty much a duplicate of this bug, not just 'related'
[28 May 2010 14:30] Enterprise Tools JIRA Robot
Jan Kneschke writes: 
* if the agent closes its connection after the 120sec timeout, doesn't the server get an exception when it tries to write? Could it be handled on the server side, by forcing a resync?
* the patch does its job, just some cosmetics:

... static int network_xml_parse_tasks(xmlNode *agentNode, GAsyncQueue *rcvq, const GString *agent_id, struct network_io_config_t *io_config) {

Instead of passing the struct down, only pass the GString * task_sequence down OR don't pass the 'const GString *agent_id' down and take it in the function from the struct. I prefer the 1st.

* As this is a task-sequence, we should actually check it: did it change since the last one by 0 or 1? If not (did it decrement? did it fast forward?), it is an error which should be handled with a resync.
* for that we need to know what kind of integer it is, to know when it wraps.
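
The sequence check described above could look roughly like the following sketch. It assumes the task sequence is an unsigned 32-bit counter, which is exactly the open question in the last point and is not confirmed anywhere in this thread. The names `seq_verdict` and `classify_task_sequence` are made up for illustration.

```c
#include <stdint.h>

/* Possible outcomes when comparing a new task sequence to the last one. */
enum seq_verdict {
    SEQ_REPEAT, /* unchanged: same sequence seen again */
    SEQ_NEXT,   /* advanced by exactly one */
    SEQ_RESYNC  /* went backwards or jumped ahead: force a resync */
};

/* ASSUMPTION: the sequence is a uint32_t. Unsigned subtraction is done
 * modulo 2^32, so a wrap from UINT32_MAX to 0 still yields a delta of 1. */
static enum seq_verdict classify_task_sequence(uint32_t last, uint32_t current) {
    uint32_t delta = current - last;
    if (delta == 0) {
        return SEQ_REPEAT;
    }
    if (delta == 1) {
        return SEQ_NEXT;
    }
    return SEQ_RESYNC;
}
```

If the counter turns out to be signed or of a different width, the wraparound case has to be handled differently, since signed overflow is undefined behaviour in C.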
[28 May 2010 19:09] Enterprise Tools JIRA Robot
Darren Oldag writes: 
revision-id: oldag@mysql.com-20100528190237-vv0vzasy81q1xtqe
parent: marcos.palacios@sun.com-20100527212843-rwerh308fxtar1dl
committer: Darren L. Oldag <oldag@mysql.com>
branch nick: Monitor22
timestamp: Fri 2010-05-28 14:02:37 -0500

revision-id: oldag@mysql.com-20100528185749-vo5lth6zngrv3olx
parent: michael.schuster@oracle.com-20100527111422-rnmpp5mf1twhokj3
committer: Darren L. Oldag <oldag@mysql.com>
branch nick: Agent22
timestamp: Fri 2010-05-28 13:57:49 -0500
[7 Jun 2010 23:30] Enterprise Tools JIRA Robot
Andy Bang writes: 
In build 2.2.2.1722.
[1 Jul 2010 13:44] Enterprise Tools JIRA Robot
Diego Medina writes: 
It has been very hard to reproduce, so we are closing this bug as resolved but note that it may come back.

If a customer seems to have this issue, make sure they are using both the agent and the service manager with the fix (it requires both components to be updated).
[5 Jul 2010 7:23] MC Brown
A note has been added to the 2.2.2 changelog: 

        If a &merlin_agent; got a timeout during the initial checkin
        with &merlin_server; (for instance, if &merlin_server; was
        busy), it would fail to resynchronize properly and show the
        monitored MySQL instances as down.
[29 Jul 2010 23:23] Enterprise Tools JIRA Robot
Andy Bang writes: 
In build 2.2.3.1734.
[16 Aug 2010 15:04] Enterprise Tools JIRA Robot
Diego Medina writes: 
Verified fixed in 2.2.3.1734.
[17 Aug 2010 10:42] MC Brown
A note has been added to the 2.2.3 changelog: 

        check-in by &merlin_agent;, the monitored instance could be
        identified as down.