Bug #45382 Agent loops in resynchronize task (race condition)
Submitted: 8 Jun 2009 20:29 Modified: 14 Jan 2010 14:55
Reporter: Diego Medina Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Enterprise Monitor: Agent Severity:S3 (Non-critical)
Version:2.0.5, 2.1.0.1059 OS:Any
Assigned to: Kay Roepke CPU Architecture:Any
Tags: windmill

[8 Jun 2009 20:29] Diego Medina
Description:
Note: *This is for documenting purposes, we are yet to find a reliable way to reproduce this bug, but we have seen it a few (several?) times*

Also see bug#40729

* There is a situation where the agent reconnects to the dashboard, sends a resync task, gets a new agent-id (not to confuse with agent-uuid), this is more like a session id. And it also gets a "resync" task (<taskId>-1</taskId>)

* if a heartbeat comes in between, that triggers a new resync.

And again, the agent sends a new rezync task, gets a new agent-id, and again in the loop, until it eventually goes to normal.

From the outside, you may notice how some mysql instances can take a long time to show on the UI as online (you see them on the UI with a line across the name)

How to repeat:
Because this is a timing issue, we do not have a way to reproduce it, but more or less you can try:

1- Install and start the service manager and agent
2- Let discovery happen (so that the mysqld instance is on the UI)
3- Restart tomcat (the service manager)
4- if you are looking at the agent logs, you can do something like this:

$ tail -n200  -f /Applications/mysql/enterprise/2.1/agent/mysql-monitor-agent.log  | grep setting

and you will see something like:

(debug) network-xml.c:527: resetting agent-id to 1244489824410.2 (was -1)
(debug) network-xml.c:527: resetting agent-id to 1244489830410.3 (was 1244489824410.2)
(debug) network-xml.c:527: resetting agent-id to 1244489836397.4 (was 1244489830410.3)
(debug) network-xml.c:527: resetting agent-id to 1244489842395.5 (was 1244489836397.4)
(debug) network-xml.c:527: resetting agent-id to 1244489848387.6 (was 1244489842395.5)
(debug) network-xml.c:527: resetting agent-id to 1244489854387.7 (was 1244489848387.6)
(debug) network-xml.c:527: resetting agent-id to 1244489860388.8 (was 1244489854387.7)
(debug) network-xml.c:527: resetting agent-id to 1244489866405.9 (was 1244489860388.8)
(debug) network-xml.c:527: resetting agent-id to 1244489872395.10 (was 1244489866405.9)
(debug) network-xml.c:527: resetting agent-id to 1244489878400.11 (was 1244489872395.10)
(debug) network-xml.c:527: resetting agent-id to 1244489884404.12 (was 1244489878400.11)
(debug) network-xml.c:527: resetting agent-id to 1244489890405.13 (was 1244489884404.12)

Notice the new agent-id many times per second.
[8 Jun 2009 20:31] Diego Medina
This is the debug log, clean, from the time the agent started, it went straight into the loop. And it stayed in the loop until I

Attachment: rezync loop log from start - good one.gz (application/x-gzip, text), 112.99 KiB.

[8 Jun 2009 20:34] Diego Medina
Last comment was supposed to be:

This is the debug log, clean, from the time the agent started, it went straight into the
loop. And it stayed in the loop until I killed the agent.
[12 Sep 2009 2:30] Donna Harmon
I ran several tests with the following configuration and was able to reproduce, 4 of 5 times, the problem when restarting the monitor, where the servers do not display in the dashboard for over 15 minutes while the agents display almost immediately:

Ubuntu VM
Monitor 2.0.6.71
Agent 2.0.6.7159
Single Agent - 31 5.1.35 instances monitored

(credit for Darren Oldag for spotting these entries in the logs and below input)
The agent gets into a state where it is not honoring the 'resync protocol' with the MEM server.

in the logs, i see several situations like this:

2009-09-08 13:12:01: (debug) network-io.c:1097: skipping heartbeat without resync response, resync_state RESYNC_NEED_AGENT_ID
2009-09-08 13:12:34: (debug) last message repeated 10 times

[...repeat...]

2009-09-08 14:29:37: (debug) last message repeated 10 times
2009-09-08 14:29:38: (debug) network-io.c:1097: skipping heartbeat without resync response, resync_state RESYNC_NEED_AGENT_ID

and/or ...

2009-09-06 15:20:40: (critical) scheduler.c:778: in list_known_data_items state
2009-09-06 15:20:49: (critical) last message repeated 101 times

[...repeat...]

2009-09-06 15:30:07: (critical) scheduler.c:778: in list_known_data_items state
2009-09-06 15:30:38: (critical) last message repeated 66 times

until the agent gets out of the resync state properly, the server.reachable collection cannot be retrieved, and will not mark any of the mysqld instances as 'up'.
[18 Sep 2009 12:13] Andrii Nikitin
Here somewhat different error messages followed by 4min gap in all graphs:

2009-09-17 13:17:04: (critical) exception received from server: Internal error: known items are required for list_instances processing
2009-09-17 13:17:49: (critical) G:\bs\bs\merlin\agent-2.0\src\mysql-proxy-0.7.0r1489_20090811_1738\plugins\agent\network-io.c:841: starting task 6 for mysql::server[418fdade-6547-4ccd-9556-c85cedfbc99a]

According to Leith this is most probably the same problem
[20 Nov 2009 22:42] Enterprise Tools JIRA Robot
Sloan Childers writes: 
available in 2.1 build 1109
[18 Dec 2009 21:11] Enterprise Tools JIRA Robot
Diego Medina writes: 
Verified fixed on 2.1.1.1132
[14 Jan 2010 14:55] MC Brown
A note has been added to the 2.1.1 changelog: 

 The agent could get into a state where it loop through the resynchronization of the core data, without reporting information to the service manager, causing gaps in the data.
[1 Feb 2010 19:44] Enterprise Tools JIRA Robot
Diego Medina writes: 
Agent 2.2.0.1611 still has the fix.
[1 Feb 2010 19:56] Enterprise Tools JIRA Robot
Diego Medina writes: 
Agent 2.1.1.1142 still has the fix