Bug #45382 Agent loops in resynchronize task (race condition)
Submitted: 8 Jun 22:29 Modified: 20 Nov 23:42
Reporter: Diego Medina
Status: QA testing
Category:Monitoring: Agent Severity:S3 (Non-critical)
Version:2.0.5, 2.1.0.1059 OS:Any
Assigned to: Kay Roepke Target Version:
Triage: Needs Triage: R4 (High) / E4 (High)

[8 Jun 22:29] Diego Medina
Description:
Note: *This is for documenting purposes, we are yet to find a reliable way to reproduce
this bug, but we have seen it a few (several?) times*

Also see bug#40729

* There is a situation where the agent reconnects to the dashboard, sends a resync task,
gets a new agent-id (not to confuse with agent-uuid), this is more like a session id. And
it also gets a "resync" task (<taskId>-1</taskId>)

* if a heartbeat comes in between, that triggers a new resync.

And again, the agent sends a new rezync task, gets a new agent-id, and again in the loop,
until it eventually goes to normal.

From the outside, you may notice how some mysql instances can take a long time to show on
the UI as online (you see them on the UI with a line across the name)

How to repeat:
Because this is a timing issue, we do not have a way to reproduce it, but more or less
you can try:

1- Install and start the service manager and agent
2- Let discovery happen (so that the mysqld instance is on the UI)
3- Restart tomcat (the service manager)
4- if you are looking at the agent logs, you can do something like this:

$ tail -n200  -f /Applications/mysql/enterprise/2.1/agent/mysql-monitor-agent.log  | grep
setting

and you will see something like:

(debug) network-xml.c:527: resetting agent-id to 1244489824410.2 (was -1)
(debug) network-xml.c:527: resetting agent-id to 1244489830410.3 (was 1244489824410.2)
(debug) network-xml.c:527: resetting agent-id to 1244489836397.4 (was 1244489830410.3)
(debug) network-xml.c:527: resetting agent-id to 1244489842395.5 (was 1244489836397.4)
(debug) network-xml.c:527: resetting agent-id to 1244489848387.6 (was 1244489842395.5)
(debug) network-xml.c:527: resetting agent-id to 1244489854387.7 (was 1244489848387.6)
(debug) network-xml.c:527: resetting agent-id to 1244489860388.8 (was 1244489854387.7)
(debug) network-xml.c:527: resetting agent-id to 1244489866405.9 (was 1244489860388.8)
(debug) network-xml.c:527: resetting agent-id to 1244489872395.10 (was 1244489866405.9)
(debug) network-xml.c:527: resetting agent-id to 1244489878400.11 (was 1244489872395.10)
(debug) network-xml.c:527: resetting agent-id to 1244489884404.12 (was 1244489878400.11)
(debug) network-xml.c:527: resetting agent-id to 1244489890405.13 (was 1244489884404.12)

Notice the new agent-id many times per second.
[8 Jun 22:31] Diego Medina
This is the debug log, clean, from the time the agent started, it went straight into the
loop. And it stayed in the loop until I

Attachment: rezync loop log from start - good one.gz (application/x-gzip, text), 112.99 KiB.

[8 Jun 22:34] Diego Medina
Last comment was supposed to be:

This is the debug log, clean, from the time the agent started, it went straight into the
loop. And it stayed in the loop until I killed the agent.
[12 Sep 4:30] Donna Harmon
I ran several tests with the following configuration and was able to reproduce, 4 of 5
times, the problem when restarting the monitor, where the servers do not display in the
dashboard for over 15 minutes while the agents display almost immediately:

Ubuntu VM
Monitor 2.0.6.71
Agent 2.0.6.7159
Single Agent - 31 5.1.35 instances monitored

(credit for Darren Oldag for spotting these entries in the logs and below input)
The agent gets into a state where it is not honoring the 'resync protocol' with the MEM
server.

in the logs, i see several situations like this:

2009-09-08 13:12:01: (debug) network-io.c:1097: skipping heartbeat without resync
response, resync_state RESYNC_NEED_AGENT_ID
2009-09-08 13:12:34: (debug) last message repeated 10 times

[...repeat...]

2009-09-08 14:29:37: (debug) last message repeated 10 times
2009-09-08 14:29:38: (debug) network-io.c:1097: skipping heartbeat without resync
response, resync_state RESYNC_NEED_AGENT_ID

and/or ...

2009-09-06 15:20:40: (critical) scheduler.c:778: in list_known_data_items state
2009-09-06 15:20:49: (critical) last message repeated 101 times

[...repeat...]

2009-09-06 15:30:07: (critical) scheduler.c:778: in list_known_data_items state
2009-09-06 15:30:38: (critical) last message repeated 66 times

until the agent gets out of the resync state properly, the server.reachable collection
cannot be retrieved, and will not mark any of the mysqld instances as 'up'.
[18 Sep 14:13] Andrii Nikitin
Here somewhat different error messages followed by 4min gap in all graphs:

2009-09-17 13:17:04: (critical) exception received from server: Internal error: known
items are required for list_instances processing
2009-09-17 13:17:49: (critical)
G:\bs\bs\merlin\agent-2.0\src\mysql-proxy-0.7.0r1489_20090811_1738\plugins\agent\network-io.c:841:
starting task 6 for mysql::server[418fdade-6547-4ccd-9556-c85cedfbc99a]

According to Leith this is most probably the same problem
[20 Nov 23:42] Enterprise Tools JIRA Robot
Sloan Childers writes: 
available in 2.1 build 1109