MySQL Bugs: #45382: Agent loops in resynchronize task (race condition)

Bug #45382	Agent loops in resynchronize task (race condition)
Submitted:	8 Jun 2009 20:29	Modified:	14 Jan 2010 14:55
Reporter:	Diego Medina	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Enterprise Monitor: Agent	Severity:	S3 (Non-critical)
Version:	2.0.5, 2.1.0.1059	OS:	Any
Assigned to:	Kay Roepke	CPU Architecture:	Any
Tags:	windmill

Description:
Note: *This is for documenting purposes, we are yet to find a reliable way to reproduce this bug, but we have seen it a few (several?) times*

Also see bug#40729

* There is a situation where the agent reconnects to the dashboard, sends a resync task, gets a new agent-id (not to confuse with agent-uuid), this is more like a session id. And it also gets a "resync" task (<taskId>-1</taskId>)

* if a heartbeat comes in between, that triggers a new resync.

And again, the agent sends a new rezync task, gets a new agent-id, and again in the loop, until it eventually goes to normal.

From the outside, you may notice how some mysql instances can take a long time to show on the UI as online (you see them on the UI with a line across the name)

How to repeat:
Because this is a timing issue, we do not have a way to reproduce it, but more or less you can try:

1- Install and start the service manager and agent
2- Let discovery happen (so that the mysqld instance is on the UI)
3- Restart tomcat (the service manager)
4- if you are looking at the agent logs, you can do something like this:

$ tail -n200  -f /Applications/mysql/enterprise/2.1/agent/mysql-monitor-agent.log  | grep setting

and you will see something like:

(debug) network-xml.c:527: resetting agent-id to 1244489824410.2 (was -1)
(debug) network-xml.c:527: resetting agent-id to 1244489830410.3 (was 1244489824410.2)
(debug) network-xml.c:527: resetting agent-id to 1244489836397.4 (was 1244489830410.3)
(debug) network-xml.c:527: resetting agent-id to 1244489842395.5 (was 1244489836397.4)
(debug) network-xml.c:527: resetting agent-id to 1244489848387.6 (was 1244489842395.5)
(debug) network-xml.c:527: resetting agent-id to 1244489854387.7 (was 1244489848387.6)
(debug) network-xml.c:527: resetting agent-id to 1244489860388.8 (was 1244489854387.7)
(debug) network-xml.c:527: resetting agent-id to 1244489866405.9 (was 1244489860388.8)
(debug) network-xml.c:527: resetting agent-id to 1244489872395.10 (was 1244489866405.9)
(debug) network-xml.c:527: resetting agent-id to 1244489878400.11 (was 1244489872395.10)
(debug) network-xml.c:527: resetting agent-id to 1244489884404.12 (was 1244489878400.11)
(debug) network-xml.c:527: resetting agent-id to 1244489890405.13 (was 1244489884404.12)

Notice the new agent-id many times per second.

This is the debug log, clean, from the time the agent started, it went straight into the loop. And it stayed in the loop until I

Attachment: rezync loop log from start - good one.gz (application/x-gzip, text), 112.99 KiB.

Last comment was supposed to be:

This is the debug log, clean, from the time the agent started, it went straight into the
loop. And it stayed in the loop until I killed the agent.

I ran several tests with the following configuration and was able to reproduce, 4 of 5 times, the problem when restarting the monitor, where the servers do not display in the dashboard for over 15 minutes while the agents display almost immediately:

Ubuntu VM
Monitor 2.0.6.71
Agent 2.0.6.7159
Single Agent - 31 5.1.35 instances monitored

(credit for Darren Oldag for spotting these entries in the logs and below input)
The agent gets into a state where it is not honoring the 'resync protocol' with the MEM server.

in the logs, i see several situations like this:

2009-09-08 13:12:01: (debug) network-io.c:1097: skipping heartbeat without resync response, resync_state RESYNC_NEED_AGENT_ID
2009-09-08 13:12:34: (debug) last message repeated 10 times

[...repeat...]

2009-09-08 14:29:37: (debug) last message repeated 10 times
2009-09-08 14:29:38: (debug) network-io.c:1097: skipping heartbeat without resync response, resync_state RESYNC_NEED_AGENT_ID

and/or ...

2009-09-06 15:20:40: (critical) scheduler.c:778: in list_known_data_items state
2009-09-06 15:20:49: (critical) last message repeated 101 times

[...repeat...]

2009-09-06 15:30:07: (critical) scheduler.c:778: in list_known_data_items state
2009-09-06 15:30:38: (critical) last message repeated 66 times

until the agent gets out of the resync state properly, the server.reachable collection cannot be retrieved, and will not mark any of the mysqld instances as 'up'.

Here somewhat different error messages followed by 4min gap in all graphs:

2009-09-17 13:17:04: (critical) exception received from server: Internal error: known items are required for list_instances processing
2009-09-17 13:17:49: (critical) G:\bs\bs\merlin\agent-2.0\src\mysql-proxy-0.7.0r1489_20090811_1738\plugins\agent\network-io.c:841: starting task 6 for mysql::server[418fdade-6547-4ccd-9556-c85cedfbc99a]

According to Leith this is most probably the same problem

Sloan Childers writes: 
available in 2.1 build 1109

Diego Medina writes: 
Verified fixed on 2.1.1.1132

A note has been added to the 2.1.1 changelog: 

 The agent could get into a state where it loop through the resynchronization of the core data, without reporting information to the service manager, causing gaps in the data.

Diego Medina writes: 
Agent 2.2.0.1611 still has the fix.

Diego Medina writes: 
Agent 2.1.1.1142 still has the fix