Bug #29454 With several agents, get dup uuids and agents stop when restart Dashboard
Submitted: 30 Jun 2007 1:33 Modified: 16 Jul 2007 21:57
Reporter: Bill Weber
Status: Closed
Category:MySQL Enterprise Monitor: Server Severity:S2 (Serious)
Version:1.2.0.6281 - 1.2.0.6550 OS:Any
Assigned to: Darren Oldag CPU Architecture:Any
Tags: mer 120

[30 Jun 2007 1:33] Bill Weber
Description:
When several agents are pinging into the Dashboard and you restart the Dashboard (restart Tomcat), the agents are rejected with duplicate UUID errors and stop.

How to repeat:
- Start several agents (i.e., more than 30) pinging into the Dashboard.
- Stop/start the Dashboard (possibly several times).
[30 Jun 2007 5:29] Darren Oldag
Found/fixed a case where, if we sent a resynchronize BACK to the
agent for whatever reason, we did not release the session
identifier. The agent would respond with a -1 session, and we
would return the dup UUID error (session in use).

Please do additional testing to make sure no other scenarios are occurring. By visual inspection, I don't see how it could happen... but I don't promise to be perfect.
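The failure sequence described in the fix can be modeled with a small in-memory registry. This is a minimal sketch under stated assumptions: the class `SessionRegistry`, its methods, and the rule that an unrecognized session id triggers a resync are hypothetical and are not the actual Dashboard code. Only the -1 session, the resync, and the E1402 message come from this report.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the Dashboard's per-agent session bookkeeping.
// All names are illustrative; the real MySQL Enterprise Monitor code is not shown in this bug.
class SessionRegistry {
    // Session id an agent sends when it has no session (the "-1 session" from the comment).
    static final long NO_SESSION = -1L;

    static class DuplicateUuidException extends RuntimeException {
        DuplicateUuidException(String uuid) {
            super("E1402:  Duplicate agent uuid \"" + uuid + "\" detected.");
        }
    }

    private final Map<String, Long> activeSessions = new HashMap<>();
    private long nextSessionId = 1L;

    // Returns the session id the agent should use, or NO_SESSION to request a resync.
    synchronized long handleRequest(String uuid, long sessionId) {
        if (sessionId == NO_SESSION) {
            if (activeSessions.containsKey(uuid)) {
                // The UUID still holds a session, so this looks like a second agent.
                throw new DuplicateUuidException(uuid);
            }
            long id = nextSessionId++;
            activeSessions.put(uuid, id);
            return id;
        }
        Long current = activeSessions.get(uuid);
        if (current == null || current != sessionId) {
            // Session not recognized (e.g., a stale id): ask the agent to resynchronize.
            return sendResync(uuid);
        }
        return sessionId;
    }

    private long sendResync(String uuid) {
        // The fix: release the session identifier before sending the resync.
        // Without this line, the agent's follow-up -1 request is rejected with E1402.
        activeSessions.remove(uuid);
        return NO_SESSION;
    }

    public static void main(String[] args) {
        SessionRegistry registry = new SessionRegistry();
        String uuid = "7d81ce42-14b3-4085-9319-9a354182fa46";
        long session = registry.handleRequest(uuid, NO_SESSION);  // first contact
        long reply = registry.handleRequest(uuid, session + 100); // stale id -> resync
        long renewed = registry.handleRequest(uuid, reply);       // agent retries with -1
        System.out.println(session + " " + reply + " " + renewed); // prints "1 -1 2"
    }
}
```

Removing the `activeSessions.remove(uuid)` call reproduces the reported symptom in this model: the third request throws the E1402 duplicate-UUID exception instead of receiving session 2.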
[3 Jul 2007 11:15] Carsten Segieth
With freshly installed 1.2.0.6311 agents, I see several of them stopping with 'Dup UUID' messages like the one shown below: some near the time the Dashboard was stopped/restarted, others 2.5 to 4.5 hours later. The server was stopped at ~23:35 CEST and started at ~23:45 CEST.

In total, about 50% of the agents stopped!

2007-07-02 23:52:14: (message) --> sending heartbeat
2007-07-02 23:52:14: (debug) --> sending: <doc><agentId>ab971529-2da2-4f23-ba01-996547195f72</agentId><agentUtc>2007-07-02T21:52:14.328Z</agentUtc><hostname>1.2.0.6311_41_rhas5-x86_64_blade10_10</hostname><uuid>7d81ce42-14b3-4085-9319-9a354182fa46</uuid><version>1.2.0.6311</version><shutdown>true</shutdown><tasks/></doc>

2007-07-02 23:52:42: (debug) <-- received: <?xml version='1.0'?><exceptions><error><![CDATA[E1402:  Duplicate agent uuid "7d81ce42-14b3-4085-9319-9a354182fa46" detected.]]></error></exceptions>

2007-07-02 23:52:42: (critical) exception received from server: E1402:  Duplicate agent uuid "7d81ce42-14b3-4085-9319-9a354182fa46" detected.
2007-07-02 23:52:42: (critical) server asked us to shutdown
2007-07-02 23:52:42: (message) stopping Agent Version: 1.2.0.6311

The agent logfiles are available in "/nfstmp1/merlin/agent/1.2.0.6311/*/*/log/10.100.1.224.*.log", e.g. as user mysqldev@production. The above example is from 'blade10', but it also happened on: aix52, debx86, buildc, hp3750, hpita2, rx2620b, rhas3-x86, rh-x86-64, net-qa1, rh-x86-32, blade10, blade11, blade01, sles9-ia64, blade08.

I've set the priority to P1, as this makes the whole system (agent/server) nearly unusable.
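The server's rejection in the log above arrives as an `<exceptions>` document with the message inside a CDATA section. When triaging many agent logs, a check like the following can flag E1402 replies. It is a hypothetical helper (the class and method names are invented) built only on the standard JAXP DOM parser, not on any MySQL Enterprise Monitor API, and it assumes the reply text has already been extracted from the log line.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for classifying a captured server reply; not part of the agent.
class ServerReplyCheck {
    // Returns the text of every <error> element (CDATA content included).
    static List<String> errors(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        NodeList nodes = doc.getElementsByTagName("error");
        List<String> messages = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            messages.add(nodes.item(i).getTextContent().trim());
        }
        return messages;
    }

    // True if any error carries the E1402 duplicate-UUID code seen in this bug.
    static boolean isDuplicateUuid(String xml) throws Exception {
        for (String message : errors(xml)) {
            if (message.startsWith("E1402:")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws Exception {
        String reply = "<?xml version='1.0'?><exceptions><error><![CDATA[E1402:  Duplicate agent uuid "
                + "\"7d81ce42-14b3-4085-9319-9a354182fa46\" detected.]]></error></exceptions>";
        System.out.println(isDuplicateUuid(reply)); // prints "true"
    }
}
```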
[5 Jul 2007 23:16] Darren Oldag
The fix I made was in r6320, so please check a build later than that.

I'm only basing this statement on the Version field of this bug, which says builds up to 6311 were tested.
[7 Jul 2007 19:09] Bill Weber
Assigning Carsten as the Verifier since he is still seeing the dup uuids.
[12 Jul 2007 5:21] Carsten Segieth
Occurred again: the agent (version 1.2.0.6550) on vista-x86, one of 72 agents running against the server (freshly installed 1.2.0.6550), stopped with

 2007-07-12 03:12:04: (critical) exception received from server: E1402:  Duplicate agent uuid "ddf22f1b-7866-40cf-995d-81752c6087f8" detected.
 2007-07-12 03:12:04: (critical) server asked us to shutdown

in the log. Debug logging was not enabled, so this was the only entry in the log, and it was the only agent that stopped.

AFAIK the server was not stopped during the night. I'll try to save a dump of the server...
[12 Jul 2007 6:16] Carsten Segieth
Dump is in https://intranet.mysql.com/~csegieth/merlin/net-qa1_2007-07-12-08.03.34_dump.tgz, server log files in ..._logs.tgz
[12 Jul 2007 15:51] Darren Oldag
I see that the agent sent a resync command to the server. Without knowing what the agent sent right BEFORE that, I can't debug it further.

Also, the logs were full of null pointer and logging exceptions. The state of the build was not very 'clean'. Please try again with a build later than 6573.
[16 Jul 2007 18:38] Bill Weber
Andrew found an agent log with debug info that showed the dup UUID and emailed it to Oldag.
[16 Jul 2007 21:11] Darren Oldag
THIS bug is fixed.

ANOTHER dup uuid bug is still there, but this bug is confusing the issue.

Please do whatever it takes to close this one, and then open another one.
[16 Jul 2007 21:57] Bill Weber
Tried starting/stopping tomcat several times with build 1.2.0.6610 - no errors.