MySQL Bugs: #32844: some started agents do not appear on dashboard until Tomcat is restarted

Bug #32844	some started agents do not appear on dashboard until Tomcat is restarted
Submitted:	29 Nov 2007 14:07	Modified:	7 Jul 2008 12:21
Reporter:	Carsten Segieth
Status:	Can't repeat
Category:	MySQL Enterprise Monitor: Server	Severity:	S2 (Serious)
Version:	1.3.0.8384	OS:	Any
Assigned to:	Eric Herman	CPU Architecture:	Any

Description:
I often start a batch of agents within a short time frame of 1-2 min, and often some of them do not appear on the dashboard. They are listed in inv_agents, but not in inv_servers:

mysql> SELECT agent_id AS id, host FROM merlin.inv_agents WHERE host NOT IN ( SELECT host FROM merlin.inv_servers) ORDER BY host;
+----+------------------------------------------------------+
| id | host                                                 |
+----+------------------------------------------------------+
| 20 | 1.3.0.8384_01_debian3.1-x86_debx86_17                |
| 22 | 1.3.0.8384_02_fc4-x86_buildc_21                      |
| 24 | 1.3.0.8384_03_freebsd6-x86_64_bsd60-64_24            |
| 31 | 1.3.0.8384_06_hpux11.11-hppa2.0-64bit_hpux11_58      |
| 37 | 1.3.0.8384_07_hpux11.23-ia64_hpita2_26               |
| 17 | 1.3.0.8384_08_rhas3-ia64_quadita2_16                 |
| 18 | 1.3.0.8384_14_sles10-ia64_sles10-ia64-a_16           |
| 23 | 1.3.0.8384_16_sles9-ia64_sles9-ia64_25               |
| 33 | 1.3.0.8384_22_solaris10-sparc-32bit_sol10-sparc-a_06 |
| 29 | 1.3.0.8384_24_solaris10-x86_64_sol10-amd64-a_48      |
| 27 | 1.3.0.8384_32_solaris9-x86_sol9x86_38                |
| 15 | 1.3.0.8384_42_rhas5-x86_blade11_02                   |
+----+------------------------------------------------------+
12 rows in set (0.33 sec)

Waiting a while (> 15 min) does not help, but when I simply restart Tomcat with

 ./mysqlmonitorctl.sh restart tomcat
 
it takes only seconds until all 'missing servers' are registered correctly and appear on the dashboard:

mysql> SELECT agent_id AS id, host FROM merlin.inv_agents WHERE host NOT IN ( SELECT host FROM merlin.inv_servers) ORDER BY host;
Empty set (0.47 sec)

Workaround:
-----------
As described, restarting the Tomcat process helps for me.

How to repeat:
- install clean server
- start a lot of agents (here: >35) in a short time (1-2 min)
- check dashboard and database content
- restart Tomcat
- see that the agents are now registered correctly and shown

Suggested fix:
- even if processing all incoming actions takes a while, don't give up ...

This has proven to be a _very_ difficult problem to reproduce reliably in development. However, there are a few findings:

(0) This is a "known" issue: quite some time ago we determined that we had similar issues and were able to "solve" them by introducing a delay of 1 second between agent startups in scripts that start many agents at once. The problem, which is hard to reproduce even in test situations, seems to occur only if many agents all ping in at the same time after a fresh start of the MEM dashboard. This scenario isn't likely in the "real world" and is much more likely in testing situations. Perhaps this can be viewed as a documentation issue?

(1) One contributing problem seems to be contention around getting a database connection from the connection pool when the number of agents exceeds the maximum number of database connections. Normally this is not a problem, since incoming requests will wait until a thread becomes available. However, upon initial startup, the first listInventory requests are relatively long-running, and several simultaneous ones all stack up behind some mutexes.

(2) Another contributing problem seems to be contention around the ItemsCache initialization. We may be able to reduce contention by replacing generic synchronization with a ReentrantReadWriteLock.

(3) Currently we retry 3 times on MySQLTransactionRollbackException; we might wish to tune that somewhat.
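
To make findings (2) and (3) more concrete, here is a rough Java sketch of both ideas. It is not the actual MEM code: the class names ItemsCacheSketch and RollbackRetrySketch, the loader callback, and the linear backoff are all assumptions made for illustration. java.sql.SQLTransactionRollbackException stands in for the driver's MySQLTransactionRollbackException so the example compiles without Connector/J, and "retry 3 times" is read here as one initial attempt plus up to 3 retries.

```java
import java.sql.SQLException;
import java.sql.SQLTransactionRollbackException;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Function;

// Finding (2): hypothetical stand-in for ItemsCache, guarded by a
// ReentrantReadWriteLock instead of a single synchronized block.
class ItemsCacheSketch<K, V> {
    private final Map<K, V> items = new HashMap<>();
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    V get(K key, Function<K, V> loader) {
        // Fast path: any number of readers may hold the read lock at once.
        lock.readLock().lock();
        try {
            V cached = items.get(key);
            if (cached != null) {
                return cached;
            }
        } finally {
            lock.readLock().unlock();
        }
        // Slow path: the read lock cannot be upgraded, so it was released
        // above; re-check under the write lock before loading.
        lock.writeLock().lock();
        try {
            V cached = items.get(key);
            if (cached == null) {
                cached = loader.apply(key);
                items.put(key, cached);
            }
            return cached;
        } finally {
            lock.writeLock().unlock();
        }
    }

    int size() {
        lock.readLock().lock();
        try {
            return items.size();
        } finally {
            lock.readLock().unlock();
        }
    }
}

// Finding (3): retry helper with a tunable retry count and delay.
class RollbackRetrySketch {
    interface SqlWork<T> {
        T run() throws SQLException;
    }

    static <T> T withRetries(SqlWork<T> work, int maxRetries, long baseDelayMillis)
            throws SQLException, InterruptedException {
        int retriesDone = 0;
        while (true) {
            try {
                return work.run();
            } catch (SQLTransactionRollbackException e) {
                if (retriesDone >= maxRetries) {
                    throw e; // retries exhausted, surface the failure
                }
                retriesDone++;
                // Linear backoff (an assumption): wait longer after each failure.
                Thread.sleep(baseDelayMillis * retriesDone);
            }
        }
    }
}
```

A call such as RollbackRetrySketch.withRetries(() -> registerServer(agent), 3, 100L), where registerServer is a hypothetical method, would correspond to the current setting of 3 retries. Raising the retry count or the delay is the kind of tuning finding (3) alludes to. Whether it helps in the mass-startup case would still need to be measured.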

This has been an intermittent problem since the 1.0 release. There is too much risk to try to fix this for the 1.3 release. Since no customers have reported this issue, I'm going to move it forward to the 2.0 release for re-testing. Sloan

The problem never occurred in any of my 2.0 tests. I'll re-open it if it does occur.