Bug #26279 Send failure: Connection was reset (java.lang.OutOfMemoryError: Java heap space)
Submitted: 12 Feb 2007 8:38
Modified: 7 Sep 2007 15:10
Reporter: Carsten Segieth
Status: Closed
Category: MySQL Enterprise Monitor: Server
Severity: S1 (Critical)
Version: -
OS: Any
Assigned to: Sloan Childers
CPU Architecture: Any
Tags: java heap space, mer 120

[12 Feb 2007 8:38] Carsten Segieth
After the weekend I cannot log in to the server; on the login page I get the message

 Send failure: Connection was reset

- stopping and restarting Firefox does not change the behaviour
- all 3 Windows services are up and running; I have not tried restarting them yet
- Tomcat memory currently at ~ 268 MB, peak was 303 MB
- there are (were ...?) 42 agents running; ~ 20 version and ~ 22 version, with all rules scheduled against all agents

Log files: 

Database dump: https://intranet.mysql.com/~csegieth/merlin/NET-QA2_2007-02-12-%208.51.35,91_dump.zip

How to repeat:
see above
[12 Feb 2007 13:29] Carsten Segieth
in case the agent logs are needed: /users/csegieth/mysql/network/agent/*/*/log/*.log
[15 Feb 2007 19:16] Sloan Childers
We need to verify that we are not writing these huge log messages about the agent deserialization problem into the server logs.
[15 Feb 2007 19:17] Sloan Childers
Let's make sure this is reproducible with the latest build and that we don't have some sort of agent/server mismatch.
[15 Feb 2007 19:29] Carsten Segieth
- all agents were 1.1.0 - were there any known difficulties? If so, they were not communicated prominently enough
- could not be reproduced again with a newer server installation --> closed with 'can't repeat'
[7 Mar 2007 14:21] Carsten Segieth
Occurred again after ~ 3 1/2 days - I cannot log in to the dashboard, the error message is

 * Send failure: Connection was reset

and I have tons of

 java.lang.OutOfMemoryError: Java heap space

in 'stdout_20070303.log'.

PsList output is in 'windows_processes.log' of the attached zip.
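
A minimal sketch of how the OutOfMemoryError entries in a log such as the stdout log above could be tallied, assuming the log is plain text with one message per line. The function name `count_oom` and the grouping by the text after the exception name are illustrative choices, not part of the product or of this report.

```python
from collections import Counter
from typing import Iterable

# Marker taken from the error text quoted in this report.
OOM_MARKER = "java.lang.OutOfMemoryError"


def count_oom(lines: Iterable[str]) -> Counter:
    """Count lines containing the OOM marker, grouped by the detail after it.

    Hypothetical helper: accepts any iterable of strings (an open file works).
    """
    counts: Counter = Counter()
    for line in lines:
        idx = line.find(OOM_MARKER)
        if idx == -1:
            continue  # not an OutOfMemoryError line
        # Everything after the marker, e.g. "Java heap space".
        detail = line[idx + len(OOM_MARKER):].lstrip(": ").strip()
        counts[detail or "(no detail)"] += 1
    return counts
```

Passing an open file object (for example the result of `open(path, errors="replace")`, where `path` points at the log in question) gives a per-message count, which makes it easier to compare how many heap-space errors appear in different runs.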

Last noticed agent heartbeat was ~ 3 days after startup of the server:

*************************** 1. row ***************************
                ID: 1
              host: Merlin
               ACT: 1
               NTF: 1
agm_last_heartbeat: 2007-03-06 16:15:38
              uuid: 1a5babfa-e5da-48b3-9e0c-f219d01531f4
               INT: 10
               TRE: 10

- installed as an update of
- currently ~ 120 agents registered, where > 50 are dead
- re-inventory at 4 h
- most of the 'active' >> 50 agents have all rules scheduled at default freq.
[7 Mar 2007 14:22] Carsten Segieth
pslist output, stdout, ...

Attachment: NET-QA2_2007-03-07-10.30.58,00_logs.zip (application/x-zip-compressed, text), 102.49 KiB.

[9 Mar 2007 2:10] Sloan Childers
It does look like he has current code; it does not look like the SingleThreaded DC Writer Thread Death issue.

However, it looks like something "bad" happens with the database:
First, several connections failed on receive.
Then there were lots of "connection refused" errors before things went south.
It is as though the connections in the connection pool start to die, resulting in the initial failures, and then any attempts by the thread pool to acquire new connections to replace the dead ones yield "connection refused".

HOWEVER, this doesn't tell us anything about what happened or what caused it to happen.

So we're sort of stuck right now.

We are also not clear on why this scenario would result in an OutOfMemoryError.
[3 May 2007 16:40] Carsten Segieth
- occurred again today on a RH4-x86 server that I've updated from to
- before the crash there were ~ 74 agents, half of them fresh 5336 installs and half updated 5214 to 5336. Maybe the problem came from starting ~25 agents at nearly the same time after updating the software?
- logs and dump in https://intranet.mysql.com/~csegieth/bugs/26279/
[29 Jun 2007 15:44] Eric Herman
The customer reports that they get that error with 1.1.1 but not with 1.1.0.
[29 Jun 2007 18:24] Andy Bang
More info from Chris/customer:

o RHEL4 on x86_64
o Per customer - "No other changes. All I did was stop all the agents, upgrade the manager from 1.1 to 1.1.1, and then upgrade the agents from 1.1 to 1.1.1"
o In other words, they didn't add any agents or schedule more rules
o Summary from customer:
After upgrading the MySQL Network Monitor to version 1.1.1 the dashboard began crashing leaving the following messages on the page:

Internal Error: Java heap space
#0 /var/opt/mysql/network/monitoring/dashboard/lib/Merlin/Rest/Service.php(146): Merlin_Rest_Service->post('/merlin/monitor...', Array, 1)
#1 /var/opt/mysql/network/monitoring/dashboard/lib/Merlin/Rest/Service/Monitor.php(52): Merlin_Rest_Service->send(Array)
#2 /var/opt/mysql/network/monitoring/dashboard/lib/Merlin/Model/Rule.php(98): Merlin_Rest_Service_Monitor->listMonitors(Array)
#3 /var/opt/mysql/network/monitoring/dashboard/lib/Merlin/Controller/Dashboard.php(31): Merlin_Model_Rule->getOne('Table Scans Exc...')
#4 /var/opt/mysql/network/monitoring/dashboard/lib/Merlin/Controller/Dashboard.php(54): Merlin_Controller_Dashboard->getGridMonitors()
#5 /var/opt/mysql/network/monitoring/dashboard/lib/Merlin/Controller/Dashboard.php(214): Merlin_Controller_Dashboard->getValuesForServer(Object(Merlin_Rest_Data_Entry_InventoryServer))
#6 /var/opt/mysql/network/monitoring/dashboard/lib/Merlin/Controller/Dashboard.php(93): Merlin_Controller_Dashboard->getHeatChartInfo()
#7 [internal function]: Merlin_Controller_Dashboard->index(1, 'time', true)
#8 /var/opt/mysql/network/monitoring/dashboard/lib/Merlin/Controller.php(225): call_user_func_array(Array, Array)
#9 /var/opt/mysql/network/monitoring/dashboard/htdocs/index.php(46): Merlin_Controller->dispatch()
#10 {main}
[7 Sep 2007 15:10] Peter Lavin
Added to the changelog for version 1.2.