Bug #49739 agent leaks (with unstable SSL connection?) (curl problem?)
Submitted: 16 Dec 2009 13:43 Modified: 20 Apr 2010 15:11
Reporter: Andrii Nikitin Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Enterprise Monitor: Agent Severity:S2 (Serious)
Version:2.1.0.1093 OS:Any
Assigned to: Michael Schuster CPU Architecture:Any

[16 Dec 2009 13:43] Andrii Nikitin
Description:
A specific testing environment shows that the agent's memory usage increases by ~50-100M in 24 hours.
valgrind reports much smaller leaks (but they may be related):

406,095 (22,528 direct, 383,567 indirect) bytes in 88 blocks are definitely lost in loss record 156 of 156
   at 0x4C25153: malloc (vg_replace_malloc.c:195)
   by 0x7227189: default_malloc_ex (in /home/a/mysql/enterprise/agent/lib/mysql-proxy/libcrypto.so.0.9.8)
   by 0x7226E8A: CRYPTO_malloc (in /home/a/mysql/enterprise/agent/lib/mysql-proxy/libcrypto.so.0.9.8)
   by 0x70AABF6: SSL_SESSION_new (in /home/a/mysql/enterprise/agent/lib/mysql-proxy/libssl.so.0.9.8)
   by 0x70AACCB: ssl_get_new_session (in /home/a/mysql/enterprise/agent/lib/mysql-proxy/libssl.so.0.9.8)
   by 0x7096120: ssl3_get_server_hello (in /home/a/mysql/enterprise/agent/lib/mysql-proxy/libssl.so.0.9.8)
   by 0x70958BC: ssl3_connect (in /home/a/mysql/enterprise/agent/lib/mysql-proxy/libssl.so.0.9.8)
   by 0x70A7684: SSL_connect (in /home/a/mysql/enterprise/agent/lib/mysql-proxy/libssl.so.0.9.8)
   by 0x6F5FB8D: ossl_connect_step2 (in /home/a/mysql/enterprise/agent/lib/mysql-proxy/libcurl.so.4.1.1)
   by 0x6F614B1: ossl_connect_common (in /home/a/mysql/enterprise/agent/lib/mysql-proxy/libcurl.so.4.1.1)
   by 0x6F61562: Curl_ossl_connect (in /home/a/mysql/enterprise/agent/lib/mysql-proxy/libcurl.so.4.1.1)
   by 0x6F6DD54: Curl_ssl_connect (in /home/a/mysql/enterprise/agent/lib/mysql-proxy/libcurl.so.4.1.1)

LEAK SUMMARY:
   definitely lost: 22,880 bytes in 91 blocks
   indirectly lost: 395,957 bytes in 8,750 blocks
     possibly lost: 29,560 bytes in 16 blocks
   still reachable: 49,797 bytes in 374 blocks
        suppressed: 0 bytes in 0 blocks

How to repeat:
Not sure which steps are necessary and which are sufficient, because the leak is not stable.

1. Start the MEM dashboard with SSL support enabled

2. configure the agent on the same host with the following options:

2.1 use SSL and the Ethernet IP to connect to the MEM dashboard (not localhost or 127.0.0.1), e.g.:
agent-mgmt-hostname = https://agent:1@192.168.1.1:18443/heartbeat

2.2 use the proxy to connect to the MEM repository:
proxy-address=:6446
proxy-backend-addresses = 127.0.0.1:13306

2.3 configure agent/etc/instances/ to monitor the MEM repository and one more MySQL instance (i.e. create mysql/agent-instance.ini and mysql2/agent-instance.ini)

3. start the agent (optionally under valgrind to collect leak statistics)
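For step 3, a valgrind invocation along these lines can be used; note that the agent binary name, paths, and options below are assumptions based on a typical agent install layout, not taken from this report:

```
valgrind --leak-check=full --track-origins=yes --log-file=agent-valgrind.log \
    ./bin/mysql-monitor-agent --defaults-file=etc/mysql-monitor-agent.ini
```

Running under valgrind slows the agent considerably, so the memory-growth numbers in step 10 will accumulate more slowly.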

4. Change the parameter "mysql.port" to 6446 in
monitor/apache-tomcat/webapps/ROOT/WEB-INF/config.properties,
then restart Tomcat so that it connects through the agent's proxy

5. Enable QUAN for all servers

6. run the following infinite loop to generate workload on the proxy:
while true
do
  mysqlslap -uservice_manager -p1 -h127.0.0.1 -P6446 --auto-generate-sql --auto-generate-sql-load-type=write --concurrency=10 --engine=innodb --commit=1 --iterations=1 --number-of-queries=10000
  sleep 10
done

7. Revoke the 'SUPER' privilege from the agent user so that collecting INNODB STATUS produces errors
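For step 7, the revoke might be done along these lines (the account name 'agent'@'localhost' is an assumption; use the account configured in agent-instance.ini). Note that the error in step 9 complains about the PROCESS privilege, which is what SHOW ENGINE INNODB STATUS actually checks, so PROCESS must also be absent for the errors to appear:

```
mysql -uroot -p -e "REVOKE SUPER ON *.* FROM 'agent'@'localhost';"
```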

8. run the following infinite loop as root (sudo) to emulate an unstable network connection (please note that only the agent->dashboard connection will be unstable, the others should be fine):

while true
do
  ifconfig eth0 down
  sleep 120
  ifconfig eth0 up
  sleep 350
done

9. check that the following error messages continuously appear in the agent log:
(critical) job_collect_mysql.c:1849: fetching the QUAN config failed
(critical) network-io.c:277: curl_easy_perform('https://agent:1@10.0.2.15:18443/heartbeat';) failed: Failed to connect to 10.0.2.15: Network is unreachable (curl-error = 'Couldn't connect to server' (7))
(critical) network-io.c:310: successfully reconnected to dashboard at https://agent:1@10.0.2.15:18443/heartbeat
(critical) executing 'SHOW /*!50000 ENGINE */ INNODB STATUS' failed: Access denied; you need the PROCESS privilege for this operation (122

10. see the agent's memory usage increase; it should grow by ~20M in a few hours and ~50-100M in 24 hours, e.g. 235 -> 270 -> 303 -> 370 M
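To quantify the growth in step 10, the agent's resident set size can be sampled at an interval with a helper like the one below; this is a sketch, and the "mysql-monitor-agent" process name in the comment is an assumption:

```shell
# Print a timestamped RSS line for a pid, <count> times, every <interval> seconds.
sample_rss() {
  pid=$1; count=$2; interval=$3
  i=0
  while [ "$i" -lt "$count" ]
  do
    rss_kb=$(ps -o rss= -p "$pid")
    echo "$(date '+%F %T') pid=$pid rss=$((rss_kb / 1024))M"
    i=$((i + 1))
    sleep "$interval"
  done
}

# For the agent, something like (process name is an assumption):
#   sample_rss "$(pgrep -f mysql-monitor-agent | head -n1)" 1440 60
# Quick demonstration against the current shell:
sample_rss $$ 2 1
```

Logging a sample once a minute for 24 hours (count=1440, interval=60) makes the 235 -> 370 M progression above easy to graph.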

Suggested fix:
fix leaks (test curl?)
[17 Dec 2009 3:14] Enterprise Tools JIRA Robot
Diego Medina writes: 
Quick notes:

To reproduce on Mac, I am trying this:

{noformat}
while [ true ] ; 
  do  sudo  ipfw -a add deny tcp from any to any 48443 ;
  sleep 10 ; 
  echo "delete " ; 
  sudo ipfw -a delete 00110 ; 
  sleep 120 ;  
done ;
{noformat}

My dashboard uses 48443 for the ssl port.

It is still too early to say that I also see this bug.
[24 Dec 2009 20:01] Enterprise Tools JIRA Robot
Diego Medina writes: 
Much easier way to reproduce (tested on Mac OS X 10.5)

1- Install an agent and the service manager
2- Set the service manager to send queries through the proxy (selfquan)
3- Run this loop to restart Tomcat; this results in the agent reaching a timeout and an "Unknown SSL protocol error in connection to" error

{noformat}
while [ true ] ;
  do ./monitor.sh 22 restart tomcat ;
  echo "Server is up" ; 
  sleep 60 ; 
done ;
{noformat}

in about an hour or two, the agent was at 200MB and still going up
[7 Jan 2010 21:43] Enterprise Tools JIRA Robot
Keith Russell writes: 
Patch installed in versions => 2.2.0.1597.
[8 Apr 2010 11:41] Enterprise Tools JIRA Robot
Diego Medina writes: 
Verified fixed on 2.2.0.1686.

after running the test for two hours, memory was stable at 20MB, while before the fix it would reach 200MB in just an hour
[20 Apr 2010 15:11] MC Brown
A note has been added to the 2.2.0 changelog: 

The memory footprint of the MySQL Enterprise Monitor Agent would slowly increase over time, particularly on Mac OS X and Solaris/OpenSolaris.