Bug #97935 Memory leak in client connection using information_schema
Submitted: 10 Dec 2019 2:01
Modified: 16 Jan 2020 15:25
Reporter: Daniel Nichter
Status: Verified
Impact on me: None
Category: MySQL Server: Information schema
Severity: S2 (Serious)
Version: 5.7.28
OS: Any
Assigned to:
CPU Architecture: x86
Tags: 5.7, client, connect, disconnect, information_schema, massif, memory-leak, valgrind, vio

[10 Dec 2019 2:01] Daniel Nichter
Description:
With MySQL Community Server 5.7.28, memory consumption (RSS) increases without bound as clients connect and disconnect. The Valgrind Massif heap profiler attributes the allocation to the call stack around mysql_socket_vio_new (vio.c:286):

100.00% (14,625,398,784B) (page allocation syscalls) mmap/mremap/brk, --alloc-fns, etc.
->99.49% (14,551,310,336B) 0x61A5999: mmap (in /usr/lib64/libc-2.17.so)
| ->44.97% (6,576,668,672B) 0x612D57F: new_heap (in /usr/lib64/libc-2.17.so)
| | ->44.97% (6,576,668,672B) 0x612DFB2: arena_get2.isra.3 (in /usr/lib64/libc-2.17.so)
| |   ->44.97% (6,576,668,672B) 0x61330FC: malloc (in /usr/lib64/libc-2.17.so)
| |     ->41.76% (6,106,906,624B) 0xF7E6F6: my_malloc (my_malloc.c:203)
| |     | ->41.30% (6,039,797,760B) 0x12F49E5: mysql_socket_vio_new (vio.c:286)
| |     | | ->41.30% (6,039,797,760B) 0xEE4F49: Channel_info::create_thd() (channel_info.cc:35)
| |     | |   ->33.04% (4,831,838,208B) 0xE2493C: Channel_info_tcpip_socket::create_thd() (socket_connection.cc:229)
| |     | |   | ->33.04% (4,831,838,208B) 0xE21951: handle_connection (connection_handler_per_thread.cc:189)
| |     | |   |   ->33.04% (4,831,838,208B) 0x10004E2: pfs_spawn_thread (pfs.cc:2197)
| |     | |   |     ->33.04% (4,831,838,208B) 0x4C2DE23: start_thread (in /usr/lib64/libpthread-2.17.so)
| |     | |   |       ->33.04% (4,831,838,208B) 0x61AB34B: clone (in /usr/lib64/libc-2.17.so)
| |     | |   |         
| |     | |   ->08.26% (1,207,959,552B) 0xE249CB: Channel_info_local_socket::create_thd() (socket_connection.cc:176)
| |     | |     ->08.26% (1,207,959,552B) 0xE21951: handle_connection (connection_handler_per_thread.cc:189)
| |     | |       ->08.26% (1,207,959,552B) 0x10004E2: pfs_spawn_thread (pfs.cc:2197)
| |     | |         ->08.26% (1,207,959,552B) 0x4C2DE23: start_thread (in /usr/lib64/libpthread-2.17.so)
| |     | |           ->08.26% (1,207,959,552B) 0x61AB34B: clone (in /usr/lib64/libc-2.17.so)
| |     | |             
| |     | ->00.46% (67,108,864B) in 1+ places, all below ms_print's threshold (01.00%)
| |     | 
| |     ->01.84% (268,435,456B) 0x58F21BB: operator new(unsigned long) (in /usr/lib64/libstdc++.so.6.0.19)
| |     | ->01.84% (268,435,456B) in 3 places, all below massif's threshold (1.00%)
| |     |   
| |     ->01.38% (201,326,592B) in 2 places, all below massif's threshold (1.00%)
| |       
| ->42.21% (6,174,015,488B) 0x612D50E: new_heap (in /usr/lib64/libc-2.17.so)
| | ->42.21% (6,174,015,488B) 0x612DFB2: arena_get2.isra.3 (in /usr/lib64/libc-2.17.so)
| |   ->42.21% (6,174,015,488B) 0x61330FC: malloc (in /usr/lib64/libc-2.17.so)
| |     ->40.84% (5,972,688,896B) 0xF7E6F6: my_malloc (my_malloc.c:203)
| |     | ->40.84% (5,972,688,896B) 0x12F49E5: mysql_socket_vio_new (vio.c:286)
| |     |   ->40.84% (5,972,688,896B) 0xEE4F49: Channel_info::create_thd() (channel_info.cc:35)
| |     |     ->32.12% (4,697,620,480B) 0xE2493C: Channel_info_tcpip_socket::create_thd() (socket_connection.cc:229)
| |     |     | ->32.12% (4,697,620,480B) 0xE21951: handle_connection (connection_handler_per_thread.cc:189)
| |     |     |   ->32.12% (4,697,620,480B) 0x10004E2: pfs_spawn_thread (pfs.cc:2197)
| |     |     |     ->32.12% (4,697,620,480B) 0x4C2DE23: start_thread (in /usr/lib64/libpthread-2.17.so)
| |     |     |       ->32.12% (4,697,620,480B) 0x61AB34B: clone (in /usr/lib64/libc-2.17.so)
| |     |     |         
| |     |     ->08.72% (1,275,068,416B) 0xE249CB: Channel_info_local_socket::create_thd() (socket_connection.cc:176)
| |     |       ->08.72% (1,275,068,416B) 0xE21951: handle_connection (connection_handler_per_thread.cc:189)
| |     |         ->08.72% (1,275,068,416B) 0x10004E2: pfs_spawn_thread (pfs.cc:2197)
| |     |           ->08.72% (1,275,068,416B) 0x4C2DE23: start_thread (in /usr/lib64/libpthread-2.17.so)
| |     |             ->08.72% (1,275,068,416B) 0x61AB34B: clone (in /usr/lib64/libc-2.17.so)

Important to know:
* MySQL is a read_only=1 replica with no direct writes. (A few writes replicate from the master.)
* Performance Schema is OFF.
* Happens on many separate and unrelated MySQL clusters with different workloads; it's not isolated to the lab.
* MySQL is running in a Docker container, but this memory leak can be reproduced outside Docker.
* The test code runs the same primary-key SELECT, so little to no data is needed in the buffer pool.
* The InnoDB buffer pool is 1G and the test table has 1k rows, so buffer pool size is not a factor.
* Happens on 5.7.26, too.
* Memory is never released.
* Memory consumption often continues to increase after the test that reproduces it stops.

Full Valgrind Massif report, SHOW GLOBAL VARIABLES, log of RSS, and my.cnf at https://github.com/daniel-nichter/lab/tree/master/mysql-mem-leak
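
For reference, the exact Valgrind invocation isn't included in the report; a Massif run along the following lines produces page-level attribution like the trace above (--pages-as-heap=yes is an assumption based on the "(page allocation syscalls)" header):

  # Sketch: profile mysqld page allocations with Valgrind Massif.
  # --pages-as-heap=yes makes Massif account for mmap/brk pages rather than
  # only malloc'd heap, which matches the trace header shown above.
  valgrind --tool=massif --pages-as-heap=yes \
           --massif-out-file=/tmp/massif.out.%p \
           mysqld --defaults-file=/etc/my.cnf --user=mysql

  # After running the test load, render the text report:
  ms_print /tmp/massif.out.<mysqld-pid> > massif-report.txt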

How to repeat:
In general: connect hundreds of clients at a time, query for a few seconds, disconnect all, and repeat several times (< 15 mins). This causes unbounded memory increase like the graph at: https://github.com/daniel-nichter/lab/blob/master/mysql-mem-leak/mysql-memleak-1.png
That test stopped around 17:45, when Threads_connected dropped from the last big spike, but memory consumption continued to increase for many hours.

Specifically, this sequence reliably reproduces the issue:
* 200 clients (connect) doing 500 QPS (across all clients) for 5s (then all disconnect)
* 250 clients doing 500 QPS for 10s
* 200 clients doing 500 QPS for 10s
* 250 clients doing 1000 QPS for 5s
* 300 clients doing 1000 QPS for 5s
* 250 clients doing 1000 QPS for 20s
* 250 clients doing 1000 QPS for 20s
* 300 clients doing 1000 QPS for 60s
* 400 clients doing 1000 QPS for 60s
* 100 clients doing 1000 QPS for 60s

Another simpler method:
* 300 clients doing 1000 QPS for 30 minutes.

That produced the purple line in this graph starting at 18:30: https://github.com/daniel-nichter/lab/blob/master/mysql-mem-leak/mysql-memleak-2.png
Important to note: memory usage began to increase more rapidly at the _end_ of the test at 19:00.
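
The test client itself isn't attached; a rough approximation of the simpler 300-client pattern above, using mysqlslap (the schema, table, query, and credentials below are placeholders, and the QPS is only approximate):

  # Sketch only: repeated bursts of 300 clients that connect, run primary-key
  # SELECTs, and disconnect. Not the real test code.
  # --no-drop prevents mysqlslap from dropping the schema when it finishes.
  for i in $(seq 1 60); do                     # roughly 30 minutes of bursts
    mysqlslap --host=127.0.0.1 --user=root --password=secret \
              --create-schema=test --no-drop \
              --concurrency=300 \
              --number-of-queries=15000 \
              --query="SELECT * FROM t WHERE id = 42"
    sleep 1
  done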
[10 Dec 2019 13:01] Bogdan Kecman
Hi,
I was testing something similar recently but could not reliably reproduce.

Can you please give me your config file.

Also, just to confirm: you stop the test, disconnect all the clients, wait a few minutes, and the RAM is not freed, right?

thanks
[10 Dec 2019 13:32] Daniel Nichter
my.cnf

Attachment: my.cnf (application/octet-stream, text), 6.44 KiB.

[10 Dec 2019 13:33] Daniel Nichter
SHOW GLOBAL VARIABLES

Attachment: global_vars.txt (text/plain), 15.19 KiB.

[10 Dec 2019 13:43] Daniel Nichter
Hi Bogdan,

Added config file and SHOW GLOBAL VARIABLES output.

Yes, when I stop my test program, all the test connections disconnect, but the memory is never freed. I've tried many things to get MySQL to free some memory (without restarting MySQL), but nothing works. We have many real databases exhibiting this issue and, so far, none of them free the memory, even when we fail over off a node to make it read-only and inactive (no connections apart from replication and periodic monitoring). The only workaround is restarting MySQL.

Thanks!
[10 Dec 2019 13:45] Bogdan Kecman
Thanks, all clear. Let me try to reproduce.

You don't have any events running?

all best
[10 Dec 2019 13:54] Daniel Nichter
No events. My test instance was newly provisioned, so it has no real data or usage.

Thanks for looking into this! Let me know if you need more info or want me to run other experiments on my side.
[10 Dec 2019 16:24] Bogdan Kecman
Hi Daniel,

I'm getting MySQL to increase RAM usage and not return it, but I cannot get it to allocate more RAM than expected; it reaches a "cap" and does not allocate beyond that cap. Did you manage to get it to OOM, or does something else happen?
[10 Dec 2019 17:06] Daniel Nichter
Hi Bogdan,

We don't OOM kill MySQL. Instead, on certain hosts running many MySQL instances, MySQL will eventually use all RAM, which causes swapping, then high system load, and then (sometimes) the system becomes unresponsive (unless we catch it ahead of time, fail over, and restart all the MySQL instances on the host).

What cap are you reaching? How much do you expect it to allocate at most? (What's your setup and what are your results in general?)
[10 Dec 2019 17:41] Bogdan Kecman
Not reproducing :(

I'm hitting the slave with 100, 200, 250, 500, 1000, 200, 200, 500 threads, each thread doing approx. 10,000 SELECTs (simple PK lookups, ranges, long ranges, etc.),

and the RAM does increase while the slave is being hit, but only up to a point; when I stop, it drops back down (not completely, of course, but to a point).

   PID %MEM   RSZ COMMAND
 38213  9.6 387600 mysqld
 38213 14.1 566580 mysqld
 38213 14.2 571596 mysqld
 38213 14.2 571596 mysqld
 38213 14.3 573720 mysqld
 38213 14.3 573720 mysqld
 38213 14.3 573720 mysqld
 38213 14.3 573720 mysqld
 38213 14.3 573720 mysqld
 38213 14.3 573936 mysqld
 38213 14.3 573808 mysqld
 38213 14.3 574656 mysqld
 38213 14.3 574528 mysqld
 38213 14.3 573976 mysqld
 38213 14.3 574504 mysqld
 38213 14.3 573800 mysqld
 38213 14.3 573800 mysqld
 38213 14.3 573800 mysqld
[10 Dec 2019 17:43] Bogdan Kecman
This is all connecting through the socket; I'm running a test now connecting to the network port to see if that makes a difference.
[10 Dec 2019 20:16] Daniel Nichter
I think the issue might be Docker. I managed to un-containerize my test MySQL instance and run it outside Docker, and I could not reproduce the issue. Outside Docker, MySQL behaves as expected: some memory is allocated, but when the test stops, RSS stops increasing. There's about a 35% memory increase, which I'd call normal.

When I put it back in Docker, the issue is easily reproducible. Given that Valgrind attributes the memory allocation to mysqld, this suggests that Docker is interfering with low-level memory allocation. There's a >150% memory increase.

We can put this bug report on hold while I continue to test.
[10 Dec 2019 20:36] Bogdan Kecman
Hi,

I understood that you could reproduce this outside of Docker too.

I've been running this for a few hours and I don't see any invalid RAM allocations. RAM usage is stable and drops down after connections are released.

I tried both with and without prepared statements, through both the socket and the network port, with 200 to 1500 simultaneous connections and all types of read-only queries.

last run:
 38213 14.3 573800 mysqld
 38213 14.3 573920 mysqld
 38213 14.3 574504 mysqld
 38213 14.3 574608 mysqld
 38213 14.3 574464 mysqld
 38213 14.3 574568 mysqld
 38213 14.3 574480 mysqld
 38213 14.3 574616 mysqld
 38213 14.3 574504 mysqld
 38213 14.3 574664 mysqld
 38213 14.3 574504 mysqld
 38213 14.3 574576 mysqld
 38213 14.3 574448 mysqld
 38213 14.3 574704 mysqld
 38213 14.3 574528 mysqld
 38213 14.3 574600 mysqld
 38213 14.3 574512 mysqld
 38213 14.3 574600 mysqld
 38213 14.3 574488 mysqld
 38213 14.3 574632 mysqld
 38213 14.3 574520 mysqld
 38213 14.3 574608 mysqld
 38213 14.3 574568 mysqld
 38213 14.3 574704 mysqld
 38213 14.3 574528 mysqld
 38213 14.3 574504 mysqld
 38213 14.3 574608 mysqld
 38213 14.3 574696 mysqld
 38213 14.3 574640 mysqld
 38213 14.3 574432 mysqld
 38213 14.3 574568 mysqld
 38213 14.3 574480 mysqld
 38213 14.3 574600 mysqld
 38213 14.3 574472 mysqld
 38213 14.3 573856 mysqld

The Docker issue, on the other hand, is probably some Docker bug.

all best
[10 Dec 2019 20:50] Daniel G.
We definitely saw a memory leak with this release and needed to pin back to a previous package version. This was on Ubuntu 16.04, using a localhost MySQL server and client libs to support Kamailio SIP Server instances.
[11 Dec 2019 14:41] Daniel G.
In our case it was the client lib that exhibited the memory leak; restarting the client process (kamailio) resolved the issue. Left unchecked, it would crash OOM, and core dumps pointed to libmysqlclient.so.20. Increased load (queries) accelerated the memory leak. Downgrading to 5.7.11 (the only available downgrade) was the final resolution. Also, I believe the leak was not present in 5.7.26. This was on 4 different AWS instances running Ubuntu 16.04 LTS, each accessing an independent localhost MySQL DB.
[13 Dec 2019 20:50] Daniel Nichter
Hi again Bogdan,

I've isolated this to the information_schema. Previously, I mentioned that memory usage continued to grow even after I stopped my test. This seemed really weird, so I thought more about it and realized: my test instance was not fully isolated. We have monitoring running every 30s. When I stopped the monitor, the memory usage growth stopped, too. So I looked into what the monitoring was doing and I narrowed it down to information_schema queries.

There are two ways to reproduce using a standard 5.7.28 Community Server build (running on bare metal--no Docker):

Method 1)
1. Start mysqld
2. Start monitor.sh (which I'll upload)
3. Start logging RSS (which I'll also upload)
4. Warm up MySQL with a sequence of clients and QPS as originally described

I did a 2h test and RSS grew from 175 MB to 1.6 GB (I'll upload 2h-rss-mysqld-direct.csv). This was a completely isolated mysqld process, so nothing was running but the monitoring queries against information_schema.

Method 2)
1. Start mysqld
2. Start logging RSS (which I'll also upload)
3. Warm up MySQL with a sequence of clients and QPS as originally described
4. When step 3 stops, wait 15-20 minutes for RSS to stabilize, then...
5. Exec "SHOW GLOBAL STATUS" every 30s
6. Watch RSS grow.
7. Stop step 5 and wait about 15 min for RSS to stabilize again.

For method 2, see method-2-rss-graph.png that I'll upload. As far as I can tell from the source code, SHOW GLOBAL STATUS is implemented as an information_schema.global_status query under the hood, which explains why it causes memory usage to increase, too.

I started step 5 about 1.5 hours later and it caused RSS to increase even more until I stopped it again. (Not shown in the graph uploaded.)
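
Step 5 is nothing more than a loop like the sketch below (credentials are placeholders); per the note above, the commented-out information_schema form should be equivalent in effect (on 5.7, information_schema.global_status requires show_compatibility_56=ON):

  # Sketch of method 2, step 5: poll status every 30 seconds.
  while true; do
    mysql -h 127.0.0.1 -u root -psecret -e "SHOW GLOBAL STATUS" > /dev/null
    # equivalently:
    # mysql -h 127.0.0.1 -u root -psecret -e "SELECT * FROM information_schema.global_status" > /dev/null
    sleep 30
  done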

--

I definitely think there's a memory leak here. Especially in my 2-hour method 1 run, there's no reason MySQL should have allocated 1.6 GB. And it's an isolated mysqld instance on bare metal, so it's definitely not a Docker issue.

Hope this helps reproduce it on your end!
[13 Dec 2019 20:54] Daniel Nichter
monitor.sh

Attachment: monitor.sh (text/x-sh), 333 bytes.
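
(The attachment isn't reproduced inline; a minimal sketch of a monitor that queries information_schema every 30s, as described above -- the real monitor.sh may differ in which tables it queries and in its credentials:)

  # Sketch only -- the real monitor.sh is the 333-byte attachment and may differ.
  # Query information_schema every 30 seconds; credentials are placeholders.
  while true; do
    mysql -h 127.0.0.1 -u root -psecret -N -e "
      SELECT COUNT(*) FROM information_schema.processlist;
      SELECT COUNT(*) FROM information_schema.tables;
    " > /dev/null
    sleep 30
  done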

[13 Dec 2019 20:55] Daniel Nichter
Log changes to RSS of given PID

Attachment: record-rss.sh (text/x-sh), 361 bytes.
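
(The attachment isn't reproduced inline; a minimal equivalent that logs the RSS of a given PID, in KB from ps, whenever it changes might look like this -- the real record-rss.sh may differ:)

  #!/bin/sh
  # Sketch only -- the real record-rss.sh is the attachment and may differ.
  # Usage: record-rss.sh <pid>; prints "epoch-seconds,rss-KB" whenever RSS changes.
  pid=$1
  last=""
  while true; do
    rss=$(ps -o rss= -p "$pid")
    if [ "$rss" != "$last" ]; then
      echo "$(date +%s),$rss"
      last=$rss
    fi
    sleep 5
  done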

[13 Dec 2019 20:56] Daniel Nichter
2h output of record-rss.sh from running monitor.sh

Attachment: 2h-rss-mysqld-direct.csv (text/csv), 14.83 KiB.

[13 Dec 2019 20:58] Daniel Nichter
Graph of mysqld RSS from running method 2

Attachment: method-2-rss-graph.png (image/png, text), 101.26 KiB.

[13 Dec 2019 23:49] Bogdan Kecman
Hi,

Thanks for the test case, I'll retest first thing Monday morning.

No replication in this one? Single server required for reproduction?
[15 Dec 2019 16:29] Daniel Nichter
Great, thanks! -- Yes, just a standalone server, no replication.
[2 Jan 2020 19:29] Daniel Nichter
Any success reproducing this?
[16 Jan 2020 0:54] Bogdan Kecman
Hi,

Thanks for the test case, verified!
[16 Jan 2020 15:25] Daniel Nichter
Excellent, thank you Bogdan!

I think fixing this bug will be a big win for MySQL because information_schema queries and SHOW GLOBAL STATUS are ubiquitous.