Description:
I was running sysbench benchmarks with the 8.0 client library and
noticed that sysbench takes an unusually long time to "warm up":
[ 1s ] thds: 1024 tps: 76.28 qps: 2262.99 (r/w/o: 2026.93/0.00/236.06) lat (ms,95%): 8638.96 err/s: 0.00 reconn/s: 0.00
[ 13s ] thds: 1024 tps: 15443.84 qps: 245703.40 (r/w/o: 214793.65/0.00/30909.74) lat (ms,95%): 1032.01 err/s: 0.00 reconn/s: 0.00
[ 14s ] thds: 1024 tps: 10660.46 qps: 171515.46 (r/w/o: 150188.53/0.00/21326.93) lat (ms,95%): 337.94 err/s: 0.00 reconn/s: 0.00
[ 15s ] thds: 1024 tps: 6271.36 qps: 99505.91 (r/w/o: 86959.18/0.00/12546.73) lat (ms,95%): 484.44 err/s: 0.00 reconn/s: 0.00
[ 16s ] thds: 1024 tps: 20259.38 qps: 320745.02 (r/w/o: 280335.26/0.00/40409.76) lat (ms,95%): 240.02 err/s: 0.00 reconn/s: 0.00
[ 17s ] thds: 1024 tps: 34725.13 qps: 555273.09 (r/w/o: 485875.83/0.00/69397.26) lat (ms,95%): 32.53 err/s: 0.00 reconn/s: 0.00
[ 18s ] thds: 1024 tps: 35106.95 qps: 561789.27 (r/w/o: 491543.36/0.00/70245.91) lat (ms,95%): 31.94 err/s: 0.00 reconn/s: 0.00
[ 19s ] thds: 1024 tps: 35297.31 qps: 564603.88 (r/w/o: 494006.27/0.00/70597.61) lat (ms,95%): 31.94 err/s: 0.00 reconn/s: 0.00
[ 20s ] thds: 1024 tps: 35678.97 qps: 570236.52 (r/w/o: 498881.58/0.00/71354.94) lat (ms,95%): 31.94 err/s: 0.00 reconn/s: 0.00
Note how during seconds 1-13 the machine was too busy to even print
stats, which clearly indicates a scalability issue. I've never seen that
behavior with the 5.7 client libraries, even in high-concurrency
benchmarks on machines with many cores, which was the case here. So I
started digging, and this is what I found.
During the initial stage sysbench was contending on mprotect() system calls
from malloc():
# pt-pmp -p $(pidof sysbench)
397 mprotect,grow_heap,sysmalloc,_int_malloc,__GI___libc_malloc,libmysqlclient::??(libmysqlclient.so.21),libmysqlclient::??(libmysqlclient.so.21),libmysqlclient::??(libmysqlclient.so.21),libmysqlclient::??(libmysqlclient.so.21),mysql_stmt_store_result(libmysqlclient.so.21),::??,db_execute,::??,::??,::??,::??,lua_pcall,::??,::??,start_thread,thread_start
397 __memcpy_falkor,libmysqlclient::??(libmysqlclient.so.21),mysql_stmt_store_result(libmysqlclient.so.21),::??,db_execute,::??,::??,::??,::??,lua_pcall,::??,::??,start_thread,thread_start
112 sysmalloc,_int_malloc,__GI___libc_malloc,libmysqlclient::??(libmysqlclient.so.21),libmysqlclient::??(libmysqlclient.so.21),libmysqlclient::??(libmysqlclient.so.21),libmysqlclient::??(libmysqlclient.so.21),mysql_stmt_store_result(libmysqlclient.so.21),::??,db_execute,::??,::??,::??,::??,lua_pcall,::??,::??,start_thread,thread_start
66 libmysqlclient::??(libmysqlclient.so.21),mysql_stmt_store_result(libmysqlclient.so.21),::??,db_execute,::??,::??,::??,::??,lua_pcall,::??,::??,start_thread,thread_start
12 __memcpy_falkor,libmysqlclient::??(libmysqlclient.so.21),mysql_stmt_store_result(libmysqlclient.so.21),::??,db_execute,::??,::??
8 mprotect,::??,::??,::??,::??,::??,::??,::??,::??,lua_pcall,::??,::??,start_thread,thread_start
7 sysmalloc,_int_malloc,__GI___libc_malloc,libmysqlclient::??(libmysqlclient.so.21),libmysqlclient::??(libmysqlclient.so.21),libmysqlclient::??(libmysqlclient.so.21),libmysqlclient::??(libmysqlclient.so.21),mysql_stmt_store_result(libmysqlclient.so.21),::??,db_execute,::??,::??
7 mprotect,grow_heap,sysmalloc,_int_malloc,__GI___libc_malloc,libmysqlclient::??(libmysqlclient.so.21),libmysqlclient::??(libmysqlclient.so.21),libmysqlclient::??(libmysqlclient.so.21),libmysqlclient::??(libmysqlclient.so.21),mysql_stmt_store_result(libmysqlclient.so.21),::??,db_execute,::??,::??
4 mprotect,::??,::??,::??,::??,::??,::??,lua_pcall,::??,::??,start_thread,thread_start
4 ::??,::??,::??,::??,::??,::??,lua_pcall,::??,::??,start_thread,thread_start
3 libmysqlclient::??(libmysqlclient.so.21),mysql_stmt_store_result(libmysqlclient.so.21),::??,db_execute,::??,::??
1 sb_histogram_update,sb_event_stop,::??,::??,::??,::??,lua_pcall,::??,::??,start_thread,thread_start
1 sb_histogram_get_pct_intermediate,_start
1 __pthread_clockjoin_ex,sb_thread_join_workers,main
1 ::??,::??,::??,::??,::??,::??,::??,::??,::??,lua_pcall,::??,::??,start_thread,thread_start
1 ::??,::??,::??,::??,::??,::??,::??,::??,::??,::??,::??,::??,lua_pcall,::??,::??,start_thread,thread_start
1 ::??,::??,::??,::??,::??,::??,::??,::??,::??,::??,::??,::??,::??,lua_pcall,::??,::??,start_thread,thread_start
1 ::??,::??,::??,::??,::??,::??,::??,::??,::??,::??,::??,::??,::??,::??,::??,lua_pcall,::??,::??,start_thread,thread_start
1 __lll_lock_wait_private,_int_free,libmysqlclient::??(libmysqlclient.so.21),libmysqlclient::??(libmysqlclient.so.21),libmysqlclient::??(libmysqlclient.so.21),libmysqlclient::??(libmysqlclient.so.21),libmysqlclient::??(libmysqlclient.so.21),mysql_stmt_execute(libmysqlclient.so.21),::??,db_execute,::??,::??,::??,::??,lua_pcall,::??,::??,start_thread,thread_start
1 libmysqlclient::??(libmysqlclient.so.21),libmysqlclient::??(libmysqlclient.so.21),libmysqlclient::??(libmysqlclient.so.21),libmysqlclient::??(libmysqlclient.so.21),libmysqlclient::??(libmysqlclient.so.21),mysql_stmt_store_result(libmysqlclient.so.21),::??,db_execute,::??,::??,::??,::??,lua_pcall,::??,::??,start_thread,thread_start
I can understand contention in this case -- it is likely related to an
old scalability issue in the Linux kernel where concurrent mmap() /
mprotect() calls
from multiple threads in a single process are serialized (see
https://lkml.org/lkml/2013/1/2/174).
The weird part was that I've never seen that with 5.7 client libs.
It turns out the allocation logic has been changed in 8.0. In 5.7 some
memory is preallocated on connection creation on connection's MEM_ROOT,
see the pre_alloc_size parameter in init_alloc_root(). So sysbench would
preallocate memory on connection creation before the actual benchmark is
started.
In 8.0 that parameter to init_alloc_root() is ignored. So allocation
happens during the initial stage of the benchmark, while malloc() is
creating per-thread caches, causing the mmap()/mprotect()
bottleneck. The commit that introduced that change has the following to
say:
---
Author: Steinar H. Gunderson <steinar.gunderson@oracle.com>
AuthorDate: Thu Sep 28 12:57:22 2017 +0200
Bug #26940369: REWRITE THE MEM_ROOT
Streamline and optimize the MEM_ROOT.
...
- Preallocation was hardly ever used and caused a fair amount of complexity,
so it has been removed.
...
sysbench tests show somewhere between 1% and 6% qps increase.
---
I can believe the change might be beneficial for long-running benchmarks
where the initial drop is averaged out, but it certainly hinders
short-running benchmarks by killing client scalability during the
warm-up stage, which is how the problem was detected.
A lot of people probably get skewed benchmark numbers in their
high-concurrency 8.0 benchmarks because of that change, and it's not
clear how to resolve it without restoring the previous behavior, at
least for the client library.
Details about the system: ARM, 64 cores, Ubuntu 20.04. Replacing the
glibc malloc() with jemalloc improves the situation a little, but not by
much -- any allocator still has to build up its cache of allocations
during the initial stage by requesting memory from the kernel via mmap()
and friends.
Details about the benchmark: sysbench OLTP RO @ 1024 threads.
I realize the chances for this to be fixed are pretty thin. But I wanted
to report it anyway, so at least it is a known problem.
How to repeat:
See above.
Suggested fix:
Restore the preallocation behavior in MEM_ROOT.