Description:
I was running sysbench benchmarks with the 8.0 client library and
noticed that sysbench takes an unusually long time to "warm up":
[ 1s ] thds: 1024 tps: 76.28 qps: 2262.99 (r/w/o: 2026.93/0.00/236.06) lat (ms,95%): 8638.96 err/s: 0.00 reconn/s: 0.00
[ 13s ] thds: 1024 tps: 15443.84 qps: 245703.40 (r/w/o: 214793.65/0.00/30909.74) lat (ms,95%): 1032.01 err/s: 0.00 reconn/s: 0.00
[ 14s ] thds: 1024 tps: 10660.46 qps: 171515.46 (r/w/o: 150188.53/0.00/21326.93) lat (ms,95%): 337.94 err/s: 0.00 reconn/s: 0.00
[ 15s ] thds: 1024 tps: 6271.36 qps: 99505.91 (r/w/o: 86959.18/0.00/12546.73) lat (ms,95%): 484.44 err/s: 0.00 reconn/s: 0.00
[ 16s ] thds: 1024 tps: 20259.38 qps: 320745.02 (r/w/o: 280335.26/0.00/40409.76) lat (ms,95%): 240.02 err/s: 0.00 reconn/s: 0.00
[ 17s ] thds: 1024 tps: 34725.13 qps: 555273.09 (r/w/o: 485875.83/0.00/69397.26) lat (ms,95%): 32.53 err/s: 0.00 reconn/s: 0.00
[ 18s ] thds: 1024 tps: 35106.95 qps: 561789.27 (r/w/o: 491543.36/0.00/70245.91) lat (ms,95%): 31.94 err/s: 0.00 reconn/s: 0.00
[ 19s ] thds: 1024 tps: 35297.31 qps: 564603.88 (r/w/o: 494006.27/0.00/70597.61) lat (ms,95%): 31.94 err/s: 0.00 reconn/s: 0.00
[ 20s ] thds: 1024 tps: 35678.97 qps: 570236.52 (r/w/o: 498881.58/0.00/71354.94) lat (ms,95%): 31.94 err/s: 0.00 reconn/s: 0.00
Note how between 1 and 13 seconds the machine was too busy to even print
stats, which clearly indicates a scalability issue. I have never seen that
behavior with the 5.7 client libraries, even in high-concurrency benchmarks
on machines with many cores, which was the case here. So I started
digging, and this is what I found.
During the initial stage sysbench was contending on mprotect() system calls
from malloc():
# pt-pmp -p $(pidof sysbench)
397 mprotect,grow_heap,sysmalloc,_int_malloc,__GI___libc_malloc,libmysqlclient::??(libmysqlclient.so.21),libmysqlclient::??(libmysqlclient.so.21),libmysqlclient::??(libmysqlclient.so.21),libmysqlclient::??(libmysqlclient.so.21),mysql_stmt_store_result(libmysqlclient.so.21),::??,db_execute,::??,::??,::??,::??,lua_pcall,::??,::??,start_thread,thread_start
397 __memcpy_falkor,libmysqlclient::??(libmysqlclient.so.21),mysql_stmt_store_result(libmysqlclient.so.21),::??,db_execute,::??,::??,::??,::??,lua_pcall,::??,::??,start_thread,thread_start
112 sysmalloc,_int_malloc,__GI___libc_malloc,libmysqlclient::??(libmysqlclient.so.21),libmysqlclient::??(libmysqlclient.so.21),libmysqlclient::??(libmysqlclient.so.21),libmysqlclient::??(libmysqlclient.so.21),mysql_stmt_store_result(libmysqlclient.so.21),::??,db_execute,::??,::??,::??,::??,lua_pcall,::??,::??,start_thread,thread_start
66 libmysqlclient::??(libmysqlclient.so.21),mysql_stmt_store_result(libmysqlclient.so.21),::??,db_execute,::??,::??,::??,::??,lua_pcall,::??,::??,start_thread,thread_start
12 __memcpy_falkor,libmysqlclient::??(libmysqlclient.so.21),mysql_stmt_store_result(libmysqlclient.so.21),::??,db_execute,::??,::??
8 mprotect,::??,::??,::??,::??,::??,::??,::??,::??,lua_pcall,::??,::??,start_thread,thread_start
7 sysmalloc,_int_malloc,__GI___libc_malloc,libmysqlclient::??(libmysqlclient.so.21),libmysqlclient::??(libmysqlclient.so.21),libmysqlclient::??(libmysqlclient.so.21),libmysqlclient::??(libmysqlclient.so.21),mysql_stmt_store_result(libmysqlclient.so.21),::??,db_execute,::??,::??
7 mprotect,grow_heap,sysmalloc,_int_malloc,__GI___libc_malloc,libmysqlclient::??(libmysqlclient.so.21),libmysqlclient::??(libmysqlclient.so.21),libmysqlclient::??(libmysqlclient.so.21),libmysqlclient::??(libmysqlclient.so.21),mysql_stmt_store_result(libmysqlclient.so.21),::??,db_execute,::??,::??
4 mprotect,::??,::??,::??,::??,::??,::??,lua_pcall,::??,::??,start_thread,thread_start
4 ::??,::??,::??,::??,::??,::??,lua_pcall,::??,::??,start_thread,thread_start
3 libmysqlclient::??(libmysqlclient.so.21),mysql_stmt_store_result(libmysqlclient.so.21),::??,db_execute,::??,::??
1 sb_histogram_update,sb_event_stop,::??,::??,::??,::??,lua_pcall,::??,::??,start_thread,thread_start
1 sb_histogram_get_pct_intermediate,_start
1 __pthread_clockjoin_ex,sb_thread_join_workers,main
1 ::??,::??,::??,::??,::??,::??,::??,::??,::??,lua_pcall,::??,::??,start_thread,thread_start
1 ::??,::??,::??,::??,::??,::??,::??,::??,::??,::??,::??,::??,lua_pcall,::??,::??,start_thread,thread_start
1 ::??,::??,::??,::??,::??,::??,::??,::??,::??,::??,::??,::??,::??,lua_pcall,::??,::??,start_thread,thread_start
1 ::??,::??,::??,::??,::??,::??,::??,::??,::??,::??,::??,::??,::??,::??,::??,lua_pcall,::??,::??,start_thread,thread_start
1 __lll_lock_wait_private,_int_free,libmysqlclient::??(libmysqlclient.so.21),libmysqlclient::??(libmysqlclient.so.21),libmysqlclient::??(libmysqlclient.so.21),libmysqlclient::??(libmysqlclient.so.21),libmysqlclient::??(libmysqlclient.so.21),mysql_stmt_execute(libmysqlclient.so.21),::??,db_execute,::??,::??,::??,::??,lua_pcall,::??,::??,start_thread,thread_start
1 libmysqlclient::??(libmysqlclient.so.21),libmysqlclient::??(libmysqlclient.so.21),libmysqlclient::??(libmysqlclient.so.21),libmysqlclient::??(libmysqlclient.so.21),libmysqlclient::??(libmysqlclient.so.21),mysql_stmt_store_result(libmysqlclient.so.21),::??,db_execute,::??,::??,::??,::??,lua_pcall,::??,::??,start_thread,thread_start
I can understand the contention in this case -- it is likely related to
an old scalability issue in the Linux kernel where concurrent mmap() /
mprotect() calls from multiple threads in a single process are serialized
(see https://lkml.org/lkml/2013/1/2/174).
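To illustrate the kind of serialization involved, here is a minimal
stand-alone sketch (not part of sysbench or of the bug itself; the thread
count and mapping size are arbitrary). Each thread keeps asking the kernel
for fresh memory the way a cold malloc() does, and on a many-core machine
the threads end up serialized on the process-wide mmap lock:

#include <sys/mman.h>
#include <cstdio>
#include <thread>
#include <vector>

// Each iteration maps and unmaps a fresh 1 MiB anonymous region, forcing
// mmap()/munmap() system calls that take the per-process lock in the
// kernel, so the threads make little parallel progress.
static void churn(size_t iters) {
  for (size_t i = 0; i < iters; i++) {
    void *p = mmap(nullptr, 1 << 20, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p != MAP_FAILED) munmap(p, 1 << 20);
  }
}

int main() {
  std::vector<std::thread> threads;
  for (int t = 0; t < 64; t++) threads.emplace_back(churn, 10000);
  for (auto &th : threads) th.join();
  std::puts("done");
  return 0;
}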
The weird part was that I had never seen this with the 5.7 client
libraries. It turns out the allocation logic was changed in 8.0. In 5.7
some memory is preallocated on the connection's MEM_ROOT when the
connection is created; see the pre_alloc_size parameter of
init_alloc_root(). So sysbench would preallocate memory at connection
creation time, before the actual benchmark starts.
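For reference, the 5.7-style call looks roughly like this (a sketch
against the 5.7 mysys API; the sizes are made up for illustration and are
not the actual values used by libmysqlclient):

#include "my_sys.h"  /* 5.7 mysys header declaring the MEM_ROOT helpers */

void warm_up_example() {
  MEM_ROOT root;
  /* The fourth argument asks for pre_alloc_size bytes to be grabbed from
     malloc() right away, at init time: */
  init_alloc_root(PSI_NOT_INSTRUMENTED, &root, 4096 /* block_size */,
                  4096 /* pre_alloc_size */);

  /* Subsequent alloc_root() calls are served from the preallocated block
     and do not hit malloc() at all until it is exhausted: */
  void *buf = alloc_root(&root, 1024);
  (void)buf;
  free_root(&root, MYF(0));
}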
In 8.0 that parameter to init_alloc_root() is ignored. So allocation
happens during the initial stage of the benchmark, while malloc() is
creating per-thread caches, causing the mmap()/mprotect()
bottleneck. The commit that introduced that change has the following to
say:
---
Author: Steinar H. Gunderson <steinar.gunderson@oracle.com>
AuthorDate: Thu Sep 28 12:57:22 2017 +0200
Bug #26940369: REWRITE THE MEM_ROOT
Streamline and optimize the MEM_ROOT.
...
- Preallocation was hardly ever used and caused a fair amount of complexity,
so it has been removed.
...
sysbench tests show somewhere between 1% and 6% qps increase.
---
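For comparison, the 8.0 compatibility wrapper is, as far as I can tell,
roughly the following (paraphrased, not a verbatim copy of my_alloc.h):

/* 8.0-style: init_alloc_root() is kept only for source compatibility;
   the pre_alloc_size argument is accepted but dropped, so nothing is
   allocated until the first Alloc()/alloc_root() call on the MEM_ROOT. */
inline void init_alloc_root(PSI_memory_key key, MEM_ROOT *mem_root,
                            size_t block_size, size_t /* pre_alloc_size */) {
  ::new (mem_root) MEM_ROOT(key, block_size);  /* no upfront block */
}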
I can believe the change might be beneficial for long-running benchmarks,
where the drop at the initial stage is averaged out, but it certainly
hurts short-running benchmarks by killing client scalability during that
initial stage, which is how the problem was detected.
A lot of people probably get skewed benchmark numbers in their
high-concurrency 8.0 benchmarks because of that change, and it's not
clear how to resolve it without restoring the previous behavior, at
least for the client library.
Details about the system: ARM, 64 cores, Ubuntu 20.04. Replacing the
glibc malloc() with jemalloc improves the situation a little, but not by
much -- any allocator still has to build up its caches during the initial
stage by requesting memory from the kernel via mmap() and friends.
Details about the benchmark: sysbench OLTP RO @ 1024 threads.
I realize the chances for this to be fixed are pretty thin. But I wanted
to report it anyway, so at least it is a known problem.
How to repeat:
See above.
Suggested fix:
Restore the preallocation behavior in MEM_ROOT.
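Purely as an illustration of what that could mean (a hypothetical sketch,
not a patch; whether ClearForReuse() keeps the block around like this is
an assumption on my side):

/* Hypothetical: honor pre_alloc_size again so the malloc() traffic
   happens at connection creation rather than during the first queries. */
inline void init_alloc_root(PSI_memory_key key, MEM_ROOT *mem_root,
                            size_t block_size, size_t pre_alloc_size) {
  ::new (mem_root) MEM_ROOT(key, block_size);
  if (pre_alloc_size > 0) {
    /* Touch the MEM_ROOT once so it grabs an initial block of at least
       pre_alloc_size bytes, then mark that block as reusable. */
    (void)mem_root->Alloc(pre_alloc_size);
    mem_root->ClearForReuse();
  }
}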