Bug #99813 Restore preallocation logic in 8.0 client library
Submitted: 8 Jun 2020 20:50 Modified: 9 Jun 2020 14:36
Reporter: Alexey Kopytov Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: C API (client library) Severity:S5 (Performance)
Version:8.0 OS:Any
Assigned to: CPU Architecture:Any

[8 Jun 2020 20:50] Alexey Kopytov
Description:
I was running sysbench benchmarks with the 8.0 client library and
noticed that sysbench takes an unusually long time to "warm up":

[ 1s ] thds: 1024 tps: 76.28 qps: 2262.99 (r/w/o: 2026.93/0.00/236.06) lat (ms,95%): 8638.96 err/s: 0.00 reconn/s: 0.00
[ 13s ] thds: 1024 tps: 15443.84 qps: 245703.40 (r/w/o: 214793.65/0.00/30909.74) lat (ms,95%): 1032.01 err/s: 0.00 reconn/s: 0.00
[ 14s ] thds: 1024 tps: 10660.46 qps: 171515.46 (r/w/o: 150188.53/0.00/21326.93) lat (ms,95%): 337.94 err/s: 0.00 reconn/s: 0.00
[ 15s ] thds: 1024 tps: 6271.36 qps: 99505.91 (r/w/o: 86959.18/0.00/12546.73) lat (ms,95%): 484.44 err/s: 0.00 reconn/s: 0.00
[ 16s ] thds: 1024 tps: 20259.38 qps: 320745.02 (r/w/o: 280335.26/0.00/40409.76) lat (ms,95%): 240.02 err/s: 0.00 reconn/s: 0.00
[ 17s ] thds: 1024 tps: 34725.13 qps: 555273.09 (r/w/o: 485875.83/0.00/69397.26) lat (ms,95%): 32.53 err/s: 0.00 reconn/s: 0.00
[ 18s ] thds: 1024 tps: 35106.95 qps: 561789.27 (r/w/o: 491543.36/0.00/70245.91) lat (ms,95%): 31.94 err/s: 0.00 reconn/s: 0.00
[ 19s ] thds: 1024 tps: 35297.31 qps: 564603.88 (r/w/o: 494006.27/0.00/70597.61) lat (ms,95%): 31.94 err/s: 0.00 reconn/s: 0.00
[ 20s ] thds: 1024 tps: 35678.97 qps: 570236.52 (r/w/o: 498881.58/0.00/71354.94) lat (ms,95%): 31.94 err/s: 0.00 reconn/s: 0.00

Note how during 1-13 seconds the machine was too busy to even print
stats, which clearly indicates a scalability issue. I've never seen that
behavior with 5.7 client libraries, even in high-concurrency benchmarks
on machines with many cares, which was the case here. So I started
digging, and this is what I found.

During the initial stage sysbench was contending on mprotect() system calls
from malloc():

# pt-pmp -p $(pidof sysbench)
    397 mprotect,grow_heap,sysmalloc,_int_malloc,__GI___libc_malloc,libmysqlclient::??(libmysqlclient.so.21),libmysqlclient::??(libmysqlclient.so.21),libmysqlclient::??(libmysqlclient.so.21),libmysqlclient::??(libmysqlclient.so.21),mysql_stmt_store_result(libmysqlclient.so.21),::??,db_execute,::??,::??,::??,::??,lua_pcall,::??,::??,start_thread,thread_start
    397 __memcpy_falkor,libmysqlclient::??(libmysqlclient.so.21),mysql_stmt_store_result(libmysqlclient.so.21),::??,db_execute,::??,::??,::??,::??,lua_pcall,::??,::??,start_thread,thread_start
    112 sysmalloc,_int_malloc,__GI___libc_malloc,libmysqlclient::??(libmysqlclient.so.21),libmysqlclient::??(libmysqlclient.so.21),libmysqlclient::??(libmysqlclient.so.21),libmysqlclient::??(libmysqlclient.so.21),mysql_stmt_store_result(libmysqlclient.so.21),::??,db_execute,::??,::??,::??,::??,lua_pcall,::??,::??,start_thread,thread_start
     66 libmysqlclient::??(libmysqlclient.so.21),mysql_stmt_store_result(libmysqlclient.so.21),::??,db_execute,::??,::??,::??,::??,lua_pcall,::??,::??,start_thread,thread_start
     12 __memcpy_falkor,libmysqlclient::??(libmysqlclient.so.21),mysql_stmt_store_result(libmysqlclient.so.21),::??,db_execute,::??,::??
      8 mprotect,::??,::??,::??,::??,::??,::??,::??,::??,lua_pcall,::??,::??,start_thread,thread_start
      7 sysmalloc,_int_malloc,__GI___libc_malloc,libmysqlclient::??(libmysqlclient.so.21),libmysqlclient::??(libmysqlclient.so.21),libmysqlclient::??(libmysqlclient.so.21),libmysqlclient::??(libmysqlclient.so.21),mysql_stmt_store_result(libmysqlclient.so.21),::??,db_execute,::??,::??
      7 mprotect,grow_heap,sysmalloc,_int_malloc,__GI___libc_malloc,libmysqlclient::??(libmysqlclient.so.21),libmysqlclient::??(libmysqlclient.so.21),libmysqlclient::??(libmysqlclient.so.21),libmysqlclient::??(libmysqlclient.so.21),mysql_stmt_store_result(libmysqlclient.so.21),::??,db_execute,::??,::??
      4 mprotect,::??,::??,::??,::??,::??,::??,lua_pcall,::??,::??,start_thread,thread_start
      4 ::??,::??,::??,::??,::??,::??,lua_pcall,::??,::??,start_thread,thread_start
      3 libmysqlclient::??(libmysqlclient.so.21),mysql_stmt_store_result(libmysqlclient.so.21),::??,db_execute,::??,::??
      1 sb_histogram_update,sb_event_stop,::??,::??,::??,::??,lua_pcall,::??,::??,start_thread,thread_start
      1 sb_histogram_get_pct_intermediate,_start
      1 __pthread_clockjoin_ex,sb_thread_join_workers,main
      1 ::??,::??,::??,::??,::??,::??,::??,::??,::??,lua_pcall,::??,::??,start_thread,thread_start
      1 ::??,::??,::??,::??,::??,::??,::??,::??,::??,::??,::??,::??,lua_pcall,::??,::??,start_thread,thread_start
      1 ::??,::??,::??,::??,::??,::??,::??,::??,::??,::??,::??,::??,::??,lua_pcall,::??,::??,start_thread,thread_start
      1 ::??,::??,::??,::??,::??,::??,::??,::??,::??,::??,::??,::??,::??,::??,::??,lua_pcall,::??,::??,start_thread,thread_start
      1 __lll_lock_wait_private,_int_free,libmysqlclient::??(libmysqlclient.so.21),libmysqlclient::??(libmysqlclient.so.21),libmysqlclient::??(libmysqlclient.so.21),libmysqlclient::??(libmysqlclient.so.21),libmysqlclient::??(libmysqlclient.so.21),mysql_stmt_execute(libmysqlclient.so.21),::??,db_execute,::??,::??,::??,::??,lua_pcall,::??,::??,start_thread,thread_start
      1 libmysqlclient::??(libmysqlclient.so.21),libmysqlclient::??(libmysqlclient.so.21),libmysqlclient::??(libmysqlclient.so.21),libmysqlclient::??(libmysqlclient.so.21),libmysqlclient::??(libmysqlclient.so.21),mysql_stmt_store_result(libmysqlclient.so.21),::??,db_execute,::??,::??,::??,::??,lua_pcall,::??,::??,start_thread,thread_start

I can understand contention in this case -- it is likely related to an
old scalability issue in the Linux kernel where concurrent mmap() /
mprotect() calls
from multiple threads in a single process are serialized (see
https://lkml.org/lkml/2013/1/2/174).

The weird part was that I've never seen that with 5.7 client libs.

It turns out the allocation logic has been changed in 8.0. In 5.7 some
memory is preallocated on connection creation on connection's MEM_ROOT,
see the pre_alloc_size parameter in init_alloc_root(). So sysbench would
preallocate memory on connection creation before the actual benchmark is
started.

In 8.0 that parameter to init_alloc_root() is ignored. So allocation
happens during the initial stage of the benchmark, while malloc() is
creating per-thread caches, causing the mmap()/mprotect()
bottleneck. The commit that introduced that change has the following to
say:

---
Author:     Steinar H. Gunderson <steinar.gunderson@oracle.com>
AuthorDate: Thu Sep 28 12:57:22 2017 +0200

Bug #26940369: REWRITE THE MEM_ROOT

Streamline and optimize the MEM_ROOT.
...
 - Preallocation was hardly ever used and caused a fair amount of complexity,
   so it has been removed.
...
sysbench tests show somewhere between 1% and 6% qps increase.
---

I can believe the change might be beneficial for long-running benchmarks
where that drop on the initial stage is averaged out, but it certainly
hinders short-running benchmarks by killing client scalability on the
initial stage, which is how the problem was detected.

A lot of people probably get skewed benchmark numbers in their
high-concurrency 8.0 benchmarks because of that change, and it's not
clear how to resolve it without restoring the previous behavior, at
least for the client library.

Details about the system: ARM, 64 cores, Ubuntu 20.04. Replacing the
glibc malloc() with jemalloc improves the situation a little, but not
too much -- any allocator has to create a cache of allocations on the
initial stage by requesting them from the kernel via mmap() and friends.

Details about the benchmark: sysbench OLTP RO @ 1024 threads.

I realize the chances for this to be fixed are pretty thin. But I wanted
to report it anyway, so at least it is a known problem.

How to repeat:
See above.

Suggested fix:
Restore the preallocation behavior in MEM_ROOT.
[9 Jun 2020 13:25] MySQL Verification Team
Hi Kaamos,

Thank you very much for your bug report.

We have taken a look at the patch and it seems that you are correct.

However, we would very much like to repeat your results. We need feedback from you on:

* exact set of sysbench prepare/run options that you use 

* results from some generally available profiler and how did you run it exactly on the sysbench.

Many thanks in advance.
[9 Jun 2020 14:33] Alexey Kopytov
Hi MySQL Verification Team!!!11

Please find the requested info below:

- the command to prepare was:

sysbench oltp_read_only prepare --threads=10 --tables=10 --table-size=1000000

- the command to run the benchmark was:

sysbench oltp_read_only run --threads=1024 --tables=10 --table-size=1000000 --rand-type=uniform --report-interval=1 --time=60 --histogram

- profiling results with pt-pmp
  (https://www.percona.com/doc/percona-toolkit/LATEST/pt-pmp.html) can
  be found in my original report

As I mentioned previously, this was on an ARM machine, but I don't think
it is that important. I believe what is important to reproduce here is a
high number of CPU cores.

Another important detail: the problem only occurs with prepared
statements. As soon as I disable them by adding --db-ps-mode=disable to
the sysbench command line, the problem disappears. I didn't have time to
analyze it further.
[9 Jun 2020 14:36] MySQL Verification Team
Hi Kaamos,

I have been able to repeat the behaviour with "Instruments" from XCode.

Verified as reported.