Bug #113564 write-only mode performance decreases by 15% in heavy workload
Submitted: 5 Jan 10:49    Modified: 11 Jun 5:32
Reporter: holmer deng     Email Updates:
Status: Verified          Impact on me: None
Category: MySQL Server: Replication    Severity: S5 (Performance)
Version: 8.0              OS: Any (>= glibc 2.25)
Assigned to:              CPU Architecture: Any
Tags: glibc pthread_mutex_cond_lock pthread_cond_wait

[5 Jan 10:49] holmer deng
Description:
Under heavy workloads, write-only mode performance decreases by at least 15%.

sysbench test: write-only (256 threads)
OS             glibc   MySQL    TPS (avg)
ubuntu 20.04   2.31    8.0.25   10661
ubuntu 20.04   2.31    8.0.35   10971
ubuntu 20.04   2.31    5.7.22   15213
ubuntu 16.04   2.23    8.0.25   14049
centos 6.10    2.20    8.0.25   12334
centos 6.10    2.20    5.7.22   13982

perf profile (OS: ubuntu 20.04 (glibc 2.31), mysqld: 8.0.35):
  25.30%  libpthread-2.31.so   [.] __pthread_mutex_cond_lock
   2.97%  libpthread-2.31.so   [.] __pthread_mutex_lock
   1.80%  mysqld               [.] log_writer
   1.17%  [kernel]             [k] native_queued_spin_lock_slowpath
   1.17%  mysqld               [.] MYSQLparse
   1.11%  libc-2.31.so         [.] malloc
   0.98%  libc-2.31.so         [.] __memmove_avx_unaligned_erms
   0.91%  mysqld               [.] mutex_enter_inline<PolicyMutex<TTASEventMutex<GenericPolicy> > >
   0.81%  [kernel]             [k] _raw_spin_lock
   0.80%  mysqld               [.] log_advance_ready_for_write_lsn
   0.77%  [kernel]             [k] psi_task_change
   0.50%  libpthread-2.31.so   [.] pthread_cond_wait@@GLIBC_2.3.2
   0.50%  [kernel]             [k] _raw_spin_lock_irqsave
   0.45%  libpthread-2.31.so   [.] __pthread_mutex_unlock
   0.42%  mysqld               [.] ut_delay
   0.41%  [kernel]             [k] update_load_avg
   0.38%  [kernel]             [k] switch_fpu_return
   0.37%  mysqld               [.] Writeset_trx_dependency_tracker::get_dependency
   0.35%  mysqld               [.] Buf_fetch<Buf_fetch_normal>::single_page
   0.34%  [kernel]             [k] available_idle_cpu
   0.34%  libpthread-2.31.so   [.] __lll_lock_wait
   0.33%  mysqld               [.] rw_lock_s_lock
   0.33%  [kernel]             [k] __sched_text_start
   0.30%  mysqld               [.] buf_page_hash_get_low
   0.30%  libc-2.31.so         [.] _int_free
   0.30%  mysqld               [.] hardware::crc32_using_pclmul
   0.29%  mysqld               [.] PFS_buffer_scalable_container<PFS_metadata_lock, 1024, 1024, PFS_buffer_default_array<PFS_metadata_lock>, PFS_buffer_default_alloc
   0.29%  libc-2.31.so         [.] cfree@GLIBC_2.2.5
   0.28%  mysqld               [.] mtr_t::Command::add_dirty_blocks_to_flush_list
   0.28%  mysqld               [.] my_charpos_mb3
   0.28%  mysqld               [.] log_buffer_reserve
   0.27%  [kernel]             [k] get_futex_value_locked
   0.26%  mysqld               [.] my_lfind
   0.26%  libc-2.31.so         [.] malloc_consolidate
   0.25%  mysqld               [.] PFS_memory_safe_stat::count_alloc
   0.24%  [kernel]             [k] update_curr
   0.24%  libpthread-2.31.so   [.] __pthread_rwlock_rdlock
   0.24%  libc-2.31.so         [.] _int_malloc

We compared glibc versions 2.20, 2.23, and 2.31 and found differences in the implementation of __pthread_cond_wait, which is where __pthread_mutex_cond_lock is called.

I'm not sure whether this is a problem in glibc itself or whether MySQL 8.0 does not work well with newer glibc versions; I suspect the latter.
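For reference, a quick way to confirm which glibc build a host is running (a minimal sketch using standard glibc utilities):
```
$ ldd --version | head -n1
$ getconf GNU_LIBC_VERSION
```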

How to repeat:
Run the sysbench benchmark below on systems with different glibc versions.

sysbench prepare

sysbench --threads=256 --time=3600 --events=999999999 --test=oltp_write_only_version1.lua --table_size=100000000 --auto_inc=off --create_secondary=off --tables=10 --db-driver=mysql --mysql-db=sbtest --mysql-host=127.0.0.1 --mysql-port=3306 --mysql-user=mysql --mysql-password=mysql --report-interval=1 run
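A minimal sketch of the corresponding prepare step (the exact options used were not recorded here; this mirrors the run command above and is an assumption):
```
$ sysbench --threads=256 --test=oltp_write_only_version1.lua --table_size=100000000 \
    --auto_inc=off --create_secondary=off --tables=10 --db-driver=mysql \
    --mysql-db=sbtest --mysql-host=127.0.0.1 --mysql-port=3306 \
    --mysql-user=mysql --mysql-password=mysql prepare
```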
[8 Jan 8:52] Jakub Lopuszanski
Hi! Would it be possible for you to collect flame graphs?
I see you are using perf already.
I mean something like:
```
$ perf record -p $(pgrep mysqld) --call-graph=fp -o ${version}.perf.data -- sleep 10
$ perf script -i ${version}.perf.data | stackcollapse-perf.pl > ${version}.perf.folded
```
The stackcollapse-perf.pl part is optional, as it requires https://github.com/brendangregg/FlameGraph
but very helpful as it extracts just the minimal required information to be able to draw flame graph charts.
The *.perf.data files can be quite huge otherwise, but perhaps could be compressed instead.

One goal here would be to see not just the name of the function causing the slowdown, but the whole call stack that led to it being called, since these low-level functions are used in many places and contexts, and to fix a bottleneck we need to know in which module and scenario the change occurred.
```
$ flamegraph.pl <  ${version}.perf.folded >  ${version}.svg
```
Another thing we could use the *.perf.folded files for is to compare two versions of the code visually, to see which call stacks got slower or faster:
```
$ difffolded.pl -n {$version_a,$version_b}.perf.folded | flamegraph.pl > ${version_b}-${version_a}.diff.svg
$ difffolded.pl -n {$version_b,$version_a}.perf.folded | flamegraph.pl > ${version_a}-${version_b}.diff.svg
```
This will probably only make sense when $version_a and $version_b are similar codebases (so comparing 8.0.* to 5.7.* will probably not make much sense).
[8 Jan 11:38] MySQL Verification Team
Hi Mr. deng,

We are waiting on your feedback.

Please, when you do profiling, do check whether the slowdown does not come from glibc itself.

This is not a problem of glibc compatibility, since the POSIX standard has not changed between those versions. The POSIX standard for threads is expected to change later this year.

Hence, it could be the glibc implementation itself that is the problem.

Proper profiling, as described by our esteemed colleague, should provide the necessary info on this issue.
[9 Jan 10:52] MySQL Verification Team
Hi Mr. deng,

We have asked for much more feedback than what you have provided to us.

Please, re-read all our responses and all data that we require from you.

We shall wait until you provide us with full feedback.

Many thanks in advance.
[10 Jan 2:16] holmer deng
Hi, I have provided more flame graphs following your colleague's instructions,
    such as: 8.0.35_glibc2.31.svg  5.7.22_glibc2.31.svg  5.7.22_glibc2.31-8.0.35_glibc2.31.diff.svg  8.0.35_glibc2.31-5.7.22_glibc2.31.diff.svg

Sorry, I don't know what more data you need. Please list it in detail.
[10 Jan 14:32] Jakub Lopuszanski
Thank you for the flamegraphs!

Unfortunately, they show the caller of `__pthread_mutex_cond_lock()` as "[unknown]", so it is still not clear which part of the MySQL code is calling it.
All we see is all -> connection -> [unknown] -> __pthread_mutex_cond_lock.

But, I think this is sufficient to hint that the problem is not in InnoDB, because InnoDB code mostly lives in the:
connection -> start_thread -> pfs_spawn_thread -> handle_connection -> do_command 
"subtree" of the possible stacks.

Also, when InnoDB uses conditional variables, it does so via the `os_event` wrapper, which I don't see mentioned anywhere in the graphs.
This might either mean that the problem is not in InnoDB or that somehow the way perf collects stacks was wrong/incomplete.

Note that the Server layer (not just InnoDB) also uses `pthread_cond_wait`, for example in the native_cond_wait(..) function, which is used by safe_cond_wait and my_cond_wait, which in turn is used in inline_mysql_cond_wait, which in turn is used by mysql_cond_wait, which is used in many places in the Server layer.

Therefore it is not clear to me if the problem is in Server or InnoDB, and to figure that out, one would need a full correct callstack.
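(As a rough illustration of how many call sites the Server-layer wrapper has; a sketch, assuming a checked-out mysql-server source tree:)
```
$ grep -rn "mysql_cond_wait(" sql/ | wc -l
```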
[11 Jan 10:15] holmer deng
Rebuilt MySQL 8.0.35 with the flag -fno-omit-frame-pointer.

Attachment: 8.0.35_glibc2.31_with_dwarf.tar (application/x-tar, text), 2.10 MiB.
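(A minimal sketch of such a rebuild; the source directory, build type, and parallelism here are assumptions, not the exact invocation used:)
```
$ cmake ../mysql-server-8.0.35 \
    -DCMAKE_BUILD_TYPE=RelWithDebInfo \
    -DCMAKE_C_FLAGS="-fno-omit-frame-pointer" \
    -DCMAKE_CXX_FLAGS="-fno-omit-frame-pointer"
$ make -j"$(nproc)"
```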

[11 Jan 10:16] holmer deng
Rebuilt MySQL 8.0.35 without the CFLAGS (-fno-omit-frame-pointer).
[11 Jan 10:19] holmer deng
It still has "unknown" calls; I don't know if that affects your judgment.
[11 Jan 10:39] MySQL Verification Team
Hi Mr. deng,

Can you try building a debug version of our MySQL 8.0.35?

Or, even simpler, just download our binary package for your OS and use the debug binary that is included in all our binary distributions.
[11 Jan 15:26] Jakub Lopuszanski
What I find most puzzling is that when I compare:
* 8.0.35_glibc2.31.svg from your earlier tarball to
* 8.0.35_glibc2.31_with_dwarf.svg from your latest tarball
then they look completely different.
For example if you CTRL+F "dispatch_command", you'll see it took 49% in the former, but takes just 20% in the latter.
OTOH MYSQL_BIN_LOG::commit took 10% in the former, but 46% in the latter.

Is it the same mysql config (sysvars), machine, and workload?

At any rate, in the latter flame graph most of the __pthread_cond_wait time is spent in the Server layer, not the InnoDB layer, so I think this bug should be recategorized.
[15 Jan 12:19] holmer deng
__pthread_mutex_cond_lock with no "unknown" calls

Attachment: 8.0.35_glibc2.31_dwarf_new.svg (image/svg+xml, text), 794.65 KiB.

[15 Jan 12:23] holmer deng
What I find most puzzling is that when I compare:
* 8.0.35_glibc2.31.svg from your earlier tarball to
* 8.0.35_glibc2.31_with_dwarf.svg from your latest tarball
then they look completely different.
For example if you CTRL+F "dispatch_command", you'll see it took 49% in the former, but takes just 20% in the latter.
OTOH MYSQL_BIN_LOG::commit took 10% in the former, but 46% in the latter.
    * 8.0.35_glibc2.31.svg              -> call-graph=fp
    * 8.0.35_glibc2.31_with_dwarf.svg   -> call-graph=dwarf

Is it the same mysql config (sysvars), machine, and workload?
    Same mysql config, machine, and workload.
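(For reference, the two collection modes differ only in how perf unwinds stacks; a sketch of the two invocations this implies, with output file names as assumptions:)
```
$ perf record -p $(pgrep mysqld) --call-graph=fp    -o ${version}.fp.perf.data    -- sleep 10
$ perf record -p $(pgrep mysqld) --call-graph=dwarf -o ${version}.dwarf.perf.data -- sleep 10
```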
[22 Jan 13:08] Jakub Lopuszanski
Ok, so in 8.0.35_glibc2.31_dwarf_new.svg we see that the call to `__pthread_mutex_cond_lock` that takes the most time comes from `Commit_stage_manager::enroll_for`.

So this is not a bug in InnoDB.

What have you changed between:
8.0.35_glibc2.31_with_dwarf.svg
8.0.35_glibc2.31_dwarf_new.svg 
?
[22 Jan 14:48] Marcin Babij
A change named "New condvar implementation that provides stronger ordering guarantees." was pushed into glibc 2.25.
https://patchwork.ozlabs.org/project/glibc/patch/1424456307.20941.122.camel@triegel.csb/

Most probably the slowdown is caused by one, or a combination, of:
- the new 2.25 implementation simply being slower; maybe it has some internal contention or inefficient CPU cache usage? We probably could not do much about that.
- the new 2.25 implementation needing some configuration or option setting to perform like the old one did, in which case we need to adjust our code to include that.
- us misusing the condvar infrastructure, relying on some undefined behavior, or using it too frequently where we could easily use it less. Do we need the "wake up all" feature, or could it be something simpler like a mutex?

It is strange that only this binlog condvar shows the problem; at least so far I haven't seen any similar report.

Is this glibc widely used, or is it bound to some specific compiler or platform? If it is very common and a default choice, then we may check the 2nd and 3rd points from the list above and try to squeeze out as much as possible from our side.
Is there an alternative that could be used as a workaround?
[23 Jan 9:11] holmer deng
######  mysql 8.0.18  glibc2.31
[ 266s ] thds: 256 tps: 14976.74 qps: 59948.97 (r/w/o: 0.00/59948.97/0.00) lat (ms,95%): 24.38 err/s: 0.00 reconn/s: 0.00
[ 267s ] thds: 256 tps: 15816.71 qps: 63216.83 (r/w/o: 0.00/63215.83/1.00) lat (ms,95%): 20.00 err/s: 0.00 reconn/s: 0.00
[ 268s ] thds: 256 tps: 15529.28 qps: 62138.13 (r/w/o: 0.00/62138.13/0.00) lat (ms,95%): 21.50 err/s: 0.00 reconn/s: 0.00
[ 269s ] thds: 256 tps: 14883.06 qps: 59544.24 (r/w/o: 0.00/59543.24/1.00) lat (ms,95%): 22.28 err/s: 0.00 reconn/s: 0.00
[ 270s ] thds: 256 tps: 15764.50 qps: 63071.99 (r/w/o: 0.00/63070.99/1.00) lat (ms,95%): 21.50 err/s: 0.00 reconn/s: 0.00
[ 271s ] thds: 256 tps: 14688.50 qps: 58676.00 (r/w/o: 0.00/58674.00/2.00) lat (ms,95%): 23.10 err/s: 0.00 reconn/s: 0.00
[ 272s ] thds: 256 tps: 14899.46 qps: 59667.83 (r/w/o: 0.00/59666.83/1.00) lat (ms,95%): 23.52 err/s: 0.00 reconn/s: 0.00
[ 273s ] thds: 256 tps: 15383.74 qps: 61546.97 (r/w/o: 0.00/61545.97/1.00) lat (ms,95%): 20.37 err/s: 0.00 reconn/s: 0.00
[ 274s ] thds: 256 tps: 14641.41 qps: 58524.65 (r/w/o: 0.00/58521.65/3.00) lat (ms,95%): 21.50 err/s: 0.00 reconn/s: 0.00
[ 275s ] thds: 256 tps: 15771.74 qps: 63106.95 (r/w/o: 0.00/63106.95/0.00) lat (ms,95%): 18.28 err/s: 0.00 reconn/s: 0.00
[ 276s ] thds: 256 tps: 16053.49 qps: 64149.95 (r/w/o: 0.00/64148.95/1.00) lat (ms,95%): 18.61 err/s: 0.00 reconn/s: 0.00
[ 277s ] thds: 256 tps: 15201.80 qps: 60875.18 (r/w/o: 0.00/60874.18/1.00) lat (ms,95%): 20.74 err/s: 0.00 reconn/s: 0.00
[ 278s ] thds: 256 tps: 15835.84 qps: 63318.34 (r/w/o: 0.00/63314.35/4.00) lat (ms,95%): 19.29 err/s: 0.00 reconn/s: 0.00
[ 279s ] thds: 256 tps: 15246.87 qps: 60979.46 (r/w/o: 0.00/60977.46/2.00) lat (ms,95%): 20.37 err/s: 0.00 reconn/s: 0.00
[ 280s ] thds: 256 tps: 15902.87 qps: 63604.47 (r/w/o: 0.00/63603.47/1.00) lat (ms,95%): 19.29 err/s: 0.00 reconn/s: 0.00

######  mysql 8.0.19  glibc2.31
[ 266s ] thds: 256 tps: 12171.69 qps: 48667.75 (r/w/o: 0.00/48664.75/3.00) lat (ms,95%): 28.67 err/s: 0.00 reconn/s: 0.00
[ 267s ] thds: 256 tps: 12314.77 qps: 49273.09 (r/w/o: 0.00/49273.09/0.00) lat (ms,95%): 27.17 err/s: 0.00 reconn/s: 0.00
[ 268s ] thds: 256 tps: 12005.99 qps: 48015.97 (r/w/o: 0.00/48015.97/0.00) lat (ms,95%): 27.66 err/s: 0.00 reconn/s: 0.00
[ 269s ] thds: 256 tps: 11966.33 qps: 47867.31 (r/w/o: 0.00/47866.31/1.00) lat (ms,95%): 27.66 err/s: 0.00 reconn/s: 0.00
[ 270s ] thds: 256 tps: 12177.89 qps: 48744.57 (r/w/o: 0.00/48744.57/0.00) lat (ms,95%): 27.17 err/s: 0.00 reconn/s: 0.00
[ 271s ] thds: 256 tps: 11694.25 qps: 46747.02 (r/w/o: 0.00/46746.02/1.00) lat (ms,95%): 28.16 err/s: 0.00 reconn/s: 0.00
[ 272s ] thds: 256 tps: 12385.59 qps: 49526.39 (r/w/o: 0.00/49525.39/1.00) lat (ms,95%): 26.20 err/s: 0.00 reconn/s: 0.00
[ 273s ] thds: 256 tps: 12258.11 qps: 49030.46 (r/w/o: 0.00/49030.46/0.00) lat (ms,95%): 26.68 err/s: 0.00 reconn/s: 0.00
[ 274s ] thds: 256 tps: 11910.03 qps: 47662.12 (r/w/o: 0.00/47662.12/0.00) lat (ms,95%): 27.66 err/s: 0.00 reconn/s: 0.00
[ 275s ] thds: 256 tps: 12452.04 qps: 49788.17 (r/w/o: 0.00/49786.17/2.00) lat (ms,95%): 27.17 err/s: 0.00 reconn/s: 0.00
[ 276s ] thds: 256 tps: 12392.02 qps: 49592.05 (r/w/o: 0.00/49590.05/2.00) lat (ms,95%): 26.20 err/s: 0.00 reconn/s: 0.00
[ 277s ] thds: 256 tps: 12275.04 qps: 49100.16 (r/w/o: 0.00/49100.16/0.00) lat (ms,95%): 26.68 err/s: 0.00 reconn/s: 0.00
[ 278s ] thds: 256 tps: 11971.66 qps: 47899.64 (r/w/o: 0.00/47899.64/0.00) lat (ms,95%): 29.19 err/s: 0.00 reconn/s: 0.00
[ 279s ] thds: 256 tps: 12238.29 qps: 48905.12 (r/w/o: 0.00/48905.12/0.00) lat (ms,95%): 28.16 err/s: 0.00 reconn/s: 0.00
[ 280s ] thds: 256 tps: 12022.92 qps: 48136.66 (r/w/o: 0.00/48136.66/0.00) lat (ms,95%): 29.72 err/s: 0.00 reconn/s: 0.00

As shown above, I tested both 8.0.18 and 8.0.19 and found that 8.0.18 had no problems, while 8.0.19 showed the performance degradation. Looking through the 8.0.19 changelog, I found this WL: https://dev.mysql.com/worklog/task/?id=7846. I suspect this change is to blame.
[23 Jan 9:12] holmer deng
mysql8.0.18 vs mysql8.0.19 flame graphs

Attachment: mysql8.0.18 vs mysql8.0.19 flame graphs.tar (application/x-tar, text), 2.27 MiB.

[23 Jan 11:12] MySQL Verification Team
Hi Mr. deng,

We do not think that the patch you have outlined has anything to do with the glibc version at all.

Also, for your information, our current version is 8.0.36 and not 8.0.19.
[23 Jan 12:18] holmer deng
I know; what I'm trying to say is that this problem started with MySQL 8.0.19 and did not occur in 8.0.18 and earlier versions.

Look at my test results above: with the same glibc version (2.31), 8.0.18 had no problems, while 8.0.19 had performance issues.
[23 Jan 12:48] MySQL Verification Team
Hi,

We have recently tested 8.0.36 with glibc 2.35 and found no slowdown whatsoever, compared with glibc 2.21.
[24 Jan 4:03] holmer deng
I tested 8.0.36 with glibc 2.31 and found that the slowdown was still present.

I also tested other versions of MySQL.

### sysbench threads=256  write_only  same machine, same glibc, same config
Versions with slowdown:
mysql-8.0.19
mysql-8.0.20
mysql-8.0.25
mysql-8.0.35
mysql-8.0.36

Versions without slowdown:
mysql-8.0.11
mysql-8.0.15
mysql-8.0.18

These tests show that the slowdown started with MySQL 8.0.19.
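(A sketch of how the exact commit could be narrowed down, assuming the public mysql/mysql-server repository and its mysql-8.0.18 / mysql-8.0.19 tags:)
```
$ git clone https://github.com/mysql/mysql-server.git && cd mysql-server
$ git bisect start mysql-8.0.19 mysql-8.0.18    # bad tag, good tag
$ # build each candidate, run the sysbench workload above, then mark it:
$ #   git bisect good   or   git bisect bad
```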
[24 Jan 4:05] holmer deng
Machine info

Attachment: mysql 8.0.18 vs 8.0.36 slowdown.tar (application/x-tar, text), 2.46 MiB.

[24 Jan 15:00] Sven Sandberg
Thank you, interesting analysis. I confirm that WL#7846 did change code executed within `Commit_stage_manager::enroll_for`, so there is evidence that this worklog introduced the regression.

However, we only expect WL#7846 to have any visible impact on replicas that use slave-preserve-commit-order and have the binary log disabled. Hence the following questions:

- Is this a single server? And if not:
  - What is the replication topology?
  - Which servers have client workloads?
  - Is the binlog enabled?
  - What are the values for @@global.slave_parallel_workers, 
    @@global.slave_parallel_type, @@global.slave_preserve_commit_order?

Also, since you are bisecting already, would you be able to check the exact commit of WL#7846? (i.e., https://github.com/mysql/mysql-server/commit/5beea8f7152abc8905cf3f220a17f29014ef1464 )
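(For reference, a minimal sketch of collecting these values on the affected server; the client connection options are assumptions:)
```
$ mysql -u mysql -p -e "SELECT @@global.slave_parallel_workers,
       @@global.slave_parallel_type, @@global.slave_preserve_commit_order\G"
```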
[25 Jan 3:30] holmer deng
It is a single server.
[11 Jun 5:32] MySQL Verification Team
Thank you for the feedback.

regards,
Umesh
[11 Jun 9:41] Jakub Lopuszanski
Hello Holmer Deng!
We want to reproduce this issue locally.
Can you please share the contents of oltp_write_only_version1.lua?