Description:
Running TPCCRunner test on a x86_64, 40 core E5-2680, Centos 7 server
pid=$(pidof mysqld) ; perf record -F 10 -o mysql-${name}.perf -p $pid -- sleep 20 ; perf record -F 10 -g -o mysql-${name}.g.perf -p $pid -- sleep 20
A perf recording shows that buf_page_get_gen has a high CPU usage. The call stack to it is quick.
(P.S. ignoring native_write_msr_safe - it seems to be a consequence of the perf recording itself.)
Samples: 51K of event 'cycles:ppp', Event count (approx.): 4851376331525430918
Overhead   Command  Shared Object      Symbol
+ 189.04%  mysqld   [kernel.kallsyms]  [k] native_write_msr_safe
+  12.60%  mysqld   mysqld             [.] rec_get_offsets_func
+  12.39%  mysqld   mysqld             [.] MYSQLparse
+   5.93%  mysqld   mysqld             [.] page_cur_search_with_match
-   4.62%  mysqld   mysqld             [.] buf_page_get_gen
+   4.00%  mysqld   mysqld             [.] cmp_dtuple_rec_with_match_low
+   3.68%  mysqld   mysqld             [.] btr_cur_search_to_nth_level
A disassembly of this profile shows:
│ Disassembly of section .text:
│
│ 0000000001096de0 <buf_page_get_gen(page_id_t const&, page_size_t const&, unsigned long, buf_block_t*, unsigned long, char const*, unsigned long, mtr_t*, bool)>:
│ _Z16buf_page_get_genRK9page_id_tRK11page_size_tmP11buf_block_tmPKcmP5mtr_tb():
│ ulint mode,
│ const char* file,
│ ulint line,
│ mtr_t* mtr,
│ bool dirty_with_no_latch)
│ {
0.96 │ push %rbp
│ mov %rsp,%rbp
│ push %r15
│ push %r14
│ mov %rcx,%r14
│ push %r13
│ mov %rdi,%r13
│ push %r12
0.48 │ push %rbx
│ sub $0x1c8,%rsp
│ mov 0x20(%rbp),%eax
0.96 │ mov %rdx,-0x1d8(%rbp)
│ mov (%rdi),%edx
│ mov %rsi,-0x1d0(%rbp)
│ mov %eax,-0x1dc(%rbp)
│ mov 0x4(%rdi),%eax
│ mov %edx,%ecx
│ mov %r8,-0x1c0(%rbp)
0.48 │ shr $0x6,%eax
│ shl $0x14,%ecx
0.48 │ mov %r9,-0x1e8(%rbp)
│ add %ecx,%eax
0.48 │ add %edx,%eax
│ xor %edx,%edx
2.40 │ divq 0xc2e98f(%rip) # 1cc57c8 <srv_buf_pool_instances>
│ lea (%rdx,%rdx,8),%rax
│ lea (%rdx,%rax,2),%r15
0.96 │ shl $0x6,%r15
0.48 │ add buf_pool_ptr,%r15
18.27 │ addq $0x1,0x198(%r15)
│ mov 0x8(%rdi),%rax
The addq $0x1 corresponds to buf_pool->stat.n_page_gets++; at the top of buf_page_get_gen.
So 18.27% of the 4.62% spent in buf_page_get_gen (0.1827 x 4.62% ≈ 0.84%), i.e. just under 1% of CPU time, was spent incrementing this counter.
I suspect this is so slow because, even with a large number of buffer pool instances, this counter is hit by all threads and therefore becomes an in-CPU slow point (the cache line holding it bounces between cores).
How to repeat:
TPCCRunner test:
$ cat conf/example/mysql/loader.properties
driver=com.mysql.jdbc.Driver
url=jdbc:mysql://localhost/tpcc
user=root
password=cutpcctool
threads=32
warehouses=1000
$ cat conf/example/mysql/slaveXX.properties
name=SLAVE_ID
masterAddress=127.0.0.1
masterPort=27891
driver=com.mysql.jdbc.Driver
url=jdbc:mysql://$remotehost/tpcc
user=tpcc
password=cu
poolSize=12
userCount=10
warehouseCount=CNT_WID
startWarehouseID=START_WID
60 slaves
Fairly standard config with a large number of innodb_buffer_pool_instances in an attempt to reduce this congestion.
#########################################################
[mysqld]
user=root
datadir = /home/dbtest
socket=/home/dbtest_tmp/mysql.sock
tmpdir=/home/dbtest_tmp
log-error=/home/dbtest_tmp/error.log
# DGB - for testing - quicker startups at expense of slower shutdowns
innodb_fast_shutdown=0
autocommit = 1
back_log = 128
binlog_cache_size = 1M
binlog_format = ROW
default_storage_engine=InnoDB
innodb_data_file_path = ibdata1:1000M:autoextend
innodb_log_buffer_size=16M
innodb_log_file_size=4096M
innodb_log_files_in_group = 5
slow_query_log = 1
sync_binlog = 1
transaction_isolation = READ-COMMITTED
innodb_file_per_table = true
lower_case_table_names = 1
skip-external-locking
low_priority_updates=1
skip-name-resolve
memlock
net_read_timeout=3600
net_write_timeout=3600
max_connections = 3000
key_buffer_size = 16384M
join_buffer_size = 4096M
sort_buffer_size = 4096M
read_buffer_size=2147479552
read_rnd_buffer_size=2147483647
table_open_cache = 4096
table_open_cache_instances=64
innodb_open_files=1000
innodb_flush_method = O_DIRECT ## O_DIRECT_NO_FSYNC
innodb_flush_log_at_trx_commit = 2
innodb_buffer_pool_size = 128G
innodb_buffer_pool_instances = 64
innodb_checksum_algorithm=none
innodb_doublewrite=0
innodb_thread_concurrency=0
innodb_adaptive_flushing = 1
innodb_max_dirty_pages_pct=50
innodb_use_native_aio= 1
innodb_purge_threads = 4
innodb_read_io_threads = 64
innodb_write_io_threads = 16
innodb_io_capacity = 20000
innodb_lru_scan_depth=20000
innodb-numa-interleave=0
innodb_adaptive_hash_index=0
innodb-adaptive-hash-index-parts=64
innodb_monitor_enable='%';
performance_schema=OFF
innodb_max_dirty_pages_pct=90
#general_log=on
[mysqld-5.7]
query_cache_size = 0
query_cache_type = 0
read_only = 0
Suggested fix:
buf_pool->stat.n_page_gets gets used in the information schema, SHOW ENGINE INNODB STATUS, and in the global status innodb_buffer_pool_read_requests.
I'm not sure a DBA could make a decision based on this statistic.
I recommend removing it; however, maybe that isn't acceptable in 5.7 (or 8.0?).
Note: the alternatives below are only loosely considered.
Alternative 1:
Move the increment below some of the lock acquisitions, a few lines later. The loop there seems to be just for paging a block in.
Alternative 2:
Move the increment to just before the return(NULL) for the in-pool-only modes (mode == BUF_GET_IF_IN_POOL
|| mode == BUF_PEEK_IF_IN_POOL
|| mode == BUF_GET_IF_IN_POOL_OR_WATCH)
and copy it to after got_block:.
The rationale for moving it after the locks is that the other stats in adjacent memory locations seem to be updated while holding a lock, so this could cause fewer conflicts.