Bug #97822 buf_page_get_gen buf_pool->stat.n_page_gets++ is a cpu waste
Submitted: 27 Nov 2019 23:58 Modified: 28 Jan 2:23
Reporter: Daniel Black (OCA) Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: InnoDB storage engine Severity:S5 (Performance)
Version:5.7.28 OS:Linux (Centos7)
Assigned to: CPU Architecture:Any

[27 Nov 2019 23:58] Daniel Black
Description:
Running the TPCCRunner test on an x86_64, 40-core E5-2680, CentOS 7 server

pid=$(pidof mysqld) ; perf record -F 10  -o mysql-${name}.perf  -p $pid -- sleep 20 ;   perf record -F 10 -g -o mysql-${name}.g.perf

A perf recording shows that buf_page_get_gen has high CPU usage. The call stack to it is quick.

(P.S. ignore native_write_msr_safe; it seems to be a consequence of the perf recording itself.)

Samples: 51K of event 'cycles:ppp', Event count (approx.): 4851376331525430918
  Overhead  Command  Shared Object        Symbol
+  189.04%  mysqld   [kernel.kallsyms]    [k] native_write_msr_safe
+   12.60%  mysqld   mysqld               [.] rec_get_offsets_func
+   12.39%  mysqld   mysqld               [.] MYSQLparse
+    5.93%  mysqld   mysqld               [.] page_cur_search_with_match
-    4.62%  mysqld   mysqld               [.] buf_page_get_gen
+    4.00%  mysqld   mysqld               [.] cmp_dtuple_rec_with_match_low
+    3.68%  mysqld   mysqld               [.] btr_cur_search_to_nth_level

A disassembly of this profile shows:
       │      Disassembly of section .text:
       │
       │      0000000001096de0 <buf_page_get_gen(page_id_t const&, page_size_t const&, unsigned long, buf_block_t*, unsigned long, char const*, unsigned long, mtr_t*, bool)>:
       │      _Z16buf_page_get_genRK9page_id_tRK11page_size_tmP11buf_block_tmPKcmP5mtr_tb():
       │              ulint                   mode,
       │              const char*             file,
       │              ulint                   line,
       │              mtr_t*                  mtr,
       │              bool                    dirty_with_no_latch)
       │      {
  0.96 │        push   %rbp
       │        mov    %rsp,%rbp
       │        push   %r15
       │        push   %r14
       │        mov    %rcx,%r14
       │        push   %r13
       │        mov    %rdi,%r13
       │        push   %r12
  0.48 │        push   %rbx
       │        sub    $0x1c8,%rsp
       │        mov    0x20(%rbp),%eax
  0.96 │        mov    %rdx,-0x1d8(%rbp)
       │        mov    (%rdi),%edx
       │        mov    %rsi,-0x1d0(%rbp)
       │        mov    %eax,-0x1dc(%rbp)
       │        mov    0x4(%rdi),%eax
       │        mov    %edx,%ecx
       │        mov    %r8,-0x1c0(%rbp)
  0.48 │        shr    $0x6,%eax
       │        shl    $0x14,%ecx
  0.48 │        mov    %r9,-0x1e8(%rbp)
       │        add    %ecx,%eax
  0.48 │        add    %edx,%eax
       │        xor    %edx,%edx
  2.40 │        divq   0xc2e98f(%rip)        # 1cc57c8 <srv_buf_pool_instances>
       │        lea    (%rdx,%rdx,8),%rax
       │        lea    (%rdx,%rax,2),%r15
  0.96 │        shl    $0x6,%r15
  0.48 │        add    buf_pool_ptr,%r15
 18.27 │        addq   $0x1,0x198(%r15)
       │        mov    0x8(%rdi),%rax

The addq $0x1 corresponds to buf_pool->stat.n_page_gets++; at the top of buf_page_get_gen.

So 18.27% of the 4.62% spent in buf_page_get_gen, hence just under 1% of total CPU time, was spent incrementing this counter.

I suspect this is so slow because, even with a large number of buffer pool instances, this counter gets hit by all threads and therefore becomes an in-CPU slow point.

How to repeat:
TPCCRunner test:

$ cat conf/example/mysql/loader.properties 
driver=com.mysql.jdbc.Driver
url=jdbc:mysql://localhost/tpcc
user=root
password=cutpcctool
threads=32
warehouses=1000

$ cat conf/example/mysql/slaveXX.properties 
name=SLAVE_ID
masterAddress=127.0.0.1
masterPort=27891

driver=com.mysql.jdbc.Driver
url=jdbc:mysql://$remotehost/tpcc
user=tpcc
password=cu
poolSize=12

userCount=10
warehouseCount=CNT_WID
startWarehouseID=START_WID

60 slaves 

Fairly standard config with a large number of innodb_buffer_pool_instances in an attempt to reduce this congestion.

#########################################################
[mysqld]
user=root
datadir = /home/dbtest
socket=/home/dbtest_tmp/mysql.sock
tmpdir=/home/dbtest_tmp
log-error=/home/dbtest_tmp/error.log

# DGB - for testing - quicker startups at expense of slower shutdowns
innodb_fast_shutdown=0

autocommit = 1
back_log = 128
binlog_cache_size = 1M
binlog_format = ROW
default_storage_engine=InnoDB
innodb_data_file_path = ibdata1:1000M:autoextend
innodb_log_buffer_size=16M
innodb_log_file_size=4096M
innodb_log_files_in_group = 5
slow_query_log = 1
sync_binlog = 1
transaction_isolation = READ-COMMITTED

innodb_file_per_table = true
lower_case_table_names = 1
skip-external-locking
low_priority_updates=1
skip-name-resolve
memlock
net_read_timeout=3600
net_write_timeout=3600
max_connections = 3000
key_buffer_size = 16384M
join_buffer_size = 4096M
sort_buffer_size = 4096M
read_buffer_size=2147479552
read_rnd_buffer_size=2147483647
table_open_cache = 4096
table_open_cache_instances=64
innodb_open_files=1000
innodb_flush_method = O_DIRECT ## O_DIRECT_NO_FSYNC
innodb_flush_log_at_trx_commit = 2
innodb_buffer_pool_size = 128G
innodb_buffer_pool_instances = 64
innodb_checksum_algorithm=none
innodb_doublewrite=0
innodb_thread_concurrency=0
innodb_thread_concurrency = 0
innodb_adaptive_flushing = 1
innodb_max_dirty_pages_pct=50
innodb_use_native_aio= 1
innodb_purge_threads = 4
innodb_read_io_threads = 64
innodb_write_io_threads = 16
innodb_io_capacity = 20000
innodb_lru_scan_depth=20000
innodb-numa-interleave=0
innodb_adaptive_hash_index=0
innodb-adaptive-hash-index-parts=64
innodb_monitor_enable='%';
performance_schema=OFF
innodb_max_dirty_pages_pct=90
#general_log=on

[mysqld-5.7]
query_cache_size = 0
query_cache_type = 0
read_only = 0

Suggested fix:
buf_pool->stat.n_page_gets gets used in the information schema, SHOW ENGINE INNODB STATUS, and in the global status innodb_buffer_pool_read_requests.

I'm not sure a DBA could make a decision based on this statistic.

I recommend removing it; however, maybe that isn't acceptable in 5.7 (or 8.0?).

Note: the alternatives below are only loosely considered.

alternative 1:

Move the increment below some of the lock acquisitions, a few lines later. The loop seems to be just for a page-in.

alternative 2:

Move the increment to just before the return(NULL) for (mode == BUF_GET_IF_IN_POOL
                    || mode == BUF_PEEK_IF_IN_POOL
                    || mode == BUF_GET_IF_IN_POOL_OR_WATCH)

and copy it to after got_block:

The rationale for moving it after the locks is that the other stats in adjacent memory locations seem to be updated while holding a lock, so this could cause fewer conflicts.
[28 Nov 2019 13:13] Sinisa Milivojevic
Hello Mr. Black,

Thank you for your performance improvement suggestion.

I like it a lot.

Verified as reported.
[23 Jan 21:57] Daniel Price
Posted by developer:
 
Fixed as of the upcoming 8.0.20 release, and here's the changelog entry:

An internal buffer pool statistics counter (n_page_gets) was partitioned
by page number to avoid contention when accessed by multiple threads.
[24 Jan 13:07] Sinisa Milivojevic
Thank you, Daniel.
[28 Jan 2:23] Daniel Black
Thank you Daniel P