Description:
High contention on InnoDB per-page BUF_BLOCK_MUTEX during concurrent range-selects to the same row/page (buf_page_get_gen() / buf_page_mutex_enter() path).
Please investigate reducing the frequency/necessity of this mutex in extreme hot-page workloads.
When many connections run the same range query against the same row, SHOW ENGINE INNODB STATUS frequently reports semaphore waits on Mutex BUF_BLOCK_MUTEX from buf0buf.cc. Even though each wait often prints as 0 seconds, the volume is high and correlates with lower throughput compared to a test that spreads reads across different rows/pages.
Example from SHOW ENGINE INNODB STATUS semaphores section:
--Thread 140408444081728 has waited at buf0buf.cc line 4520 for 0 seconds the semaphore:
Mutex at 0x7fb7a4ce16f0, Mutex BUF_BLOCK_MUTEX created buf0buf.cc:0, locked by 140398751872576
--Thread 140398731793984 has waited at buf0buf.cc line 4520 for 0 seconds the semaphore:
Mutex at 0x7fb7a4ce16f0, Mutex BUF_BLOCK_MUTEX created buf0buf.cc:0, locked by 140398723339840
Request: Investigate whether InnoDB can reduce the need to acquire the per-block mutex (or shorten critical section / use different atomic scheme) in the “same hot page, high concurrency, read-only range selects” case.
How to repeat:
Create table contention_test with enough rows (I used 30k rows per value of col_a 300 values total used on this test); ensure query uses index on col_a.
CREATE TABLE `contention_test` (
`id` int NOT NULL AUTO_INCREMENT,
`col_a` int DEFAULT NULL,
`col_b` int DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `col_a` (`col_a`)
) ENGINE=InnoDB AUTO_INCREMENT=9000002 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
1) Run hot-page test (all threads hitting the same value col_a=1), with 100 concurrency and 10 iterations, for a total of 1000 queries.
$time mysqlslap --user=root --password=msandbox --host=127.0.0.1 --port=8409 \
--concurrency=100 --iterations=10 --create-schema=test \
--query="SELECT * FROM contention_test WHERE col_a = 1"
2) Run fewer contention tests (spread across different col_a values; 10 concurrent mysqlslap processes, each concurrency=10, different col_a values stepping by 25), for a total of 1000 queries:
$ time ( for i in {1..10}; do
mysqlslap --user=root --password=msandbox --host=127.0.0.1 --port=8409 \
--concurrency=10 --iterations=10 --create-schema=test \
--query="SELECT * FROM contention_test WHERE col_a = ${j}" &
j=$((j+25))
done; wait; )
3) Run fewer contention tests (spread across different col_a values; 20 concurrent mysqlslap processes, each concurrency=10, different col_a values stepping by 13), for a total of 2000 queries:
$time ( for i in {1..20}; do
mysqlslap --user=root --password=msandbox --host=127.0.0.1 --port=8409 \
--concurrency=10 --iterations=10 --create-schema=test \
--query="SELECT * FROM contention_test WHERE col_a = ${j}" &
j=$((j+13))
done; wait; )
Observed results
1) Hot-page test (100 concurrency, same col_a, 1000 queries):
real 0m31.510s
user 0m38.015s
sys 0m5.649s
2) Spread test (10 concurrent mysqlslap processes, each concurrency=10, different col_a values, 1000 queries):
real 0m20.937s
user 0m8.905s
sys 0m2.701s
3) Spread test (20 concurrent mysqlslap processes, each concurrency=10, different col_a values, 2000 queries):
real 0m30.605s
user 0m42.118s
sys 0m8.313s
We could see that doubling the number of concurrent mysqlslap processes from 10 to 20 resulted in a time distribution more similar to the hot-page test, even though the total number of queries was doubled.
Furthermore, we used bcc to check the latency of the buf_page_optimistic_get function and found that the latency is much higher in the single-range-select case.
1) single range-select case (1000 queries):
$ sudo /usr/share/bcc/tools/funclatency -p 2316565 /mysql/8.4.9/bin/mysqld:_Z23buf_page_optimistic_getmP11buf_block_tm10Page_fetchPKcmP5mtr_t
Tracing 1 functions for "/mysql/8.4.9/bin/mysqld:_Z23buf_page_optimistic_getmP11buf_block_tm10Page_fetchPKcmP5mtr_t"... Hit Ctrl-C to end.
^C
nsecs : count distribution Total nsecs based on the right column of the nsecs column
0 -> 1 : 0 | | 1*0 = 0
2 -> 3 : 0 | | 3*0 = 0
4 -> 7 : 0 | | 7*0 = 0
8 -> 15 : 0 | | 15*0 = 0
16 -> 31 : 0 | | 31*0 = 0
32 -> 63 : 0 | | 63*0 = 0
64 -> 127 : 0 | | 127*0 = 0
128 -> 255 : 0 | | 255*0 = 0
256 -> 511 : 0 | | 511*0 = 0
512 -> 1023 : 0 | | 1023*0 = 0
1024 -> 2047 : 12228 | | 2047*12228 = 25014996
2048 -> 4095 : 4306291 |********** | 4095*4306291 = 17646460045
4096 -> 8191 : 15674574 |****************************************| 8191*15674574 = 128504071094
8192 -> 16383 : 5406530 |************* | 16383*5406530 = 88479057990
16384 -> 32767 : 1676762 |**** | 32767*1676762 = 54946584734
32768 -> 65535 : 1029455 |** | 65535*1029455 = 67504803825
65536 -> 131071 : 604328 |* | 131071*604328 = 79160131488
131072 -> 262143 : 324081 | | 262143*324081 = 8495105283
262144 -> 524287 : 498806 |* | 524287*498806 = 261580561022
524288 -> 1048575 : 326693 | | 1048575*326693 = 343561213650
1048576 -> 2097151 : 123370 | | 2097151*123370 = 258862287700
2097152 -> 4194303 : 17373 | | 4194303*17373 = 728541969
4194304 -> 8388607 : 508 | | 8388607*508 = 4261412356
8388608 -> 16777215 : 1 | | 16777215*1 = 16777215
avg = 33747 nsecs, total: 1012463558244 nsecs, count: 30001000
2) multi range-select case (1000 queries):
$ sudo /usr/share/bcc/tools/funclatency -p 2316565 /home/roberto.garciadebem/opt/mysql/8.4.9/bin/mysqld:_Z23buf_page_optimistic_getmP11buf_block_tm10Page_fetchPKcmP5mtr_t
Tracing 1 functions for "/home/roberto.garciadebem/opt/mysql/8.4.9/bin/mysqld:_Z23buf_page_optimistic_getmP11buf_block_tm10Page_fetchPKcmP5mtr_t"... Hit Ctrl-C to end.
^C
nsecs : count distribution Total nsecs based on the right column of the nsecs column
0 -> 1 : 0 | | 1*0 = 0
2 -> 3 : 0 | | 3*0 = 0
4 -> 7 : 0 | | 7*0 = 0
8 -> 15 : 0 | | 15*0 = 0
16 -> 31 : 0 | | 31*0 = 0
32 -> 63 : 0 | | 63*0 = 0
64 -> 127 : 0 | | 127*0 = 0
128 -> 255 : 0 | | 255*0 = 0
256 -> 511 : 0 | | 511*0 = 0
512 -> 1023 : 0 | | 1023*0 = 0
1024 -> 2047 : 337 | | 2047*337 = 724939
2048 -> 4095 : 6425036 |*********** | 4095*6425036 = 26319366300
4096 -> 8191 : 22267018 |****************************************| 8191*22267018 = 182389989338
8192 -> 16383 : 1287200 |** | 16383*1287200 = 21064176000
16384 -> 32767 : 17695 | | 32767*17695 = 579830665
32768 -> 65535 : 654 | | 654*654 = 427716
65536 -> 131071 : 90 | | 131071*90 = 11796390
131072 -> 262143 : 2940 | | 262143*2940 = 770700420
262144 -> 524287 : 78 | | 524287*78 = 40894386
524288 -> 1048575 : 42 | | 1048575*42 = 44040150
1048576 -> 2097151 : 7 | | 2097151*7 = 14680057
2097152 -> 4194303 : 1 | | 4194303*1 = 4194303
4194304 -> 8388607 : 1 | | 8388607*1 = 8388607
8388608 -> 16777215 : 1 | | 16777215*1 = 16777215
avg = 5132 nsecs, total: 153980822172 nsecs, count: 30001100
3) multi-range-select case (2000 queries):
$ sudo /usr/share/bcc/tools/funclatency -p 2316565 /mysql/8.4.9/bin/mysqld:_Z23buf_page_optimistic_getmP11buf_block_tm10Page_fetchPKcmP5mtr_t
Tracing 1 functions for "/mysql/8.4.9/bin/mysqld:_Z23buf_page_optimistic_getmP11buf_block_tm10Page_fetchPKcmP5mtr_t"... Hit Ctrl-C to end.
^C
nsecs : count distribution Total nsecs based on the right column of the nsecs column
0 -> 1 : 0 | | 1*0 = 0
2 -> 3 : 0 | | 3*0 = 0
4 -> 7 : 0 | | 7*0 = 0
8 -> 15 : 0 | | 15*0 = 0
16 -> 31 : 0 | | 31*0 = 0
32 -> 63 : 0 | | 63*0 = 0
64 -> 127 : 0 | | 127*0 = 0
128 -> 255 : 0 | | 255*0 = 0
256 -> 511 : 0 | | 511*0 = 0
512 -> 1023 : 0 | | 1023*0 = 0
1024 -> 2047 : 337 | | 2047*337 = 724939
2048 -> 4095 : 6425036 |*********** | 4095*6425036 = 26319366300
4096 -> 8191 : 22267018 |****************************************| 8191*22267018 = 182389989338
8192 -> 16383 : 1287200 |** | 16383*1287200 = 21064176000
16384 -> 32767 : 17695 | | 32767*17695 = 579830665
32768 -> 65535 : 654 | | 654*654 = 427716
65536 -> 131071 : 90 | | 131071*90 = 11796390
131072 -> 262143 : 2940 | | 262143*2940 = 770700420
262144 -> 524287 : 78 | | 524287*78 = 40894386
524288 -> 1048575 : 42 | | 1048575*42 = 44040150
1048576 -> 2097151 : 7 | | 2097151*7 = 14680057
2097152 -> 4194303 : 1 | | 4194303*1 = 4194303
4194304 -> 8388607 : 1 | | 8388607*1 = 8388607
8388608 -> 16777215 : 1 | | 16777215*1 = 16777215
avg = 5132 nsecs, total: 153980822172 nsecs, count: 30001100
Besides the high difference on total and average latency, we also see that the distribution is more skewed on the got-page test case. With interval latency reaching up to 524287 nsecs with 498806, while this same interval has only 375 occurrences on the multithreaded case. Also, we can see that on the hot-page test there are a many ocurreces on the higher latency intervals, for example in the interval from 524288 -> 1048575 there are 326693 occurrences, which basing on the highest latency interval, shows us 343561213650 nsecs of latency, higher than the latency on other lower latency intervals.
Suggested fix:
Hot-page range-select workload should scale better (or at least avoid excessive latch contention on BUF_BLOCK_MUTEX); ideally, reduce the need/frequency of acquiring BUF_BLOCK_MUTEX for read-only access in a hot-page scenario.
Description: High contention on InnoDB per-page BUF_BLOCK_MUTEX during concurrent range-selects to the same row/page (buf_page_get_gen() / buf_page_mutex_enter() path). Please investigate reducing the frequency/necessity of this mutex in extreme hot-page workloads. When many connections run the same range query against the same row, SHOW ENGINE INNODB STATUS frequently reports semaphore waits on Mutex BUF_BLOCK_MUTEX from buf0buf.cc. Even though each wait often prints as 0 seconds, the volume is high and correlates with lower throughput compared to a test that spreads reads across different rows/pages. Example from SHOW ENGINE INNODB STATUS semaphores section: --Thread 140408444081728 has waited at buf0buf.cc line 4520 for 0 seconds the semaphore: Mutex at 0x7fb7a4ce16f0, Mutex BUF_BLOCK_MUTEX created buf0buf.cc:0, locked by 140398751872576 --Thread 140398731793984 has waited at buf0buf.cc line 4520 for 0 seconds the semaphore: Mutex at 0x7fb7a4ce16f0, Mutex BUF_BLOCK_MUTEX created buf0buf.cc:0, locked by 140398723339840 Request: Investigate whether InnoDB can reduce the need to acquire the per-block mutex (or shorten critical section / use different atomic scheme) in the “same hot page, high concurrency, read-only range selects” case. How to repeat: Create table contention_test with enough rows (I used 30k rows per value of col_a 300 values total used on this test); ensure query uses index on col_a. CREATE TABLE `contention_test` ( `id` int NOT NULL AUTO_INCREMENT, `col_a` int DEFAULT NULL, `col_b` int DEFAULT NULL, PRIMARY KEY (`id`), KEY `col_a` (`col_a`) ) ENGINE=InnoDB AUTO_INCREMENT=9000002 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci 1) Run hot-page test (all threads hitting the same value col_a=1), with 100 concurrency and 10 iterations, for a total of 1000 queries. $time mysqlslap --user=root --password=msandbox --host=127.0.0.1 --port=8409 \ --concurrency=100 --iterations=10 --create-schema=test \ --query="SELECT * FROM contention_test WHERE col_a = 1" 2) Run fewer contention tests (spread across different col_a values; 10 concurrent mysqlslap processes, each concurrency=10, different col_a values stepping by 25), for a total of 1000 queries: $ time ( for i in {1..10}; do mysqlslap --user=root --password=msandbox --host=127.0.0.1 --port=8409 \ --concurrency=10 --iterations=10 --create-schema=test \ --query="SELECT * FROM contention_test WHERE col_a = ${j}" & j=$((j+25)) done; wait; ) 3) Run fewer contention tests (spread across different col_a values; 20 concurrent mysqlslap processes, each concurrency=10, different col_a values stepping by 13), for a total of 2000 queries: $time ( for i in {1..20}; do mysqlslap --user=root --password=msandbox --host=127.0.0.1 --port=8409 \ --concurrency=10 --iterations=10 --create-schema=test \ --query="SELECT * FROM contention_test WHERE col_a = ${j}" & j=$((j+13)) done; wait; ) Observed results 1) Hot-page test (100 concurrency, same col_a, 1000 queries): real 0m31.510s user 0m38.015s sys 0m5.649s 2) Spread test (10 concurrent mysqlslap processes, each concurrency=10, different col_a values, 1000 queries): real 0m20.937s user 0m8.905s sys 0m2.701s 3) Spread test (20 concurrent mysqlslap processes, each concurrency=10, different col_a values, 2000 queries): real 0m30.605s user 0m42.118s sys 0m8.313s We could see that doubling the number of concurrent mysqlslap processes from 10 to 20 resulted in a time distribution more similar to the hot-page test, even though the total number of queries was doubled. Furthermore, we used bcc to check the latency of the buf_page_optimistic_get function and found that the latency is much higher in the single-range-select case. 1) single range-select case (1000 queries): $ sudo /usr/share/bcc/tools/funclatency -p 2316565 /mysql/8.4.9/bin/mysqld:_Z23buf_page_optimistic_getmP11buf_block_tm10Page_fetchPKcmP5mtr_t Tracing 1 functions for "/mysql/8.4.9/bin/mysqld:_Z23buf_page_optimistic_getmP11buf_block_tm10Page_fetchPKcmP5mtr_t"... Hit Ctrl-C to end. ^C nsecs : count distribution Total nsecs based on the right column of the nsecs column 0 -> 1 : 0 | | 1*0 = 0 2 -> 3 : 0 | | 3*0 = 0 4 -> 7 : 0 | | 7*0 = 0 8 -> 15 : 0 | | 15*0 = 0 16 -> 31 : 0 | | 31*0 = 0 32 -> 63 : 0 | | 63*0 = 0 64 -> 127 : 0 | | 127*0 = 0 128 -> 255 : 0 | | 255*0 = 0 256 -> 511 : 0 | | 511*0 = 0 512 -> 1023 : 0 | | 1023*0 = 0 1024 -> 2047 : 12228 | | 2047*12228 = 25014996 2048 -> 4095 : 4306291 |********** | 4095*4306291 = 17646460045 4096 -> 8191 : 15674574 |****************************************| 8191*15674574 = 128504071094 8192 -> 16383 : 5406530 |************* | 16383*5406530 = 88479057990 16384 -> 32767 : 1676762 |**** | 32767*1676762 = 54946584734 32768 -> 65535 : 1029455 |** | 65535*1029455 = 67504803825 65536 -> 131071 : 604328 |* | 131071*604328 = 79160131488 131072 -> 262143 : 324081 | | 262143*324081 = 8495105283 262144 -> 524287 : 498806 |* | 524287*498806 = 261580561022 524288 -> 1048575 : 326693 | | 1048575*326693 = 343561213650 1048576 -> 2097151 : 123370 | | 2097151*123370 = 258862287700 2097152 -> 4194303 : 17373 | | 4194303*17373 = 728541969 4194304 -> 8388607 : 508 | | 8388607*508 = 4261412356 8388608 -> 16777215 : 1 | | 16777215*1 = 16777215 avg = 33747 nsecs, total: 1012463558244 nsecs, count: 30001000 2) multi range-select case (1000 queries): $ sudo /usr/share/bcc/tools/funclatency -p 2316565 /home/roberto.garciadebem/opt/mysql/8.4.9/bin/mysqld:_Z23buf_page_optimistic_getmP11buf_block_tm10Page_fetchPKcmP5mtr_t Tracing 1 functions for "/home/roberto.garciadebem/opt/mysql/8.4.9/bin/mysqld:_Z23buf_page_optimistic_getmP11buf_block_tm10Page_fetchPKcmP5mtr_t"... Hit Ctrl-C to end. ^C nsecs : count distribution Total nsecs based on the right column of the nsecs column 0 -> 1 : 0 | | 1*0 = 0 2 -> 3 : 0 | | 3*0 = 0 4 -> 7 : 0 | | 7*0 = 0 8 -> 15 : 0 | | 15*0 = 0 16 -> 31 : 0 | | 31*0 = 0 32 -> 63 : 0 | | 63*0 = 0 64 -> 127 : 0 | | 127*0 = 0 128 -> 255 : 0 | | 255*0 = 0 256 -> 511 : 0 | | 511*0 = 0 512 -> 1023 : 0 | | 1023*0 = 0 1024 -> 2047 : 337 | | 2047*337 = 724939 2048 -> 4095 : 6425036 |*********** | 4095*6425036 = 26319366300 4096 -> 8191 : 22267018 |****************************************| 8191*22267018 = 182389989338 8192 -> 16383 : 1287200 |** | 16383*1287200 = 21064176000 16384 -> 32767 : 17695 | | 32767*17695 = 579830665 32768 -> 65535 : 654 | | 654*654 = 427716 65536 -> 131071 : 90 | | 131071*90 = 11796390 131072 -> 262143 : 2940 | | 262143*2940 = 770700420 262144 -> 524287 : 78 | | 524287*78 = 40894386 524288 -> 1048575 : 42 | | 1048575*42 = 44040150 1048576 -> 2097151 : 7 | | 2097151*7 = 14680057 2097152 -> 4194303 : 1 | | 4194303*1 = 4194303 4194304 -> 8388607 : 1 | | 8388607*1 = 8388607 8388608 -> 16777215 : 1 | | 16777215*1 = 16777215 avg = 5132 nsecs, total: 153980822172 nsecs, count: 30001100 3) multi-range-select case (2000 queries): $ sudo /usr/share/bcc/tools/funclatency -p 2316565 /mysql/8.4.9/bin/mysqld:_Z23buf_page_optimistic_getmP11buf_block_tm10Page_fetchPKcmP5mtr_t Tracing 1 functions for "/mysql/8.4.9/bin/mysqld:_Z23buf_page_optimistic_getmP11buf_block_tm10Page_fetchPKcmP5mtr_t"... Hit Ctrl-C to end. ^C nsecs : count distribution Total nsecs based on the right column of the nsecs column 0 -> 1 : 0 | | 1*0 = 0 2 -> 3 : 0 | | 3*0 = 0 4 -> 7 : 0 | | 7*0 = 0 8 -> 15 : 0 | | 15*0 = 0 16 -> 31 : 0 | | 31*0 = 0 32 -> 63 : 0 | | 63*0 = 0 64 -> 127 : 0 | | 127*0 = 0 128 -> 255 : 0 | | 255*0 = 0 256 -> 511 : 0 | | 511*0 = 0 512 -> 1023 : 0 | | 1023*0 = 0 1024 -> 2047 : 337 | | 2047*337 = 724939 2048 -> 4095 : 6425036 |*********** | 4095*6425036 = 26319366300 4096 -> 8191 : 22267018 |****************************************| 8191*22267018 = 182389989338 8192 -> 16383 : 1287200 |** | 16383*1287200 = 21064176000 16384 -> 32767 : 17695 | | 32767*17695 = 579830665 32768 -> 65535 : 654 | | 654*654 = 427716 65536 -> 131071 : 90 | | 131071*90 = 11796390 131072 -> 262143 : 2940 | | 262143*2940 = 770700420 262144 -> 524287 : 78 | | 524287*78 = 40894386 524288 -> 1048575 : 42 | | 1048575*42 = 44040150 1048576 -> 2097151 : 7 | | 2097151*7 = 14680057 2097152 -> 4194303 : 1 | | 4194303*1 = 4194303 4194304 -> 8388607 : 1 | | 8388607*1 = 8388607 8388608 -> 16777215 : 1 | | 16777215*1 = 16777215 avg = 5132 nsecs, total: 153980822172 nsecs, count: 30001100 Besides the high difference on total and average latency, we also see that the distribution is more skewed on the got-page test case. With interval latency reaching up to 524287 nsecs with 498806, while this same interval has only 375 occurrences on the multithreaded case. Also, we can see that on the hot-page test there are a many ocurreces on the higher latency intervals, for example in the interval from 524288 -> 1048575 there are 326693 occurrences, which basing on the highest latency interval, shows us 343561213650 nsecs of latency, higher than the latency on other lower latency intervals. Suggested fix: Hot-page range-select workload should scale better (or at least avoid excessive latch contention on BUF_BLOCK_MUTEX); ideally, reduce the need/frequency of acquiring BUF_BLOCK_MUTEX for read-only access in a hot-page scenario.