Bug #120388 High contention on InnoDB per-page BUF_BLOCK_MUTEX during concurrent range-selects to same row/page
Submitted: 30 Apr 19:36
Reporter: Roberto de Bem Email Updates:
Status: Open Impact on me:
None 
Category:MySQL Server: InnoDB storage engine Severity:S5 (Performance)
Version:8.4.9 OS:Oracle Linux (9.5)
Assigned to: CPU Architecture:x86
Tags: buf block, buffer pool, contention, MYSQL SERVER

[30 Apr 19:36] Roberto de Bem
Description:
High contention on InnoDB per-page BUF_BLOCK_MUTEX during concurrent range-selects to the same row/page (buf_page_get_gen() / buf_page_mutex_enter() path). 
Please investigate reducing the frequency/necessity of this mutex in extreme hot-page workloads.

When many connections run the same range query against the same row, SHOW ENGINE INNODB STATUS frequently reports semaphore waits on Mutex BUF_BLOCK_MUTEX from buf0buf.cc. Even though each wait often prints as 0 seconds, the volume is high and correlates with lower throughput compared to a test that spreads reads across different rows/pages.

Example from SHOW ENGINE INNODB STATUS semaphores section:

--Thread 140408444081728 has waited at buf0buf.cc line 4520 for 0 seconds the semaphore:
Mutex at 0x7fb7a4ce16f0, Mutex BUF_BLOCK_MUTEX created buf0buf.cc:0, locked by 140398751872576

--Thread 140398731793984 has waited at buf0buf.cc line 4520 for 0 seconds the semaphore:
Mutex at 0x7fb7a4ce16f0, Mutex BUF_BLOCK_MUTEX created buf0buf.cc:0, locked by 140398723339840

Request: Investigate whether InnoDB can reduce the need to acquire the per-block mutex (or shorten critical section / use different atomic scheme) in the “same hot page, high concurrency, read-only range selects” case.

How to repeat:
Create table contention_test with enough rows (I used 30k rows per value of col_a 300 values total used on this test); ensure query uses index on col_a.

CREATE TABLE `contention_test` (
  `id` int NOT NULL AUTO_INCREMENT,
  `col_a` int DEFAULT NULL,
  `col_b` int DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `col_a` (`col_a`)
) ENGINE=InnoDB AUTO_INCREMENT=9000002 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci

1) Run hot-page test (all threads hitting the same value col_a=1), with 100 concurrency and 10 iterations, for a total of 1000 queries.
$time mysqlslap --user=root --password=msandbox --host=127.0.0.1 --port=8409 \
  --concurrency=100 --iterations=10 --create-schema=test \
  --query="SELECT * FROM contention_test WHERE col_a = 1"

2) Run fewer contention tests (spread across different col_a values; 10 concurrent mysqlslap processes, each concurrency=10, different col_a values stepping by 25), for a total of 1000 queries:
$ time ( for i in {1..10}; do
  mysqlslap --user=root --password=msandbox --host=127.0.0.1 --port=8409 \
    --concurrency=10 --iterations=10 --create-schema=test \
    --query="SELECT * FROM contention_test WHERE col_a = ${j}" &
  j=$((j+25))
done; wait; )

3) Run fewer contention tests (spread across different col_a values; 20 concurrent mysqlslap processes, each concurrency=10, different col_a values stepping by 13), for a total of 2000 queries:
$time ( for i in {1..20}; do
  mysqlslap --user=root --password=msandbox --host=127.0.0.1 --port=8409 \
    --concurrency=10 --iterations=10 --create-schema=test \
    --query="SELECT * FROM contention_test WHERE col_a = ${j}" &
  j=$((j+13))
done; wait; )

Observed results

1) Hot-page test (100 concurrency, same col_a, 1000 queries):
real	0m31.510s
user	0m38.015s
sys	0m5.649s

2) Spread test (10 concurrent mysqlslap processes, each concurrency=10, different col_a values, 1000 queries):
real	0m20.937s
user	0m8.905s
sys	0m2.701s

3) Spread test (20 concurrent mysqlslap processes, each concurrency=10, different col_a values, 2000 queries):

real	0m30.605s
user	0m42.118s
sys	0m8.313s

We could see that doubling the number of concurrent mysqlslap processes from 10 to 20 resulted in a time distribution more similar to the hot-page test, even though the total number of queries was doubled.

Furthermore, we used bcc to check the latency of the buf_page_optimistic_get function and found that the latency is much higher in the single-range-select case.

1) single range-select case (1000 queries):
$ sudo /usr/share/bcc/tools/funclatency -p 2316565 /mysql/8.4.9/bin/mysqld:_Z23buf_page_optimistic_getmP11buf_block_tm10Page_fetchPKcmP5mtr_t
Tracing 1 functions for "/mysql/8.4.9/bin/mysqld:_Z23buf_page_optimistic_getmP11buf_block_tm10Page_fetchPKcmP5mtr_t"... Hit Ctrl-C to end.
^C
     nsecs               : count     distribution                              Total nsecs based on the right column of the nsecs column 
         0 -> 1          : 0        |                                        | 1*0            = 0
         2 -> 3          : 0        |                                        | 3*0            = 0
         4 -> 7          : 0        |                                        | 7*0            = 0
         8 -> 15         : 0        |                                        | 15*0           = 0
        16 -> 31         : 0        |                                        | 31*0           = 0
        32 -> 63         : 0        |                                        | 63*0           = 0
        64 -> 127        : 0        |                                        | 127*0          = 0
       128 -> 255        : 0        |                                        | 255*0          = 0
       256 -> 511        : 0        |                                        | 511*0          = 0
       512 -> 1023       : 0        |                                        | 1023*0         = 0
      1024 -> 2047       : 12228    |                                        | 2047*12228     = 25014996
      2048 -> 4095       : 4306291  |**********                              | 4095*4306291   = 17646460045
      4096 -> 8191       : 15674574 |****************************************| 8191*15674574  = 128504071094
      8192 -> 16383      : 5406530  |*************                           | 16383*5406530  = 88479057990
     16384 -> 32767      : 1676762  |****                                    | 32767*1676762  = 54946584734
     32768 -> 65535      : 1029455  |**                                      | 65535*1029455  = 67504803825
     65536 -> 131071     : 604328   |*                                       | 131071*604328  = 79160131488
    131072 -> 262143     : 324081   |                                        | 262143*324081  = 8495105283
    262144 -> 524287     : 498806   |*                                       | 524287*498806  = 261580561022
    524288 -> 1048575    : 326693   |                                        | 1048575*326693 = 343561213650
   1048576 -> 2097151    : 123370   |                                        | 2097151*123370 = 258862287700
   2097152 -> 4194303    : 17373    |                                        | 4194303*17373  = 728541969
   4194304 -> 8388607    : 508      |                                        | 8388607*508    = 4261412356
   8388608 -> 16777215   : 1        |                                        | 16777215*1     = 16777215

avg = 33747 nsecs, total: 1012463558244 nsecs, count: 30001000

2) multi range-select case (1000 queries):

$ sudo /usr/share/bcc/tools/funclatency -p 2316565 /home/roberto.garciadebem/opt/mysql/8.4.9/bin/mysqld:_Z23buf_page_optimistic_getmP11buf_block_tm10Page_fetchPKcmP5mtr_t
Tracing 1 functions for "/home/roberto.garciadebem/opt/mysql/8.4.9/bin/mysqld:_Z23buf_page_optimistic_getmP11buf_block_tm10Page_fetchPKcmP5mtr_t"... Hit Ctrl-C to end.
^C
     nsecs               : count     distribution                              Total nsecs based on the right column of the nsecs column
         0 -> 1          : 0        |                                        | 1*0           = 0
         2 -> 3          : 0        |                                        | 3*0           = 0
         4 -> 7          : 0        |                                        | 7*0           = 0
         8 -> 15         : 0        |                                        | 15*0          = 0
        16 -> 31         : 0        |                                        | 31*0          = 0
        32 -> 63         : 0        |                                        | 63*0          = 0
        64 -> 127        : 0        |                                        | 127*0         = 0
       128 -> 255        : 0        |                                        | 255*0         = 0
       256 -> 511        : 0        |                                        | 511*0         = 0
       512 -> 1023       : 0        |                                        | 1023*0        = 0
      1024 -> 2047       : 337      |                                        | 2047*337      = 724939
      2048 -> 4095       : 6425036  |***********                             | 4095*6425036  = 26319366300
      4096 -> 8191       : 22267018 |****************************************| 8191*22267018 = 182389989338
      8192 -> 16383      : 1287200  |**                                      | 16383*1287200 = 21064176000
     16384 -> 32767      : 17695    |                                        | 32767*17695   = 579830665
     32768 -> 65535      : 654      |                                        | 654*654       = 427716
     65536 -> 131071     : 90       |                                        | 131071*90     = 11796390
    131072 -> 262143     : 2940     |                                        | 262143*2940   = 770700420
    262144 -> 524287     : 78       |                                        | 524287*78     = 40894386
    524288 -> 1048575    : 42       |                                        | 1048575*42    = 44040150
   1048576 -> 2097151    : 7        |                                        | 2097151*7     = 14680057
   2097152 -> 4194303    : 1        |                                        | 4194303*1     = 4194303
   4194304 -> 8388607    : 1        |                                        | 8388607*1     = 8388607
   8388608 -> 16777215   : 1        |                                        | 16777215*1    = 16777215

avg = 5132 nsecs, total: 153980822172 nsecs, count: 30001100

3) multi-range-select case (2000 queries):

$ sudo /usr/share/bcc/tools/funclatency -p 2316565 /mysql/8.4.9/bin/mysqld:_Z23buf_page_optimistic_getmP11buf_block_tm10Page_fetchPKcmP5mtr_t
Tracing 1 functions for "/mysql/8.4.9/bin/mysqld:_Z23buf_page_optimistic_getmP11buf_block_tm10Page_fetchPKcmP5mtr_t"... Hit Ctrl-C to end.
^C
     nsecs               : count     distribution                              Total nsecs based on the right column of the nsecs column
         0 -> 1          : 0        |                                        | 1*0 = 0
         2 -> 3          : 0        |                                        | 3*0 = 0
         4 -> 7          : 0        |                                        | 7*0 = 0
         8 -> 15         : 0        |                                        | 15*0 = 0
        16 -> 31         : 0        |                                        | 31*0 = 0
        32 -> 63         : 0        |                                        | 63*0 = 0
        64 -> 127        : 0        |                                        | 127*0 = 0
       128 -> 255        : 0        |                                        | 255*0 = 0
       256 -> 511        : 0        |                                        | 511*0 = 0
       512 -> 1023       : 0        |                                        | 1023*0 = 0
      1024 -> 2047       : 337      |                                        | 2047*337 = 724939
      2048 -> 4095       : 6425036  |***********                             | 4095*6425036 = 26319366300
      4096 -> 8191       : 22267018 |****************************************| 8191*22267018 = 182389989338
      8192 -> 16383      : 1287200  |**                                      | 16383*1287200 = 21064176000
     16384 -> 32767      : 17695    |                                        | 32767*17695 = 579830665
     32768 -> 65535      : 654      |                                        | 654*654 = 427716
     65536 -> 131071     : 90       |                                        | 131071*90 = 11796390
    131072 -> 262143     : 2940     |                                        | 262143*2940 = 770700420
    262144 -> 524287     : 78       |                                        | 524287*78 = 40894386
    524288 -> 1048575    : 42       |                                        | 1048575*42 = 44040150
   1048576 -> 2097151    : 7        |                                        | 2097151*7 = 14680057
   2097152 -> 4194303    : 1        |                                        | 4194303*1 = 4194303
   4194304 -> 8388607    : 1        |                                        | 8388607*1 = 8388607
   8388608 -> 16777215   : 1        |                                        | 16777215*1 = 16777215

avg = 5132 nsecs, total: 153980822172 nsecs, count: 30001100

Besides the high difference on total and average latency, we also see that the distribution is more skewed on the got-page test case. With interval latency reaching up to 524287 nsecs with 498806, while this same interval has only 375 occurrences on the multithreaded case. Also, we can see that on the hot-page test there are a many ocurreces on the higher latency intervals, for example in the interval from 524288 -> 1048575 there are 326693 occurrences, which basing on the highest latency interval, shows us 343561213650 nsecs of latency, higher than the latency on other lower latency intervals.

Suggested fix:
Hot-page range-select workload should scale better (or at least avoid excessive latch contention on BUF_BLOCK_MUTEX); ideally, reduce the need/frequency of acquiring BUF_BLOCK_MUTEX for read-only access in a hot-page scenario.
[1 May 1:31] Jean-François Gagné
Note: I am not part of MySQL Support Team, just a Community Member trying to be helpful...

> Create table contention_test with enough rows (I used 30k rows per value of col_a 300 values total used on this test); ensure query uses index on col_a.

How do you create such table ?  Are the rows with the same value for col_a consecutive, or do you have 300 consecutive rows with values for col_a from 1 to 300 repeated 100 times ?  I think this might be important for the results.

> with 100 concurrency and 10 iterations

How many CPUs do you have on the server running this ?  If you have less than 100 CPU, there might be more than Mutex Contention.  I saw a situation in the past where having more concurrent queries than CPU caused significant slow-down, and my guess was that a thread was unscheduled while holding a mutex.  Setting innodb_thread_concurrency [1] to 85% of the CPUs greatly improve query latency in my case.

[1]: https://dev.mysql.com/doc/refman/8.4/en/innodb-parameters.html#sysvar_innodb_thread_concur...

Have you considered running your test while monitoring with ps-top [2] ?  You might be able to get good information about total Mutex acquisition time, at least better than what is shown by SHOW ENGINE INNODB STATUS.  You might get the same information by manually querying the P_S table events_waits_summary_global_by_event_name after enabling the instruments "wait/synch/mutex/%" (and maybe 'wait/synch/sxlock/innodb/%').  

[2]: https://github.com/sjmudd/ps-top