MySQL Bugs: #92133: DICT_SYS mutex contention causes complete stall when running with 40 mill tables

Bug #92133	DICT_SYS mutex contention causes complete stall when running with 40 mill tables
Submitted:	22 Aug 2018 17:43	Modified:	27 Oct 2018 15:02
Reporter:	Alexander Rubin	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: Data Dictionary	Severity:	S5 (Performance)
Version:	8.0.12	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
When running with millions of tables (and shared tablespaces) I can see the complete stall waiting on DICT_SYS mutex

Sysbench load:

[ 453s ] thds: 32 tps: 1203.96 qps: 1203.96 (r/w/o: 1203.96/0.00/0.00) lat (ms,95%): 41.10 err/s: 0.00 reconn/s: 0.00
[ 454s ] thds: 32 tps: 1202.32 qps: 1202.32 (r/w/o: 1202.32/0.00/0.00) lat (ms,95%): 42.61 err/s: 0.00 reconn/s: 0.00
[ 455s ] thds: 32 tps: 1196.74 qps: 1196.74 (r/w/o: 1196.74/0.00/0.00) lat (ms,95%): 41.10 err/s: 0.00 reconn/s: 0.00
[ 456s ] thds: 32 tps: 1197.18 qps: 1197.18 (r/w/o: 1197.18/0.00/0.00) lat (ms,95%): 41.10 err/s: 0.00 reconn/s: 0.00
[ 457s ] thds: 32 tps: 887.11 qps: 887.11 (r/w/o: 887.11/0.00/0.00) lat (ms,95%): 41.10 err/s: 0.00 reconn/s: 0.00
[ 458s ] thds: 32 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 459s ] thds: 32 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 460s ] thds: 32 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 461s ] thds: 32 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 462s ] thds: 32 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 463s ] thds: 32 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 464s ] thds: 32 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 465s ] thds: 32 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 466s ] thds: 32 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 467s ] thds: 32 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 468s ] thds: 32 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 469s ] thds: 32 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 470s ] thds: 32 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 471s ] thds: 32 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 472s ] thds: 32 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 473s ] thds: 32 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 474s ] thds: 32 tps: 403.96 qps: 403.96 (r/w/o: 403.96/0.00/0.00) lat (ms,95%): 16819.24 err/s: 0.00 reconn/s: 0.00
[ 475s ] thds: 32 tps: 1196.00 qps: 1196.00 (r/w/o: 1196.00/0.00/0.00) lat (ms,95%): 41.85 err/s: 0.00 reconn/s: 0.00
[ 476s ] thds: 32 tps: 1208.96 qps: 1208.96 (r/w/o: 1208.96/0.00/0.00) lat (ms,95%): 41.85 err/s: 0.00 reconn/s: 0.00
[ 477s ] thds: 32 tps: 1192.06 qps: 1192.06 (r/w/o: 1192.06/0.00/0.00) lat (ms,95%): 41.85 err/s: 0.00 reconn/s: 0.00
[ 478s ] thds: 32 tps: 1173.89 qps: 1173.89 (r/w/o: 1173.89/0.00/0.00) lat (ms,95%): 43.39 err/s: 0.00 reconn/s: 0.00
[ 479s ] thds: 32 tps: 1207.07 qps: 1207.07 (r/w/o: 1207.07/0.00/0.00) lat (ms,95%): 41.85 err/s: 0.00 reconn/s: 0.00

Show engine innodb status shows:

----------
SEMAPHORES
----------
OS WAIT ARRAY INFO: reservation count 2195106
--Thread 140451928655616 has waited at fts0opt.cc line 2703 for 13.00 seconds the semaphore:
Mutex at 0x7fbeb83e6bd8, Mutex DICT_SYS created dict0dict.cc:1019, lock var 1

--Thread 140451898394368 has waited at ha_innodb.cc line 6080 for 13.00 seconds the semaphore:
Mutex at 0x7fbeb83e6bd8, Mutex DICT_SYS created dict0dict.cc:1019, lock var 1

--Thread 140457041856256 has waited at ha_innodb.cc line 6080 for 13.00 seconds the semaphore:
Mutex at 0x7fbeb83e6bd8, Mutex DICT_SYS created dict0dict.cc:1019, lock var 1

--Thread 140451897509632 has waited at dict0dd.ic line 273 for 13.00 seconds the semaphore:
Mutex at 0x7fbeb83e6bd8, Mutex DICT_SYS created dict0dict.cc:1019, lock var 1

--Thread 140451898689280 has waited at ha_innodb.cc line 5814 for 13.00 seconds the semaphore:
Mutex at 0x7fbeb83e6bd8, Mutex DICT_SYS created dict0dict.cc:1019, lock var 1

--Thread 140456571414272 has waited at ha_innodb.cc line 6080 for 13.00 seconds the semaphore:
Mutex at 0x7fbeb83e6bd8, Mutex DICT_SYS created dict0dict.cc:1019, lock var 1

--Thread 140456571119360 has waited at ha_innodb.cc line 6080 for 13.00 seconds the semaphore:
Mutex at 0x7fbeb83e6bd8, Mutex DICT_SYS created dict0dict.cc:1019, lock var 1

--Thread 140457043920640 has waited at ha_innodb.cc line 6080 for 13.00 seconds the semaphore:
Mutex at 0x7fbeb83e6bd8, Mutex DICT_SYS created dict0dict.cc:1019, lock var 1

--Thread 140451896329984 has waited at ha_innodb.cc line 6080 for 13.00 seconds the semaphore:
Mutex at 0x7fbeb83e6bd8, Mutex DICT_SYS created dict0dict.cc:1019, lock var 1

--Thread 140451898099456 has waited at ha_innodb.cc line 6080 for 13.00 seconds the semaphore:
Mutex at 0x7fbeb83e6bd8, Mutex DICT_SYS created dict0dict.cc:1019, lock var 1

--Thread 140457040971520 has waited at ha_innodb.cc line 6080 for 13.00 seconds the semaphore:
Mutex at 0x7fbeb83e6bd8, Mutex DICT_SYS created dict0dict.cc:1019, lock var 1

--Thread 140451897214720 has waited at ha_innodb.cc line 6080 for 13.00 seconds the semaphore:
Mutex at 0x7fbeb83e6bd8, Mutex DICT_SYS created dict0dict.cc:1019, lock var 1

--Thread 140456572004096 has waited at ha_innodb.cc line 6080 for 13.00 seconds the semaphore:
Mutex at 0x7fbeb83e6bd8, Mutex DICT_SYS created dict0dict.cc:1019, lock var 1

--Thread 140457043035904 has waited at ha_innodb.cc line 6080 for 13.00 seconds the semaphore:
Mutex at 0x7fbeb83e6bd8, Mutex DICT_SYS created dict0dict.cc:1019, lock var 1

--Thread 140451899279104 has waited at ha_innodb.cc line 5814 for 13.00 seconds the semaphore:
Mutex at 0x7fbeb83e6bd8, Mutex DICT_SYS created dict0dict.cc:1019, lock var 1

How to repeat:
Create 1 million to 40 million tables with MySQL 8.0.12.
Script to create 40M tables in parallel:
============================
#/bin/bash
function do_db {
        db_exist=$(mysql -A -s -Nbe "select 1 from information_schema.schemata where schema_name = '$db'")
        if [ "$db_exist" == "1" ]; then echo "Already exists $db"; return 0; fi; 
        mysql -vvv -e "create database $db";
        mysql -vvv $db -e "CREATE TABLESPACE $db ADD DATAFILE '$db.ibd' engine=InnoDB;"
        for i in {1..100}
        do
                table="CREATE TABLE sbtest$i ( id int(10) unsigned NOT NULL AUTO_INCREMENT, k int(10) unsigned NOT NULL DEFAULT '0', c varchar(120) NOT NULL DEFAULT '', pad varchar(60) NOT NULL DEFAULT '', PRIMARY KEY (id), KEY k_1 (k) ) ENGINE=InnoDB DEFAULT CHARSET=latin1 tablespace $db;"
                mysql $db -e "$table"
        done
}
c=0
for m in {1..4000000}
do
        for i in {1..40}
        do
                let c=$c+1
                echo $c
                db="sbtest_$c"
                do_db &
        done
        wait
        #if [ $c > 4000000 ]; then exit; fi
done
============================

Run sysbench:
============================
function run_sb() {
conn=" --db-driver=mysql --mysql-socket=/var/lib/mysql-data/mysql.sock  --mysql-db=sbtest_1 --mysql-user=sbtest --mysql-password=abc "
sysbench $conn --oltp_db_count=$db_count --oltp_tables_count=$table_count --oltp-table-size=10000 --report-interval=1 --num-threads=$num_threads --max-requests=0 --max-time=$max_time ./select_custom.lua run | tee -a sysbench_2.txt

}

let db_count=400000
table_count=100
max_time=10000
num_threads=32
run_sb 
============================

Verified but I changed it to S5 (performance).

Thanks for the report
Bogdan

Any update on this? This is a very concerning issue for my company as we have millions of tables.

Scripts to reproduce:

1. Create 20K+ tables first

#/bin/bash
function do_db {
        db_exist=$(mysql -A -s -Nbe "select 1 from information_schema.schemata where schema_name = '$db'")
        if [ "$db_exist" == "1" ]; then echo "Already exists $db"; return 0; fi; 
        mysql -vvv -e "create database $db";
        mysql -vvv $db -e "CREATE TABLESPACE $db ADD DATAFILE '$db.ibd' engine=InnoDB;"
        for i in {1..100}
        do
                table="CREATE TABLE sbtest$i ( id int(10) unsigned NOT NULL AUTO_INCREMENT, k int(10) unsigned NOT NULL DEFAULT '0', c varchar(120) NOT NULL DEFAULT '', pad varchar(60) NOT NULL DEFAULT '', PRIMARY KEY (id), KEY k_1 (k) ) ENGINE=InnoDB DEFAULT CHARSET=latin1 tablespace $db;"
                mysql $db -e "$table"
        done
}
c=0
for m in {1..4000000}
do
        for i in {1..40}
        do
                let c=$c+1
                echo $c
                db="sbtest_$c"
                do_db &
        done
        wait
        #if [ $c > 4000000 ]; then exit; fi
done

2. Run this script:

##### shel script, sysbench_mysql.sh
#####

function run_sb() {
        conn=" --db-driver=mysql --mysql-socket=/var/run/mysqld/mysqld.sock  --mysql-db=mysql --mysql-user=sbtest --mysql-password=abc "
        sysbench $conn --oltp_db_count=$db_count --oltp_tables_count=$table_count --oltp-table-size=10000 --report-interval=1 --num-threads=$num_threads --max-requests=0 --max-time=$max_time ./select_custom.lua run | tee -a sysbench_2.txt
}
let db_count=70000
table_count=1000
max_time=10000
num_threads=32
run_sb

#### select_custom.lua
####

pathtest = "/usr/share/sysbench/tests/include/oltp_legacy/"
if pathtest then
   dofile(pathtest .. "common.lua")
else
   require("common")
end
function thread_init(thread_id)
   set_vars()
end
function event()
   local table_name
   local i
   local c_val
   local k_val
   local pad_val
   oltp_db_count = tonumber(oltp_db_count) or 1
   -- local oltp_db_count = 4
   table_name = "sbtest_" .. sb_rand(100, oltp_db_count)..".sbtest".. sb_rand(1, oltp_tables_count)
   k_val = sb_rand(1, oltp_table_size)
   c_val = sb_rand_str([[
###########-###########-###########-###########-###########-###########-###########-###########-###########-###########]])
   pad_val = sb_rand_str([[
###########-###########-###########-###########-###########]])
      rs = db_query("SELECT id FROM " .. table_name .." LIMIT 1")
end

Scripts to reproduce:

1. Create 20K+ tables first

#/bin/bash
function do_db {
        db_exist=$(mysql -A -s -Nbe "select 1 from information_schema.schemata where schema_name = '$db'")
        if [ "$db_exist" == "1" ]; then echo "Already exists $db"; return 0; fi; 
        mysql -vvv -e "create database $db";
        mysql -vvv $db -e "CREATE TABLESPACE $db ADD DATAFILE '$db.ibd' engine=InnoDB;"
        for i in {1..100}
        do
                table="CREATE TABLE sbtest$i ( id int(10) unsigned NOT NULL AUTO_INCREMENT, k int(10) unsigned NOT NULL DEFAULT '0', c varchar(120) NOT NULL DEFAULT '', pad varchar(60) NOT NULL DEFAULT '', PRIMARY KEY (id), KEY k_1 (k) ) ENGINE=InnoDB DEFAULT CHARSET=latin1 tablespace $db;"
                mysql $db -e "$table"
        done
}
c=0
for m in {1..4000000}
do
        for i in {1..40}
        do
                let c=$c+1
                echo $c
                db="sbtest_$c"
                do_db &
        done
        wait
        #if [ $c > 4000000 ]; then exit; fi
done

2. Run this script:

##### shel script, sysbench_mysql.sh
#####

function run_sb() {
        conn=" --db-driver=mysql --mysql-socket=/var/run/mysqld/mysqld.sock  --mysql-db=mysql --mysql-user=sbtest --mysql-password=abc "
        sysbench $conn --oltp_db_count=$db_count --oltp_tables_count=$table_count --oltp-table-size=10000 --report-interval=1 --num-threads=$num_threads --max-requests=0 --max-time=$max_time ./select_custom.lua run | tee -a sysbench_2.txt
}
let db_count=70000
table_count=1000
max_time=10000
num_threads=32
run_sb

#### select_custom.lua
####

pathtest = "/usr/share/sysbench/tests/include/oltp_legacy/"
if pathtest then
   dofile(pathtest .. "common.lua")
else
   require("common")
end
function thread_init(thread_id)
   set_vars()
end
function event()
   local table_name
   local i
   local c_val
   local k_val
   local pad_val
   oltp_db_count = tonumber(oltp_db_count) or 1
   -- local oltp_db_count = 4
   table_name = "sbtest_" .. sb_rand(100, oltp_db_count)..".sbtest".. sb_rand(1, oltp_tables_count)
   k_val = sb_rand(1, oltp_table_size)
   c_val = sb_rand_str([[
###########-###########-###########-###########-###########-###########-###########-###########-###########-###########]])
   pad_val = sb_rand_str([[
###########-###########-###########-###########-###########]])
      rs = db_query("SELECT id FROM " .. table_name .." LIMIT 1")
end

perf top with mysql-debug:

  24.41%  mysqld-debug         [.] _Z23dict_table_add_to_cacheP12dict_table_tmP16mem_block_info_t                                                                                                          
  15.13%  mysqld-debug         [.] _ZNK15CheckInFreeListclEPK10buf_page_t                                                                                                                                  
  12.59%  mysqld-debug         [.] _Z16ut_list_validateI12ut_list_baseI10buf_page_tMS1_12ut_list_nodeIS1_EE15CheckInFreeListEvRKT_RT0_

is there any updated on the bug, looks like issue is still present in latest versions of mysql

I encountered this bug in a production percona cluster running 8.0.25-15.
After some investigation using the innodb status output, I came across this
bug and corresponding blog post. With that I was able to create a local
testcase that reproduced the issue fairly reliably, and then I did some
profiling to figure out where the mutex holder thread was spending time.

I left some more details on percona's github at the time:
https://github.com/percona/percona-server/pull/4825

But, since this issue is present in upstream MySQL, I'm bringing it here.
The caveat I have is that I have not yet tried reproducing this with latest
MySQL, but the code I identified as causing the performance regression is
still present in tip.

'LatchCounter uses a std::vector for storing registered counters, which
leads to n^2 behavior during dict_make_room_in_cache; this can hold the
dict_sys->mutex for multiple seconds at a time in the case where a very
large number of innodb tables (200,000+) have been opened since the last
cleaning round.'

The 'n^2 behavior' is during deregistering a LatchCounter associated with
a table, which happens when freeing a table entry:

storage/innobase/include/sync0types.h:700:
    m_mutex.enter();

    m_counters.erase(std::remove(m_counters.begin(), m_counters.end(), count),
                     m_counters.end());

    m_mutex.exit();

My 'fix' at the time was to simply use unordered_set instead; this prevented
the stalls in testing, but it seemed to have some performance impact. My
measurements weren't scientific enough for me to say whether that impact was
significant.

Use unordered_set for LatchCounter storage (*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: 0001-Use-unordered_set-for-LatchCounter-storage.patch (text/x-patch), 2.28 KiB.

After further performance testing, I have concluded that the impact of using
vector instead of unordered_set for LatchCounter storage is not significant.

The change to unordered_set reduces the stall duration from ~20s to ~2s in my
test case:

https://robert-nix.github.io/mysql-loadgen/stall-measurement.html

This change does have a visible effect on program behavior:  the order of mutex
instances from the same source location in the output of SHOW ENGINE INNODB
MUTEX is no longer the order of mutex creation and is instead 'randomized' by
the hashmap iteration.

I did try an alternative structure which preserves order and has similar memory
overhead to the hashmap, a boost::intrusive::list, but I think the added
complexity isn't worth the order preservation in this case. From my
observations, in practice, most of these table mutexes have no spins/waits and
so won't show up in the output, and the order of mutex creation is not a useful
signal since no other metadata is printed alongside the latch status.

I have also done some profiling during the new 2 seconds of stall, and I can't
see any other 'easy' ways to reduce the stall duration.