Description:
Hi,
On MySQL startup, the first page of each table is read to get the Space Id. But reading the first page does not fetch only 16 kilobytes from disk: it fetches 32 kilobytes, which is problematic with many tables. See How to repeat for details. This can be fixed by what I call the "fadvise" optimization, more about this in Suggested fix and contributed patch (I will add the contribution soon).
Above applies to 8.0.39, 8.4.2 and 9.0.1 on Debian 12 with kernel 6.1.99. It might not be reproducible on other distributions / kernels as prefetching / readahead might be configured differently. However, this does not change the next part of this bug report.
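To see how readahead is configured on a given system, the per-device readahead window can be inspected (a small sketch; device names and values will differ between systems):

```shell
# Print the kernel readahead window (in KB) for each block device.
# A 16 KB read from a cold file may trigger prefetching up to this window.
for q in /sys/block/*/queue/read_ahead_kb; do
  [ -r "$q" ] || continue
  printf '%s: %s KB\n' "$(basename "${q%/queue/read_ahead_kb}")" "$(cat "$q")"
done
```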
Reading the whole first page of a table is not needed to get the Space Id. By optimizing the reading logic, the amount of data read can be lowered to 4 kilobytes, which is a big win on startup with many tables. This is what I call the "light" optimization, more about this in Suggested fix and contributed patch (I will add the contribution soon).
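As an illustration of how little data is actually needed, the 4-byte Space Id can be extracted from the page header alone. This is a hypothetical sketch: the .ibd path is an example, and offset 34 assumes the fil-header Space Id field (FIL_PAGE_ARCH_LOG_NO_OR_SPACE_ID) of an uncompressed page:

```shell
# Hypothetical example path to a .ibd file; adjust to an actual table.
ibd=data/test_jfg/t000001.ibd
# Read only the 4-byte, big-endian Space Id at offset 34 of the first page,
# instead of the full 16 KB page.
dd if="$ibd" bs=1 skip=34 count=4 2> /dev/null | od -An -tu1 |
  awk '{print $1 * 2^24 + $2 * 2^16 + $3 * 2^8 + $4}'
```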
With the fadvise and light optimizations, MySQL startup time on an AWS VM with local SSDs is reduced from 2:39 (2 minutes 39 seconds) to 1:09 (1 minute 9 seconds), see Suggested fix for testing details.
Setting this report to Serious / S2 because reading too much from disk and a long MySQL startup are not minor things. Some could argue that this is an "optimization" / feature request, but I think there is also an argument for considering this a serious bug, and even a regression from 5.7, because, as shown in How to repeat, 5.7 reads less data from disk than 8.0 and starts faster.
And one last thing: the "light" optimization will probably not work on filesystems that have a 16-kilobyte block size (like a deployment on ZFS). This means that when avoiding the overhead of the doublewrite buffer, one will have to pay an overhead on startup. As usual, every optimization has a tradeoff ¯\_(ツ)_/¯.
Many thanks for looking into this,
Jean-François Gagné
How to repeat:
# These tests are done on an AWS m6id.xlarge instance (4 vCPUs, 2 cores, 16 GB RAM and 220 GB SSD).
# Linux Kernel version (might be important for fadvise results).
uname -a
Linux ip-172-31-15-73 6.1.0-23-cloud-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.99-1 (2024-07-15) x86_64 GNU/Linux
# Debian version (might be important for fadvise results).
cat /etc/debian_version
12.6
# A few utility functions:
# - status_string;
# - start;
# - stop;
# - create_tables.
function status_string() {
printf "%6s %9s" $mv $1
}
# For diskstats: https://www.kernel.org/doc/Documentation/ABI/testing/procfs-diskstats
function start() {
sudo sync
sudo /usr/bin/bash -c "echo 3 > /proc/sys/vm/drop_caches"
# I have a dedicated mount point for putting MySQL data dirs.
# Adjust below to where your data dir is, ideally not having
# other things in there that would influence diskstats.
local dev=$(mount | grep /mnt/jgagne_sandboxes | cut -d " " -f 1 | cut -d "/" -f 3)
local diskstats_before="$(cat /proc/diskstats | grep $dev)"
( ./start > /dev/null & )
while sleep 1; do test -e data/mysql_sandbox*.pid && break; done | pv -t -N "$(status_string start)"
local diskstats_after="$(cat /proc/diskstats | grep $dev)"
diskstats="$(
echo -e "$diskstats_before\n$diskstats_after" |
awk '{print $6-sector_read, "sector read."; sector_read=$6}' |
tail -n 1)"
echo "$(status_string "diskstats"): $diskstats"
}
function stop() {
local pid=$(cat data/mysql_sandbox*.pid)
( ./stop > /dev/null & )
while sleep 1; do ps -p $pid | grep -q $pid || break; done
}
# The quickest way to create tables in 5.7 is by reducing durability.
# The quickest way to create tables in 8.0+ is by reducing durability and disabling redo logging.
function create_tables() {
local sql_pre="set global sync_binlog=0, innodb_flush_log_at_trx_commit=2;"
local sql_post="set global innodb_fast_shutdown = 0;"
sql_post="$sql_post FLUSH BINARY LOGS; do sleep(2); purge binary logs before now();"
test ${mv:0:1} = "5" || {
sql_pre="$sql_pre ALTER INSTANCE DISABLE INNODB REDO_LOG;"
sql_post="$sql_post ALTER INSTANCE ENABLE INNODB REDO_LOG;"
}
./use <<< "create database test_jfg; $sql_pre"
seq -f "CREATE TABLE t%06.0f(i int);" 1 $n | ./use test_jfg | pv -t -N "$(status_string creating)"
./use <<< "$sql_post"
}
# The actual test: 1M tables with 5.7 to 9.0.
n=1000000
for mv in 5.7.44 8.0.39 8.4.2 9.0.1; do
cd; dbdeployer deploy single mysql_$mv > /dev/null
cd ~/sandboxes/msb_mysql_${mv//./_}
create_tables; stop; start
stop; cd; rm -rf ~/sandboxes/msb_mysql_${mv//./_}
done
# Full results.
5.7.44 creating: 0:37:43
5.7.44 start: 0:00:50
5.7.44 diskstats: 2809958 sector read.
8.0.39 creating: 0:39:38
8.0.39 start: 0:02:43
8.0.39 diskstats: 65783020 sector read.
8.4.2 creating: 0:42:07
8.4.2 start: 0:02:44
8.4.2 diskstats: 65685484 sector read.
9.0.1 creating: 0:31:26
9.0.1 start: 0:02:43
9.0.1 diskstats: 65688772 sector read.
# Diskstats results (without the creating and start details).
# For 8.0.39, 8.4.2 and 9.0.1, 65+ million sectors are read while 32 sectors per table (16 KB) would be expected (one sector is 512 bytes).
# Below also shows the regression from 5.7 to 8.0, which needs to read more than 20x the sectors from disk.
5.7.44 diskstats: 2809958 sector read.
8.0.39 diskstats: 65783020 sector read.
8.4.2 diskstats: 65685484 sector read.
9.0.1 diskstats: 65688772 sector read.
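# As a sanity check (not part of the original test), the per-table read cost
# can be derived from the 8.0.39 total (1 sector = 512 bytes):

```shell
# 65783020 sectors read for 1,000,000 tables: about 66 sectors, i.e. about
# 33 KB, per table -- consistent with the 32 KB per first page read
# described in the Description.
awk 'BEGIN { s = 65783020; n = 1000000
  printf "%.1f sectors = %.1f KB per table\n", s / n, s * 512 / 1024 / n }'
```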
# Start results (without the creating and diskstats details).
# Below shows the regression from 5.7 to 8.0, which needs more than 3x the time to start.
5.7.44 start: 0:00:50
8.0.39 start: 0:02:43
8.4.2 start: 0:02:44
9.0.1 start: 0:02:43
Suggested fix:
# These tests are done on an AWS m6id.xlarge instance (4 vCPUs, 2 cores, 16 GB RAM and 220 GB SSD).
# Same kernel and Debian version as in How to repeat.
# These two global variables allow enabling the fadvise and light optimizations in the patched binary:
# - innodb_tablespace_startup_testing_fadvise;
# - innodb_tablespace_startup_testing_light.
# Below, I use a tgz of a data dir where I previously created 1 million tables.
# This tgz can easily be created by adapting How to repeat,
# or below can be modified to create tables based on How to repeat.
ls -l ~/sandboxes/mysql_9.0.1.data.1000000.tgz
-rw-r--r-- 1 jgagne jgagne 2464569915 Sep 1 15:43 /home/jgagne/sandboxes/mysql_9.0.1.data.1000000.tgz
# In my dbdeployer binary directory, I renamed mysqld to mysqld_org,
# and I put my compiled binary there as mysqld_explo.
ls -l ~/opt/mysql/mysql_9.0.1/bin/mysqld{,_org,_explo}
lrwxrwxrwx 1 jgagne jgagne 10 Sep 1 15:49 /home/jgagne/opt/mysql/mysql_9.0.1/bin/mysqld -> mysqld_org
-rwxr-xr-x 1 jgagne jgagne 277748456 Aug 29 14:32 /home/jgagne/opt/mysql/mysql_9.0.1/bin/mysqld_explo
-rwxr-xr-x 1 jgagne jgagne 76973576 Aug 29 14:31 /home/jgagne/opt/mysql/mysql_9.0.1/bin/mysqld_org
# Needs the start and stop functions from How to repeat.
# A few additional utility functions:
# - status_string (new version);
# - set_bin;
# - stop_wipe_restore.
function status_string() {
printf "%6s %13s %9s" $mv $str $1
}
# The function below swaps between binaries.
function set_bin() {
( # In a sub-shell to not have to undo cd.
cd ~/opt/mysql/mysql_$mv/bin
rm -f mysqld
ln -s mysqld_$1 mysqld
)
}
function stop_wipe_restore() {
stop
rm -rf data
set_bin $1
str=$2
pv -te ~/sandboxes/mysql_${mv}.data.${n}.tgz -N "$(status_string restore)" | tar -zx
}
# The actual test: 1M tables with 9.0.1, org and explo.
n=1000000; mv="9.0.1"; {
set_bin org
cd; dbdeployer deploy single mysql_$mv > /dev/null
cd ~/sandboxes/msb_mysql_${mv//./_}
stop_wipe_restore org default_org; start
stop_wipe_restore explo default_explo; start
echo "innodb_tablespace_startup_testing_fadvise=ON" >> my.sandbox.cnf
stop_wipe_restore explo fadvise; start
echo "innodb_tablespace_startup_testing_light=ON" >> my.sandbox.cnf
stop_wipe_restore explo light; start
stop; cd; rm -rf ~/sandboxes/msb_mysql_${mv//./_}
set_bin org
}
# Full results.
9.0.1 default_org restore: 0:13:03
9.0.1 default_org start: 0:02:45
9.0.1 default_org diskstats: 65695545 sector read.
9.0.1 default_explo restore: 0:13:01
9.0.1 default_explo start: 0:02:39
9.0.1 default_explo diskstats: 65695113 sector read.
9.0.1 fadvise restore: 0:13:02
9.0.1 fadvise start: 0:01:46
9.0.1 fadvise diskstats: 33695617 sector read.
9.0.1 light restore: 0:13:01
9.0.1 light start: 0:01:09
9.0.1 light diskstats: 9694921 sector read.
# Diskstats results (without the restore and start details).
# The fadvise optimization roughly halves the sectors read, and the light optimization divides them by roughly another 3.5x.
9.0.1 default_org diskstats: 65695545 sector read.
9.0.1 default_explo diskstats: 65695113 sector read.
9.0.1 fadvise diskstats: 33695617 sector read.
9.0.1 light diskstats: 9694921 sector read.
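# Converting these totals to per-table reads (1 sector = 512 bytes; the totals
# include some fixed startup reads beyond the per-table first-page reads, which
# is presumably why the light number lands slightly above the 4 KB target):

```shell
awk 'BEGIN { n = 1000000
  printf "default: %.1f KB per table\n", 65695545 * 512 / 1024 / n
  printf "fadvise: %.1f KB per table\n", 33695617 * 512 / 1024 / n
  printf "light: %.1f KB per table\n",    9694921 * 512 / 1024 / n }'
```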
# Start results (without the restore and diskstats details).
# The explo binary has a similar "default" startup time;
# fadvise reduces it to 1:46 (from 2:39), and light to 1:09.
9.0.1 default_org start: 0:02:45
9.0.1 default_explo start: 0:02:39
9.0.1 fadvise start: 0:01:46
9.0.1 light start: 0:01:09