MySQL Bugs: #115988: Too Much Disk Read on Startup, penalizing deployments with many tables (1M+).

Bug #115988	Too Much Disk Read on Startup, penalizing deployments with many tables (1M+).
Submitted:	3 Sep 2024 15:03	Modified:	4 Sep 2024 10:23
Reporter:	Jean-François Gagné	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: InnoDB storage engine	Severity:	S5 (Performance)
Version:	8.0.39, 8.4.2, 9.0.1	OS:	Any
Assigned to:		CPU Architecture:	Any
Tags:	Contribution, regression

Description:
Hi,

On MySQL startup, the first page of each table is read to get the Space Id.  But reading the first page is not only fetching 16 kilobytes from disk, it is fetching 32 kilobytes, which is problematic with many tables.  See How to repeat for details.  This can be fixed by what I call the "fadvise" optimization, more about this in Suggested fix and contributed patch (I will add the contribution soon).

Above applies to 8.0.39, 8.4.2 and 9.0.1 on Debian 12 with kernel 6.1.99.  It might not be reproducible on other distributions / kernels as prefetching / readahead might be configured differently.  However, this does not change the next part of this bug report.

Reading the whole first page of a table is not needed for getting the Space Id.  By optimizing the reading logic, the amount of data read can be lowered to 4 kilobytes, which is a big win on startup with many tables.  This is what I call the "light" optimization, more about this in Suggested fix and contributed patch (I will add the contribution soon).

With the fadvise and light optimizations, MySQL startup time on an AWS vm with local SSDs is reduced from 2:39 to 1:09 (1 minute 9 seconds), see Suggested fix for testing details.

Setting this report as a Serious / S2 because reading too much from disk or a long MySQL startup are not minor things.  Some could argue that this is an "optimization" / feature request, but I think there is also an argument to this being considered a serious bug and even a regression from 5.7 because, as shown in How to repeat, 5.7 reads fewer data from disk than 8.0 and starts faster.

And one last thing: the "light" optimization will probably not work on filesystems that have a 16-kilobyte block size (like a deployment on zfs).  This means that when avoiding the overhead of the double write buffer, one will have to pay an overhead on startup.  As usual, every optimization has a tradeoff ¯\_(ツ)_/¯.

Many thanks for looking into this,

Jean-François Gagné

How to repeat:
# These tests are done on an AWS m6id.xlarge instance (4 vcpus, 2 cores, 16 GB RAM and 220 GB SSD).

# Linux Kernel version (might be important for fadvise results).
uname -a
Linux ip-172-31-15-73 6.1.0-23-cloud-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.99-1 (2024-07-15) x86_64 GNU/Linux

# Debian version (might be important for fadvise results).
cat /etc/debian_version 
12.6

# A few utility functions:
# - status_string;
# - start;
# - stop;
# - create_tables.

function status_string() {
  printf "%6s %9s" $mv $1
}

# For diskstats: https://www.kernel.org/doc/Documentation/ABI/testing/procfs-diskstats
function start() {
  sudo sync
  sudo /usr/bin/bash -c "echo 3 > /proc/sys/vm/drop_caches"

  # I have a dedicated mount point for putting MySQL data dirs.
  #   Adjust below to where your data dir is, ideally not having
  #   other things in there that would influence diskstats.
  local dev=$(mount | grep /mnt/jgagne_sandboxes | cut -d " " -f 1 | cut -d "/" -f 3)
  local diskstats_before="$(cat /proc/diskstats | grep $dev)"

  ( ./start > /dev/null & )
  while sleep 1; do test -e data/mysql_sandbox*.pid && break; done | pv -t -N "$(status_string start)"

  local diskstats_after="$(cat /proc/diskstats | grep $dev)"

  diskstats="$(
    echo -e "$diskstats_before\n$diskstats_after" |
      awk '{print $6-sector_read, "sector read."; sector_read=$6}' |
      tail -n 1)"
  echo "$(status_string "diskstats"): $diskstats"
}

function stop() {
  local pid=$(cat data/mysql_sandbox*.pid)
  ( ./stop > /dev/null & )
  while sleep 1; do ps -p $pid | grep -q $pid || break; done
}

# The quickest way to create tables in 5.7 is by reducing durability.
# The quickest way to create tables in 8.0+ is by reducing durability and disabling redo logging.
function create_tables() {
  local sql_pre="set global sync_binlog=0, innodb_flush_log_at_trx_commit=2;"

  local sql_post="set global innodb_fast_shutdown = 0;"
  sql_post="$sql_post FLUSH BINARY LOGS; do sleep(2); purge binary logs before now();"

  test ${mv:0:1} = "5" || {
    sql_pre="$sql_pre   ALTER INSTANCE DISABLE INNODB REDO_LOG;"
    sql_post="$sql_post ALTER INSTANCE ENABLE  INNODB REDO_LOG;"
  }

  ./use <<< "create database test_jfg; $sql_pre"

  seq -f "CREATE TABLE t%06.0f(i int);" 1 $n | ./use test_jfg | pv -t -N "$(status_string creating)"

  ./use <<< "$sql_post"
}

# The actual test: 1M tables with 5.7 to 9.0.
n=1000000
for mv in 5.7.44 8.0.39 8.4.2 9.0.1; do
  cd; dbdeployer deploy single mysql_$mv > /dev/null
  cd ~/sandboxes/msb_mysql_${mv//./_}

  create_tables; stop; start

  stop; cd; rm -rf ~/sandboxes/msb_mysql_${mv//./_}
done

# Full results.
5.7.44  creating: 0:37:43
5.7.44     start: 0:00:50
5.7.44 diskstats: 2809958 sector read.
8.0.39  creating: 0:39:38
8.0.39     start: 0:02:43
8.0.39 diskstats: 65783020 sector read.
 8.4.2  creating: 0:42:07
 8.4.2     start: 0:02:44
 8.4.2 diskstats: 65685484 sector read.
 9.0.1  creating: 0:31:26
 9.0.1     start: 0:02:43
 9.0.1 diskstats: 65688772 sector read.

# Diskstats results (without the creating and start details).
# For 8.0.39, 8.4.2 and 9.0.1, 65+ million sectors are read while 32 per table is expected (one sector is 512 bytes).
# Below also shows the regression from 5.7 to 8.0 needing to read 20x sectors from disk.
5.7.44 diskstats: 2809958 sector read.
8.0.39 diskstats: 65783020 sector read.
 8.4.2 diskstats: 65685484 sector read.
 9.0.1 diskstats: 65688772 sector read.

# Start results (without the creating and diskstats details).
# Below shows the regression from 5.7 to 8.0 needing 3x time to start.
5.7.44     start: 0:00:50
8.0.39     start: 0:02:43
 8.4.2     start: 0:02:44
 9.0.1     start: 0:02:43

Suggested fix:
# These tests are done on an AWS m6id.xlarge instance (4 vcpus, 2 cores, 16 GB RAM and 220 GB SSD).

# Same kernel and Debian version as in How to repeat.

# These two global variables allow enabling the fadvise and light optimizations in the patched binary:
# - innodb_tablespace_startup_testing_fadvise;
# - innodb_tablespace_startup_testing_light.

# Below, I use a tgz of a data dir where I previously created 1 million tables.
# This tgz can easily be taken by extrapolating from How to repeat,
#   or below can be modified to create tables based on How to repeat.
ls -l ~/sandboxes/mysql_9.0.1.data.1000000.tgz 
-rw-r--r-- 1 jgagne jgagne 2464569915 Sep  1 15:43 /home/jgagne/sandboxes/mysql_9.0.1.data.1000000.tgz

# In my dbdeployer binary directory, I renamed mysqld to mysqld_org,
#   and I put my compiled binary there as mysqld_explo.
ls -l ~/opt/mysql/mysql_9.0.1/bin/mysqld{,_org,_explo}
lrwxrwxrwx 1 jgagne jgagne        10 Sep  1 15:49 /home/jgagne/opt/mysql/mysql_9.0.1/bin/mysqld -> mysqld_org
-rwxr-xr-x 1 jgagne jgagne 277748456 Aug 29 14:32 /home/jgagne/opt/mysql/mysql_9.0.1/bin/mysqld_explo
-rwxr-xr-x 1 jgagne jgagne  76973576 Aug 29 14:31 /home/jgagne/opt/mysql/mysql_9.0.1/bin/mysqld_org

# Needs the start and stop functions from How to repeat.

# A few additional utility functions:
# - status_string (new version);
# - set_bin;
# - stop_wipe_restore.

function status_string() {
  printf "%6s %13s %9s" $mv $str $1
}

# Below function is to swap between binaries.
function set_bin() {
  ( # In a sub-shell to not have to undo cd.
    cd ~/opt/mysql/mysql_$mv/bin
    rm -f mysqld
    ln -s mysqld_$1 mysqld
  )
}

function stop_wipe_restore() {
  stop
  rm -rf data
  set_bin $1
  str=$2
  pv -te ~/sandboxes/mysql_${mv}.data.${n}.tgz -N "$(status_string restore)" | tar -zx
}

# The actual test: 1M tables with 9.0.1, org and explo.
n=100000; mv="9.0.1"; {
  set_bin org
  cd; dbdeployer deploy single mysql_$mv > /dev/null
  cd ~/sandboxes/msb_mysql_${mv//./_}

  stop_wipe_restore org   default_org;   start
  stop_wipe_restore explo default_explo; start

  echo "innodb_tablespace_startup_testing_fadvise=ON" >> my.sandbox.cnf
  stop_wipe_restore explo fadvise; start

  echo "innodb_tablespace_startup_testing_light=ON"   >> my.sandbox.cnf
  stop_wipe_restore explo light;   start

  stop; cd; rm -rf ~/sandboxes/msb_mysql_${mv//./_}
  set_bin org
}

# Full results.
 9.0.1   default_org   restore: 0:13:03
 9.0.1   default_org     start: 0:02:45
 9.0.1   default_org diskstats: 65695545 sector read.
 9.0.1 default_explo   restore: 0:13:01
 9.0.1 default_explo     start: 0:02:39
 9.0.1 default_explo diskstats: 65695113 sector read.
 9.0.1       fadvise   restore: 0:13:02
 9.0.1       fadvise     start: 0:01:46
 9.0.1       fadvise diskstats: 33695617 sector read.
 9.0.1         light   restore: 0:13:01
 9.0.1         light     start: 0:01:09
 9.0.1         light diskstats: 9694921 sector read.

# Diskstats results (without the restore and start details).
# The fadvise optimization divides sector read by two, and the light by an additional four.
 9.0.1   default_org diskstats: 65695545 sector read.
 9.0.1 default_explo diskstats: 65695113 sector read.
 9.0.1       fadvise diskstats: 33695617 sector read.
 9.0.1         light diskstats: 9694921 sector read.

# Start results (without the restore and diskstats details).
# The explo binary has a similar "default" startup time,
#.  and fadvise reduces it to 1:46 (from 2:39), and light to 1:09.
 9.0.1   default_org     start: 0:02:45
 9.0.1 default_explo     start: 0:02:39
 9.0.1       fadvise     start: 0:01:46
 9.0.1         light     start: 0:01:09

Updating title (sorry for small mistake):
- now: Too Much Disk Read on Startup, penalizing deployments with many tables (1M+)
- was: Too Much Disk Read on Startup, penalizing deployments many tables (1M+)

More about this contribution in https://github.com/jfg956/mysql-server/pull/14 (*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: bug#115988.patch (application/octet-stream, text), 15.98 KiB.

Some notes about my above contribution extracted from https://github.com/jfg956/mysql-server/pull/14

This PR merges on 9.0.1, and the corresponding patch file applies to 8.4.2 and 8.0.39.  Because this PR significantly improves MySQL startup, I think it should be back ported to 8.0 and 8.4, and if it is considered a "breaking change", with feature flags / global variables to enable the optimizations, disabled by default.

This PR introduces two testing flags / global variables, disabled by default: innodb_tablespace_startup_testing_fadvise and innodb_tablespace_startup_testing_light.  They allow enabling the optimizations introduced by this PR.  When merging in 9.0, I assumed it did not make sense to keep these, hence "testing" naming and limited documentation.  If it is decided they are needed, they should be renamed and documented adequately, which I can do if asked by Oracle.

While doing this work, I "played with fire" and refactored a few things.  They are identified as "I / JFG had a visceral reaction ..." as comments in the code.  These and other comments are in the code for the clarity of the PR, they should be removed / adjusted when merging.

Hello Jean-François,

Thank you for the report and contribution.

regards,
Umesh