Description:
Recently I was testing performance after crash recovery using sysbench, and I found that under high concurrency the TPS in the first 30 seconds drops unexpectedly. Through our team's performance monitoring tool, I found that the cause of the TPS drop is an abnormal amount of dirty page flushing.
My test runs on an Alibaba Cloud ECS instance, and the disk is an ESSD with 350 MB/s I/O throughput. Flushing about 20000 pages per second is enough to saturate the disk, so the unexpected dirty page flushing has a great impact on performance.
========================================
By reading the code and observing the monitoring information in information_schema.innodb_metrics, I located the root cause of the problem. Let me first explain the reason for this bug through the code.
1. First and most important, InnoDB recalculates lsn_avg_rate every innodb_flushing_avg_loops (default 30) iterations of the page cleaner, i.e. roughly every 30 seconds. When the server starts up from crash recovery, lsn_avg_rate is therefore always 0 during the first innodb_flushing_avg_loops seconds. A simplified sketch of this averaging follows.
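To illustrate, here is a minimal compilable paraphrase of that moving-average bookkeeping (a sketch only, not the literal InnoDB source; the names lsn_avg_rate, prev_lsn and srv_flushing_avg_loops follow buf0flu.cc, everything else is simplified):
"
#include <cstdint>

using lsn_t = uint64_t;
using ulint = unsigned long;

static const ulint srv_flushing_avg_loops = 30;  // innodb_flushing_avg_loops

// Called roughly once per second by the page cleaner coordinator.
void update_lsn_avg_rate(lsn_t cur_lsn, double time_elapsed, lsn_t &prev_lsn,
                         lsn_t &lsn_avg_rate, ulint &n_iterations) {
  if (++n_iterations >= srv_flushing_avg_loops) {
    const lsn_t lsn_rate = static_cast<lsn_t>(
        static_cast<double>(cur_lsn - prev_lsn) / time_elapsed);
    // Exponential moving average. lsn_avg_rate starts at 0 after startup,
    // and nothing updates it until the first full window (~30s) completes,
    // so everyone reading it during that window sees 0.
    lsn_avg_rate = (lsn_avg_rate + lsn_rate) / 2;
    prev_lsn = cur_lsn;
    n_iterations = 0;
  }
}
"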
2. In Adaptive_flush::set_flush_target_by_lsn, InnoDB calculates target_lsn as follows:
"
target_lsn = oldest_lsn + lsn_avg_rate * buf_flush_lsn_scan_factor;
"
Since lsn_avg_rate is 0, target_lsn is equal to oldest_lsn, and there are almost no dirty pages whose oldest modification LSN is below this value.
3. Because of this degenerate target_lsn, InnoDB computes an extremely low sum_pages_for_lsn: the scan below breaks almost immediately, pages_for_lsn is close to 0, and page_cleaner->slots[i].n_pages_requested becomes 0 / scan_factor + 1 = 1 for each buffer pool instance:
"
for (ulint i = 0; i < srv_buf_pool_instances; i++) {
  // ...
  for (buf_page_t *b = UT_LIST_GET_LAST(buf_pool->flush_list); b != nullptr;
       b = UT_LIST_GET_PREV(list, b)) {
    if (b->get_oldest_lsn() > target_lsn) {
      break;
    }
    ++pages_for_lsn;
    // ...
  }
  sum_pages_for_lsn += pages_for_lsn;
  // ...
  page_cleaner->slots[i].n_pages_requested = pages_for_lsn / scan_factor + 1;
  // ...
}
"
4. To avoid a division by zero, InnoDB clamps sum_pages_for_lsn to at least 1:
"
sum_pages_for_lsn /= scan_factor;
if (sum_pages_for_lsn < 1) {
  sum_pages_for_lsn = 1;
}
"
5. Finally, InnoDB computes the per-slot value of page_cleaner->slots[i].n_pages_requested:
"
for (ulint i = 0; i < srv_buf_pool_instances; i++) {
  /* if REDO has enough of free space,
  don't care about age distribution of pages */
  page_cleaner->slots[i].n_pages_requested =
      pct_for_lsn > 30 ? page_cleaner->slots[i].n_pages_requested * n_pages /
                                 sum_pages_for_lsn +
                             1
                       : n_pages / srv_buf_pool_instances + 1;
}
"
You can see that when pct_for_lsn > 30, InnoDB takes the age distribution of pages into account. In the formula, `page_cleaner->slots[i].n_pages_requested / sum_pages_for_lsn` represents each buffer pool instance's share of the old pages. But in this case that ratio is 1/1 = 1 for every slot.
Then you will find that, instead of distributing `n_pages` across the page cleaner slots, EVERY slot is asked to flush `n_pages`! So the total amount of dirty page flushing is actually about `srv_buf_pool_instances * n_pages` (one `n_pages` per buffer pool instance) rather than `n_pages`. The arithmetic is illustrated below.
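To make the arithmetic concrete, here is a minimal standalone demo (all numbers are hypothetical, chosen only to illustrate the effect: 8 buffer pool instances, an overall recommendation of n_pages = 10000, and the degenerate per-slot request of 1 with sum_pages_for_lsn clamped to 1):
"
#include <cstdio>

int main() {
  const unsigned long srv_buf_pool_instances = 8;  // hypothetical
  const unsigned long n_pages = 10000;             // overall flush target
  const unsigned long sum_pages_for_lsn = 1;       // degenerate, clamped to 1
  const unsigned long pct_for_lsn = 40;            // redo filling up (> 30)

  unsigned long total_requested = 0;
  for (unsigned long i = 0; i < srv_buf_pool_instances; i++) {
    unsigned long slot_requested = 1;  // pages_for_lsn / scan_factor + 1
    slot_requested = pct_for_lsn > 30
                         ? slot_requested * n_pages / sum_pages_for_lsn + 1
                         : n_pages / srv_buf_pool_instances + 1;
    total_requested += slot_requested;
  }
  // Prints 80008: every slot is asked to flush ~n_pages pages instead of
  // the slots sharing n_pages between them.
  std::printf("total requested = %lu\n", total_requested);
  return 0;
}
"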
How to repeat:
1. You need a monitor to get the exact value of flushed pages every second.
2. Enable the buffer module counters of information_schema.innodb_metrics in my.cnf:
"
innodb_monitor_enable = 'module_buffer'
"
3. Run sysbench:
3.1 prepare
3.2 warm up for 200s
3.3 kill mysqld and restart it
3.4 connect to mysqld once the server is ready for connections
3.5 run a high-concurrency test; the mode should preferably be oltp_write_only or oltp_read_write, which generate a lot of redo and push the LSN forward quickly
4. Observe information_schema.innodb_metrics: you will find that when `buffer_flush_pct_for_lsn` goes above 30, InnoDB flushes far more pages than `buffer_flush_n_to_flush_requested`. You can also see that, in the first 30s, `buffer_flush_n_to_flush_by_age` is always 1, while `buffer_flush_lsn_avg_rate` and `buffer_flush_avg_page_rate` stay at 0.
If you find it hard to repeat, try decreasing your redo log capacity, which makes it easier to reach `pct_for_lsn > 30`. Meanwhile, try enlarging your buffer pool, to make sure that InnoDB always flushes dirty pages adaptively to the LSN. An example configuration is given below.
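A starting point for my.cnf (the values are only illustrative and need tuning to your machine; innodb_redo_log_capacity requires MySQL 8.0.30+, older versions should shrink innodb_log_file_size * innodb_log_files_in_group instead):
"
# small redo capacity so that pct_for_lsn exceeds 30 quickly
innodb_redo_log_capacity = 1G
# large buffer pool so that flushing stays LSN-driven
innodb_buffer_pool_size = 32G
innodb_monitor_enable = 'module_buffer'
"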
========================================
I can give you my test.sh; you can modify and use it:
"
EXECDIR="/root/sbtest"
test_name=$1
table_size_set=4000000
table_num=10
warm_time=200
run_time=200
instances=('x.x.x.x:3306')
modes=('oltp_write_only')
conns=('128')
EXEC=/u01/mysql80_current/bin/mysqld
LOG_DIR=/u01/3306/log/mysql
CNF=/u01/3306/my.cnf
EXT_OPT="--user=root"
TARGET_ECS_IP=x.x.x.x
kill_mysqld()
{
    mysqld_pid=$(ssh root@$TARGET_ECS_IP "pidof mysqld")
    # pidof prints nothing when mysqld is not running, so test for a
    # non-empty string instead of comparing a possibly empty value to 0
    if [ -n "$mysqld_pid" ];then
        ssh root@$TARGET_ECS_IP "kill -9 $mysqld_pid"
    fi
}
start_and_wait_for_ready()
{
    LOG=$LOG_DIR/$1.log
    SYSBENCH_LOG=${EXECDIR}/$test_name/$1.log.log
    ssh root@$TARGET_ECS_IP "rm -rf $LOG"
    ssh root@$TARGET_ECS_IP "$EXEC --defaults-file=$CNF $EXT_OPT --log-error=$LOG &"
    START_DONE=0
    while [ $START_DONE -eq 0 ]
    do
        sleep 0.5
        START_DONE=$(ssh root@$TARGET_ECS_IP "grep -c 'mysqld: ready for connections' $LOG")
    done
    echo "=========TIMESTAMP=========" >> $SYSBENCH_LOG
    ssh root@$TARGET_ECS_IP "grep '\[RTO\]' $LOG" >> $SYSBENCH_LOG
    ssh root@$TARGET_ECS_IP "grep 'mysqld: ready for connections' $LOG" >> $SYSBENCH_LOG
    echo "=========TIMESTAMP=========" >> $SYSBENCH_LOG
}
if [ "$test_name" == "" ];then
echo "Test name not given."
fi
mkdir -p ${EXECDIR}/$test_name
for ip_port in ${instances[*]}
do
    host=`echo $ip_port | awk -F ':' '{print $1}'`
    port=`echo $ip_port | awk -F ':' '{print $2}'`
    prepare_cmd="sysbench oltp_read_write --mysql-host=$host --mysql-port=$port --mysql-user=test --mysql-password=test123 --tables=$table_num --threads=$table_num --table_size=$table_size_set --mysql-db=sbtest --rand-type=special --rand-spec-pct=15 prepare"
    $prepare_cmd > ${EXECDIR}/$test_name/${test_name}.prepare
    for mode in ${modes[*]}
    do
        cmd="sysbench ${mode} --mysql-host=$host --mysql-port=$port --mysql-user=test --mysql-password=test123 --tables=$table_num --table_size=$table_size_set --report-interval=1 --mysql-db=sbtest --rand-type=special --rand-spec-pct=15"
        conn='0'
        warm_cmd="$cmd --threads=128 --time=$warm_time run"
        file_name="${EXECDIR}/$test_name/${test_name}-${mode}.log"
        $warm_cmd > ${file_name}.log
        for conn in ${conns[*]}
        do
            file_name="${EXECDIR}/$test_name/${test_name}-${mode}-${conn}-${host}.log"
            kill_mysqld
            start_and_wait_for_ready "${test_name}-${mode}-${conn}-${host}"
            run_cmd="$cmd --threads=$conn --time=$run_time run"
            echo "$run_cmd" >> ${file_name}.log
            $run_cmd | ts >> ${file_name}.log
        done
    done
done
"
========================================
Also, you can use this shell script to monitor innodb_metrics:
"
TARGET_ECS_IP=x.x.x.x
BASE_DIR="/root/sbtest"
log_name=$1
OUTPUT_DIR=$BASE_DIR/$log_name
mysql_cmd="mysql -h$TARGET_ECS_IP -utest -P3306 -ptest123"
MONITOR_SQL="SELECT name, count FROM INFORMATION_SCHEMA.INNODB_METRICS WHERE name like 'buffer_flush%';"
while true
do
    echo "$MONITOR_SQL" | $mysql_cmd | ts >> $OUTPUT_DIR
    sleep 1
done
"
Suggested fix:
1. Maybe we can fix this bug by handling the degenerate case where `pct_for_lsn > 30` but the age information is meaningless, i.e. `page_cleaner->slots[i].n_pages_requested / sum_pages_for_lsn == 1` for every slot; a sketch is given below.
2. Add a monitor for the exact total number of pages requested for flushing (the sum of page_cleaner->slots[i].n_pages_requested over all slots), and make sure it stays close to `buffer_flush_n_to_flush_requested`.
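For suggestion 1, here is a minimal sketch of a possible guard, based on the loop quoted in step 5 (untested and for discussion only; age_info_valid is a name I made up):
"
// Untested sketch: when the age distribution is meaningless because
// lsn_avg_rate has not warmed up yet (sum_pages_for_lsn degenerated to 1),
// fall back to spreading n_pages evenly across the slots instead of
// handing every slot the full n_pages.
const bool age_info_valid = lsn_avg_rate > 0 && sum_pages_for_lsn > 1;
for (ulint i = 0; i < srv_buf_pool_instances; i++) {
  page_cleaner->slots[i].n_pages_requested =
      (pct_for_lsn > 30 && age_info_valid)
          ? page_cleaner->slots[i].n_pages_requested * n_pages /
                    sum_pages_for_lsn + 1
          : n_pages / srv_buf_pool_instances + 1;
}
"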