Description:
Recently I was testing performance after crash recovery using sysbench, and I found that under high concurrency the TPS in the first 30 seconds drops unexpectedly. Through our team's performance monitoring tool, I found that the cause of the TPS drop is an abnormal amount of dirty page flushing.
My test runs on an Alibaba Cloud ECS instance, and the disk is an ESSD with 350 MB/s I/O throughput. Flushing about 20000 pages per second is enough to saturate the disk, so the unexpected dirty page flushing has a great impact on performance.
========================================
By reading the code and observing the monitoring information in information_schema.innodb_metrics, I located the root cause of the problem. Let me first explain the reason for this bug through the code.
1. First and most important, InnoDB recalculates lsn_avg_rate every innodb_flushing_avg_loops (default 30) iterations of the page cleaner, i.e. roughly every 30 seconds. When the server starts up from crash recovery, lsn_avg_rate is therefore always 0 during the first innodb_flushing_avg_loops seconds. A simplified sketch of this averaging follows.
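To illustrate, here is a minimal compilable paraphrase of that moving-average bookkeeping (a sketch only, not the literal InnoDB source; the names lsn_avg_rate, prev_lsn and srv_flushing_avg_loops follow buf0flu.cc, everything else is simplified):
"
#include <cstdint>

using lsn_t = uint64_t;
using ulint = unsigned long;

static const ulint srv_flushing_avg_loops = 30;  // innodb_flushing_avg_loops

// Called roughly once per second by the page cleaner coordinator.
void update_lsn_avg_rate(lsn_t cur_lsn, double time_elapsed, lsn_t &prev_lsn,
                         lsn_t &lsn_avg_rate, ulint &n_iterations) {
  if (++n_iterations >= srv_flushing_avg_loops) {
    const lsn_t lsn_rate = static_cast<lsn_t>(
        static_cast<double>(cur_lsn - prev_lsn) / time_elapsed);
    // Exponential moving average. lsn_avg_rate starts at 0 after startup,
    // and nothing updates it until the first full window (~30s) completes,
    // so everyone reading it during that window sees 0.
    lsn_avg_rate = (lsn_avg_rate + lsn_rate) / 2;
    prev_lsn = cur_lsn;
    n_iterations = 0;
  }
}
"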
2. In Adaptive_flush::set_flush_target_by_lsn, InnoDB calculates target_lsn as follows:
"
target_lsn = oldest_lsn + lsn_avg_rate * buf_flush_lsn_scan_factor;
"
Since lsn_avg_rate is 0, target_lsn is equal to oldest_lsn, and there are almost no dirty pages whose oldest modification LSN is below this value.
3. Because of this degenerate target_lsn, InnoDB computes an extremely low sum_pages_for_lsn: the scan below breaks almost immediately, pages_for_lsn is close to 0, and page_cleaner->slots[i].n_pages_requested becomes 0 / scan_factor + 1 = 1 for each buffer pool instance:
"
for (ulint i = 0; i < srv_buf_pool_instances; i++) {
  // ...
  for (buf_page_t *b = UT_LIST_GET_LAST(buf_pool->flush_list); b != nullptr;
       b = UT_LIST_GET_PREV(list, b)) {
    if (b->get_oldest_lsn() > target_lsn) {
      break;
    }
    ++pages_for_lsn;
    // ...
  }
  sum_pages_for_lsn += pages_for_lsn;
  // ...
  page_cleaner->slots[i].n_pages_requested = pages_for_lsn / scan_factor + 1;
  // ...
}
"
4. To avoid a division by zero, InnoDB clamps sum_pages_for_lsn to at least 1:
"
sum_pages_for_lsn /= scan_factor;
if (sum_pages_for_lsn < 1) {
  sum_pages_for_lsn = 1;
}
"
5. Finally, InnoDB computes the per-slot value of page_cleaner->slots[i].n_pages_requested:
"
for (ulint i = 0; i < srv_buf_pool_instances; i++) {
  /* if REDO has enough of free space,
  don't care about age distribution of pages */
  page_cleaner->slots[i].n_pages_requested =
      pct_for_lsn > 30 ? page_cleaner->slots[i].n_pages_requested * n_pages /
                                 sum_pages_for_lsn +
                             1
                       : n_pages / srv_buf_pool_instances + 1;
}
"
You can see that when pct_for_lsn > 30, InnoDB takes the age distribution of pages into account. In the formula, `page_cleaner->slots[i].n_pages_requested / sum_pages_for_lsn` represents each buffer pool instance's share of the old pages. But in this case that ratio is 1/1 = 1 for every slot.
Then you will find that, instead of distributing `n_pages` across the page cleaner slots, EVERY slot is asked to flush `n_pages`! So the total amount of dirty page flushing is actually about `srv_buf_pool_instances * n_pages` (one `n_pages` per buffer pool instance) rather than `n_pages`. The arithmetic is illustrated below.
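To make the arithmetic concrete, here is a minimal standalone demo (all numbers are hypothetical, chosen only to illustrate the effect: 8 buffer pool instances, an overall recommendation of n_pages = 10000, and the degenerate per-slot request of 1 with sum_pages_for_lsn clamped to 1):
"
#include <cstdio>

int main() {
  const unsigned long srv_buf_pool_instances = 8;  // hypothetical
  const unsigned long n_pages = 10000;             // overall flush target
  const unsigned long sum_pages_for_lsn = 1;       // degenerate, clamped to 1
  const unsigned long pct_for_lsn = 40;            // redo filling up (> 30)

  unsigned long total_requested = 0;
  for (unsigned long i = 0; i < srv_buf_pool_instances; i++) {
    unsigned long slot_requested = 1;  // pages_for_lsn / scan_factor + 1
    slot_requested = pct_for_lsn > 30
                         ? slot_requested * n_pages / sum_pages_for_lsn + 1
                         : n_pages / srv_buf_pool_instances + 1;
    total_requested += slot_requested;
  }
  // Prints 80008: every slot is asked to flush ~n_pages pages instead of
  // the slots sharing n_pages between them.
  std::printf("total requested = %lu\n", total_requested);
  return 0;
}
"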
How to repeat:
1. You need a monitor to get the exact value of flushed pages every second.
2. Enable the buffer module counters of information_schema.innodb_metrics in my.cnf:
"
innodb_monitor_enable = 'module_buffer'
"
3. Run sysbench:
3.1 prepare
3.2 warm up for 200s
3.3 kill mysqld and restart it
3.4 connect to mysqld once the server is ready for connections
3.5 run a high-concurrency test; the mode should preferably be oltp_write_only or oltp_read_write, which generate a lot of redo and push the LSN forward quickly
4. Observe information_schema.innodb_metrics: you will find that when `buffer_flush_pct_for_lsn` goes above 30, InnoDB flushes far more pages than `buffer_flush_n_to_flush_requested`. You can also see that, in the first 30s, `buffer_flush_n_to_flush_by_age` is always 1, while `buffer_flush_lsn_avg_rate` and `buffer_flush_avg_page_rate` stay at 0.
If you find it hard to repeat, try decreasing your redo log capacity, which makes it easier to reach `pct_for_lsn > 30`. Meanwhile, try enlarging your buffer pool, to make sure that InnoDB always flushes dirty pages adaptively to the LSN. An example configuration is given below.
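A starting point for my.cnf (the values are only illustrative and need tuning to your machine; innodb_redo_log_capacity requires MySQL 8.0.30+, older versions should shrink innodb_log_file_size * innodb_log_files_in_group instead):
"
# small redo capacity so that pct_for_lsn exceeds 30 quickly
innodb_redo_log_capacity = 1G
# large buffer pool so that flushing stays LSN-driven
innodb_buffer_pool_size = 32G
innodb_monitor_enable = 'module_buffer'
"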
========================================
I can give you my test.sh; you can modify and use it:
"
EXECDIR="/root/sbtest"
test_name=$1
table_size_set=4000000
table_num=10
warm_time=200
run_time=200
instances=('x.x.x.x:3306')
modes=('oltp_write_only')
conns=('128')
EXEC=/u01/mysql80_current/bin/mysqld
LOG_DIR=/u01/3306/log/mysql
CNF=/u01/3306/my.cnf
EXT_OPT="--user=root"
TARGET_ECS_IP=x.x.x.x
kill_mysqld()
{
    mysqld_pid=$(ssh root@$TARGET_ECS_IP "pidof mysqld")
    # pidof prints nothing when mysqld is not running, so test for a
    # non-empty string instead of comparing a possibly empty value to 0
    if [ -n "$mysqld_pid" ];then
        ssh root@$TARGET_ECS_IP "kill -9 $mysqld_pid"
    fi
}
start_and_wait_for_ready()
{
    LOG=$LOG_DIR/$1.log
    SYSBENCH_LOG=${EXECDIR}/$test_name/$1.log.log
    ssh root@$TARGET_ECS_IP "rm -rf $LOG"
    ssh root@$TARGET_ECS_IP "$EXEC --defaults-file=$CNF $EXT_OPT --log-error=$LOG &"
    START_DONE=0
    while [ $START_DONE -eq 0 ]
    do
        sleep 0.5
        START_DONE=$(ssh root@$TARGET_ECS_IP "grep -c 'mysqld: ready for connections' $LOG")
    done
    echo "=========TIMESTAMP=========" >> $SYSBENCH_LOG
    ssh root@$TARGET_ECS_IP "grep '\[RTO\]' $LOG" >> $SYSBENCH_LOG
    ssh root@$TARGET_ECS_IP "grep 'mysqld: ready for connections' $LOG" >> $SYSBENCH_LOG
    echo "=========TIMESTAMP=========" >> $SYSBENCH_LOG
}
if [ "$test_name" == "" ];then
echo "Test name not given."
fi
mkdir -p ${EXECDIR}/$test_name
for ip_port in ${instances[*]}
do
    host=`echo $ip_port | awk -F ':' '{print $1}'`
    port=`echo $ip_port | awk -F ':' '{print $2}'`
    prepare_cmd="sysbench oltp_read_write --mysql-host=$host --mysql-port=$port --mysql-user=test --mysql-password=test123 --tables=$table_num --threads=$table_num --table_size=$table_size_set --mysql-db=sbtest --rand-type=special --rand-spec-pct=15 prepare"
    $prepare_cmd > ${EXECDIR}/$test_name/${test_name}.prepare
    for mode in ${modes[*]}
    do
        cmd="sysbench ${mode} --mysql-host=$host --mysql-port=$port --mysql-user=test --mysql-password=test123 --tables=$table_num --table_size=$table_size_set --report-interval=1 --mysql-db=sbtest --rand-type=special --rand-spec-pct=15"
        conn='0'
        warm_cmd="$cmd --threads=128 --time=$warm_time run"
        file_name="${EXECDIR}/$test_name/${test_name}-${mode}.log"
        $warm_cmd > ${file_name}.log
        for conn in ${conns[*]}
        do
            file_name="${EXECDIR}/$test_name/${test_name}-${mode}-${conn}-${host}.log"
            kill_mysqld
            start_and_wait_for_ready "${test_name}-${mode}-${conn}-${host}"
            run_cmd="$cmd --threads=$conn --time=$run_time run"
            echo "$run_cmd" >> ${file_name}.log
            $run_cmd | ts >> ${file_name}.log
        done
    done
done
"
========================================
Also, you can use this shell script to monitor innodb_metrics:
"
TARGET_ECS_IP=x.x.x.x
BASE_DIR="/root/sbtest"
log_name=$1
OUTPUT_DIR=$BASE_DIR/$log_name
mysql_cmd="mysql -h$TARGET_ECS_IP -utest -P3306 -ptest123"
MONITOR_SQL="SELECT name, count FROM INFORMATION_SCHEMA.INNODB_METRICS WHERE name like 'buffer_flush%';"
while true
do
    echo "$MONITOR_SQL" | $mysql_cmd | ts >> $OUTPUT_DIR
    sleep 1
done
"
Suggested fix:
1. Maybe we can fix this bug by handling the degenerate case where `pct_for_lsn > 30` but the age information is meaningless, i.e. `page_cleaner->slots[i].n_pages_requested / sum_pages_for_lsn == 1` for every slot; a sketch is given below.
2. Add a monitor for the exact total number of pages requested for flushing (the sum of page_cleaner->slots[i].n_pages_requested over all slots), and make sure it stays close to `buffer_flush_n_to_flush_requested`.
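For suggestion 1, here is a minimal sketch of a possible guard, based on the loop quoted in step 5 (untested and for discussion only; age_info_valid is a name I made up):
"
// Untested sketch: when the age distribution is meaningless because
// lsn_avg_rate has not warmed up yet (sum_pages_for_lsn degenerated to 1),
// fall back to spreading n_pages evenly across the slots instead of
// handing every slot the full n_pages.
const bool age_info_valid = lsn_avg_rate > 0 && sum_pages_for_lsn > 1;
for (ulint i = 0; i < srv_buf_pool_instances; i++) {
  page_cleaner->slots[i].n_pages_requested =
      (pct_for_lsn > 30 && age_info_valid)
          ? page_cleaner->slots[i].n_pages_requested * n_pages /
                    sum_pages_for_lsn + 1
          : n_pages / srv_buf_pool_instances + 1;
}
"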