Description:
This issue is similar to what the following patch (809bc0e2bfda1033210a7b14265d984a7736ec65) aimed to fix:
commit 809bc0e2bfda1033210a7b14265d984a7736ec65
Author: Shaohua Wang <shaohua.wang@oracle.com>
Date: Tue Apr 28 14:17:05 2015 +0800
Bug#19865673 DDL LIKE ADD INDEX IS VERY SLOW IN 5.7.5
Analysis:
We have a checkpoint at the end of bulk load to flush all dirty pages to disk,
because we disable redo logging for bulk load. It works well for big table, but
not for small table in heavy work load.
Solution:
Only flush the dirty pages modified by the bulk load. the main idea is to set a
flush observer to these dirty pages when committing mtr during a bulk load. When
a page is submitted to flush or a page is flushed to disk, it notifies the
observer. We scan flush list in buffer pool and flush dirty pages with the
observer at the end of the bulk load, and wait for all the flushes to finish.
Reviewed-by: Jimmy Yang <jimmy.yang@oracle.com>
RB: 8238
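For context, the flush-observer mechanism described in the commit message above works roughly as sketched below. This is a simplified illustration with my own member names, not the exact InnoDB declarations.

#include <atomic>
#include <chrono>
#include <thread>

// Rough sketch of the flush-observer mechanism described in the commit
// message above. Member names are assumptions, not the real InnoDB API.
class FlushObserverSketch {
 public:
  // mtr commit tags every page dirtied by the bulk load with this observer;
  // the flush code then calls these notifications:
  void notify_flush()  { m_requested.fetch_add(1); }  // page submitted for flushing
  void notify_remove() { m_flushed.fetch_add(1); }    // page written out / removed

  // At the end of the bulk load: scan the flush list, flush every page that
  // carries this observer, then wait until all of those writes are done.
  void flush() {
    // In the real code this is where the flush-list scan is triggered.
    while (m_flushed.load() < m_requested.load()) {
      std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
  }

 private:
  std::atomic<unsigned> m_requested{0};
  std::atomic<unsigned> m_flushed{0};
};

The part that matters for this report is the flush() step: as the perf stack further below shows, it ends up in buf_LRU_flush_or_remove_pages() / buf_flush_or_remove_pages(), and that is where the inefficiency lies.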
Currently, FlushObserver makes sure that the dirty pages modified by a bulk load are flushed to disk before the DDL finishes, but the flushing itself is not efficient.
Look at the logic in buf_flush_or_remove_pages(): whenever a dirty page is successfully flushed to disk, the scan has to restart from the tail of the flush_list.
If the flush list is very long (under heavy load) and the dirty pages modified by the bulk load sit near the head of the flush list, the flushing becomes very slow, approximately O(n^2).
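To make the pattern concrete, here is a minimal model of that scan (my own simplification; the real InnoDB code differs in many details):

#include <list>

// Minimal model of the scan pattern in buf_flush_or_remove_pages().
struct Page {
  const void *observer;  // which FlushObserver, if any, dirtied this page
};

// Flush every page tagged with `observer`.  Note the restart: in InnoDB the
// flush-list mutex has to be released while a page is written out, which
// invalidates the scan position, so the scan starts over from the tail.
void flush_observer_pages(std::list<Page *> &flush_list, const void *observer) {
  bool restart = true;
  while (restart) {
    restart = false;
    // Walk from the tail (oldest modification) towards the head.
    for (auto it = flush_list.rbegin(); it != flush_list.rend(); ++it) {
      if ((*it)->observer != observer) {
        continue;                  // not a bulk-load page, skip it
      }
      // flush_page(*it);          // hypothetical: releases the list mutex...
      (*it)->observer = nullptr;   // ...mark the page done for this model
      restart = true;              // ...and rescan from the tail
      break;
    }
  }
}

// If the K bulk-load pages sit near the head of a flush list that already
// holds N other dirty pages, every one of the K passes walks past roughly N
// entries again: about O(N*K) work, i.e. close to O(n^2) under heavy load.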
How to repeat:
Check logic in buf_flush_or_remove_pages().
We observed this issue in our production environment, and I constructed a performance test to demonstrate it:
1. innodb_buffer_pool_size = 128G
2. prepare data using sysbench with --tables=16 --table_size=50000000, so that the data set is larger than the buffer pool
3. run sysbench oltp_update_non_index for a while, until most buffer pool pages are dirty
------innodb bp pages status-----
   data  free    dirty  flush  dirty%
8015376  2670  7445445  18624    92.9
8015732  2302  7444745  11409    92.9
8016224  1806  7446387   7341    92.9
8015949  2081  7447139  11325    92.9
8016349  1681  7446657   7290    92.9
8016334  1715  7445316  11236    92.9
8016175  1881  7443963  22475    92.9
8015580  2482  7446586   7324    92.9
4. ALTER TABLE ... ADD INDEX on another, smaller table. In my case, t1 has the same schema as a sysbench table, with 2,000,000 records (about 500MB):
mysql> alter table t1 add index d_k_1(c, k);
Query OK, 0 rows affected (12 min 19.39 sec)
Records: 0 Duplicates: 0 Warnings: 0
This ALTER takes over 12 minutes to finish.
The perf result shows that buf_flush_dirty_pages() is very hot:
+ 35.30% mysqld mysqld [.] ut_delay(unsigned long)
- 9.26% mysqld mysqld [.] buf_flush_dirty_pages(buf_pool_t*, unsigned int, FlushObserver*, bo
buf_flush_dirty_pages(buf_pool_t*, unsigned int, FlushObserver*, bool, trx_t const*, bool) [clone .constpro
buf_LRU_flush_or_remove_pages(unsigned int, buf_remove_t, trx_t const*, bool)
FlushObserver::flush()
row_merge_build_indexes(trx_t*, dict_table_t*, dict_table_t*, bool, dict_index_t**, unsigned long const*, u
bool ha_innobase::inplace_alter_table_impl<dd::Table>(TABLE*, Alter_inplace_info*, dd::Table const*, dd::Ta
ha_innobase::inplace_alter_table(TABLE*, Alter_inplace_info*, dd::Table const*, dd::Table*)
mysql_inplace_alter_table(THD*, dd::Schema const&, dd::Schema const&, dd::Table const*, dd::Table*, TABLE_L
mysql_alter_table(THD*, char const*, char const*, HA_CREATE_INFO*, TABLE_LIST*, Alter_info*)
Sql_cmd_alter_table::execute(THD*)
mysql_execute_command(THD*, bool)
mysql_parse(THD*, Parser_state*)
dispatch_command(THD*, COM_DATA const*, enum_server_command)
do_command(THD*)
handle_connection
pfs_spawn_thread
For an idle instance with the same table size, the same ALTER ... ADD INDEX takes only about 18 seconds:
mysql> alter table t2 add index d_k_1(c, k);
Query OK, 0 rows affected (18.03 sec)
Records: 0 Duplicates: 0 Warnings: 0
Suggested fix:
I think the root cause is that redo log writing is optimized away for bulk load, so to make such DDL crash-safe, the dirtied pages must be flushed to disk synchronously before the DDL finishes.
One way to solve this is to add back redo log writing for bulk load, as I suggested in bug report https://bugs.mysql.com/bug.php?id=92099.
Both MEB and PXB are affected by this redo-skipping optimization:
https://dev.mysql.com/doc/mysql-enterprise-backup/4.1/en/bugs.backup.html
https://www.percona.com/blog/2017/08/08/avoiding-the-an-optimized-without-redo-logging-ddl...
Another way is to keep the current strategy of skipping redo logging for bulk load, and to optimize the scan logic in buf_flush_or_remove_pages().
Maybe we can add another hazard pointer, similar to flush_hp, for this scenario, but with some variations, because the existing hazard pointer is not designed for concurrent users and there could be concurrent DDLs.
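A rough sketch of that idea follows. All names below are mine and purely for illustration; the assumption is that each observer scan gets its own cursor which flush-list removals must adjust, analogous to what flush_hp::adjust() does for the single built-in hazard pointer.

#include <vector>

// The flush list is modelled as an intrusive doubly-linked list, similar in
// spirit to the UT_LIST used by InnoDB.
struct Page {
  Page *prev = nullptr;            // towards the tail (older modifications)
  Page *next = nullptr;            // towards the head (newer modifications)
  const void *observer = nullptr;  // FlushObserver that dirtied this page, if any
};

struct ScanCursor {
  Page *pos = nullptr;             // page this scan will look at next
};

// One cursor per concurrent observer scan, i.e. per concurrent DDL.
std::vector<ScanCursor *> active_cursors;

// Whoever removes a page from the flush list (holding the flush-list mutex
// in real code) must step any cursor that points at the victim over to the
// next page in scan order.
void adjust_cursors(Page *removed) {
  for (ScanCursor *c : active_cursors) {
    if (c->pos == removed) {
      c->pos = removed->next;
    }
  }
}

// The observer scan: a single pass from tail to head, flushing tagged pages,
// with no restart -- concurrent removals keep the cursor valid through
// adjust_cursors().
void flush_observer_pages_no_restart(Page *tail, const void *observer,
                                     ScanCursor *cursor) {
  for (cursor->pos = tail; cursor->pos != nullptr; ) {
    Page *page = cursor->pos;
    cursor->pos = page->next;      // advance before doing any work on `page`
    if (page->observer == observer) {
      // flush_page(page);         // hypothetical: may drop the mutex, but the
      //                           // cursor is repaired by adjust_cursors()
    }
  }
}

With something like this, the bulk-load flush would be a single pass over the flush list instead of one pass per flushed page.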