Bug #95582 DDL using bulk load is very slow under long flush_list
Submitted: 31 May 2019 5:02 Modified: 21 Jun 2019 13:37
Reporter: Fungo Wang (OCA)
Status: Verified
Category: MySQL Server: InnoDB storage engine    Severity: S3 (Non-critical)
Version: 8.0.16, 5.7.26    OS: Any
Assigned to:    CPU Architecture: Any
Tags: bulk load, DDL, flush list, innodb

[31 May 2019 5:02] Fungo Wang
Description:
This issue is similar to the one that the following patch (809bc0e2bfda1033210a7b14265d984a7736ec65) aimed to fix.

commit 809bc0e2bfda1033210a7b14265d984a7736ec65
Author: Shaohua Wang <shaohua.wang@oracle.com>
Date:   Tue Apr 28 14:17:05 2015 +0800

    Bug#19865673 DDL LIKE ADD INDEX IS VERY SLOW IN 5.7.5

    Analysis:
    We have a checkpoint at the end of bulk load to flush all dirty pages to disk,
    because we disable redo logging for bulk load. It works well for big table, but
    not for small table in heavy work load.

    Solution:
    Only flush the dirty pages modified by the bulk load. the main idea is to set a
    flush observer to these dirty pages when committing mtr during a bulk load. When
    a page is submitted to flush or a page is flushed to disk, it notifies the
    observer. We scan flush list in buffer pool and flush dirty pages with the
    observer at the end of the bulk load, and wait for all the flushes to finish.

    Reviewed-by: Jimmy Yang <jimmy.yang@oracle.com>
    RB: 8238

Currently, FlushObserver makes sure the dirty pages modified by a bulk load are flushed to disk before the DDL finishes. But the flushing is not efficient.
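
For reference, a minimal sketch of the observer idea (simplified, hypothetical names; the real class is FlushObserver in buf0flu.h and its actual interface differs):

#include <atomic>
#include <thread>

/* Hypothetical stand-in for InnoDB's FlushObserver. Every page the
   bulk load dirties bumps a counter, every completed flush of such a
   page drops it, and the DDL waits until it reaches zero. */
class BulkFlushObserver {
 public:
  /* Called when an mtr commit dirties a page for the bulk load. */
  void notify_dirtied() { m_pending.fetch_add(1); }

  /* Called by the flusher when one of those pages reaches disk. */
  void notify_flushed() { m_pending.fetch_sub(1); }

  /* Called at the end of the bulk load: wait for all the flushes. */
  void wait_flushed() const {
    while (m_pending.load() > 0) std::this_thread::yield();
  }

 private:
  std::atomic<int> m_pending{0};
};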

Check the logic in buf_flush_or_remove_pages(): whenever a dirty page is successfully flushed to disk, the scan has to restart from the tail of the flush_list.

If the flush list is very long (under heavy load) and the dirty pages modified by the bulk load sit near the head of the flush list, the flushing will be very slow, approximately O(n^2).
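
To make the O(n^2) behaviour concrete, here is a small self-contained simulation of the restart-from-tail scan (illustration only; the real loop is more involved, and it restarts because flushing a page releases the flush_list mutex and invalidates the scan position):

#include <cstdio>
#include <list>

struct Page {
  bool ours;            /* dirtied by our bulk load? */
  bool flushed = false;
};

/* Mimics buf_flush_or_remove_pages(): scan from the tail of the
   flush list, and after every successful flush restart from the
   tail again. */
static long flush_with_restarts(std::list<Page> &flush_list) {
  long steps = 0;
  bool restart = true;
  while (restart) {
    restart = false;
    for (auto it = flush_list.rbegin(); it != flush_list.rend(); ++it) {
      ++steps;
      if (it->ours && !it->flushed) {
        it->flushed = true; /* "flush" one of our pages */
        restart = true;     /* mutex was released: rescan from tail */
        break;
      }
    }
  }
  return steps;
}

int main() {
  std::list<Page> flush_list;
  /* 100 bulk-load pages near the head, 10000 older dirty pages from
     the heavy workload between them and the tail. */
  for (int i = 0; i < 100; ++i) flush_list.push_front({true});
  for (int i = 0; i < 10000; ++i) flush_list.push_back({false});
  std::printf("list steps: %ld\n", flush_with_restarts(flush_list));
  return 0;
}

Each flush re-walks the ~10000 foreign pages before reaching the next of our pages, so the 100 flushes cost about 1,000,000 list steps instead of a single 10,100-step pass.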

How to repeat:
Check the logic in buf_flush_or_remove_pages().

We observed this issue in our production environment, and I constructed a perf test to demonstrate it:

1. innodb_buffer_pool_size = 128G
2. prepare data using sysbench with --tables=16 --table_size=50000000, so that the data set is bigger than the BP
3. run sysbench oltp_update_non_index for a while, to make sure most BP pages are dirtied

------innodb bp pages status-----
   data   free  dirty flush dirty%
8015376   2670 7445445 18624  92.9
8015732   2302 7444745 11409  92.9
8016224   1806 7446387  7341  92.9
8015949   2081 7447139 11325  92.9
8016349   1681 7446657  7290  92.9
8016334   1715 7445316 11236  92.9
8016175   1881 7443963 22475  92.9
8015580   2482 7446586  7324  92.9

4. alter table ... add index on another small table. In my case, t1 is a sysbench-style table with 2,000,000 records, about 500 MB

mysql> alter table t1 add index d_k_1(c, k);
Query OK, 0 rows affected (12 min 19.39 sec)
Records: 0  Duplicates: 0  Warnings: 0

This ALTER takes over 12 minutes to finish.

The perf result shows that buf_flush_dirty_pages() is very hot.

+  35.30%  mysqld  mysqld                [.] ut_delay(unsigned long)
-   9.26%  mysqld  mysqld                [.] buf_flush_dirty_pages(buf_pool_t*, unsigned int, FlushObserver*, bo
     buf_flush_dirty_pages(buf_pool_t*, unsigned int, FlushObserver*, bool, trx_t const*, bool) [clone .constpro
     buf_LRU_flush_or_remove_pages(unsigned int, buf_remove_t, trx_t const*, bool)
     FlushObserver::flush()
     row_merge_build_indexes(trx_t*, dict_table_t*, dict_table_t*, bool, dict_index_t**, unsigned long const*, u
     bool ha_innobase::inplace_alter_table_impl<dd::Table>(TABLE*, Alter_inplace_info*, dd::Table const*, dd::Ta
     ha_innobase::inplace_alter_table(TABLE*, Alter_inplace_info*, dd::Table const*, dd::Table*)
     mysql_inplace_alter_table(THD*, dd::Schema const&, dd::Schema const&, dd::Table const*, dd::Table*, TABLE_L
     mysql_alter_table(THD*, char const*, char const*, HA_CREATE_INFO*, TABLE_LIST*, Alter_info*)
     Sql_cmd_alter_table::execute(THD*)
     mysql_execute_command(THD*, bool)
     mysql_parse(THD*, Parser_state*)
     dispatch_command(THD*, COM_DATA const*, enum_server_command)
     do_command(THD*)
     handle_connection
     pfs_spawn_thread

On an idle instance, with the same table size, the same ALTER ... ADD INDEX takes only about 18 seconds.

mysql> alter table t2 add index d_k_1(c, k);
Query OK, 0 rows affected (18.03 sec)
Records: 0  Duplicates: 0  Warnings: 0

Suggested fix:
I think the root cause is that redo log writing is optimized away for bulk load, so to keep such DDL crash safe, the dirtied pages must be flushed to disk synchronously before the DDL finishes.

So one way to solve this is to add back redo log writing for bulk load, as I suggested in bug report https://bugs.mysql.com/bug.php?id=92099.

Both MEB and PXB are affected by this redo-skipping optimization:
https://dev.mysql.com/doc/mysql-enterprise-backup/4.1/en/bugs.backup.html
https://www.percona.com/blog/2017/08/08/avoiding-the-an-optimized-without-redo-logging-ddl...

Another way is to keep the current redo writing strategy for bulk load and optimize the scan logic in buf_flush_or_remove_pages().
Maybe we can add another hazard pointer like flush_hp for this scenario, but with some variations, because a hazard pointer is not designed for concurrent users, and there can be concurrent DDLs; see the sketch below.
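
A minimal sketch of that direction, reusing the Page list simulation above (hypothetical code: the real flush_hp in buf0buf.h is a single-user cursor, so a variant usable by concurrent DDLs would need its own protection):

/* Single-pass variant of flush_with_restarts() from the simulation
   above: instead of restarting from the tail after each flush, keep
   the scan position, the way a hazard pointer would keep the cursor
   valid while the flush_list mutex is released. */
static long flush_with_cursor(std::list<Page> &flush_list) {
  long steps = 0;
  for (auto it = flush_list.rbegin(); it != flush_list.rend(); ++it) {
    ++steps;
    if (it->ours && !it->flushed) {
      it->flushed = true; /* flush, then simply continue the scan */
    }
  }
  return steps; /* one O(n) pass over the list */
}

With the same workload as above this does a single pass of about 10,100 steps instead of about 1,000,000, which is the difference between O(n) and O(n^2).
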
[21 Jun 2019 13:37] MySQL Verification Team
Hi Mr. Wang,

Thank you for your bug report.

I have managed to repeat the behaviour with half of the resources that you used.

Verified as reported.