MySQL Bugs: #114468: Inconsistent table performance: fresh data import vs after optimization

Bug #114468	Inconsistent table performance: fresh data import vs after optimization
Submitted:	24 Mar 2024 21:51	Modified:	25 Mar 2024 16:44
Reporter:	Przemyslaw Malkowski	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: InnoDB storage engine	Severity:	S5 (Performance)
Version:	8.3.0, 8.0.36	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
I found out that there is a massive speed difference for certain queries on the same table, but depending on whether the table was freshly populated with data as opposed to after table rebuild (via optimize table or noop-alter).

How to repeat:
All below tests were made on the same instance, with all default settings, except these:
skip-log-bin
innodb_flush_method=O_DIRECT
innodb_flush_log_at_trx_commit=0
innodb_redo_log_capacity=2G
innodb_parallel_read_threads=1

Multiple summary rows are to show that I ran the queries multiple times in a row to make sure the result is stable.

The initial table was populated with sysbench as follows:

$ sysbench /usr/share/sysbench/oltp_common.lua --mysql_storage_engine=innodb  --table-size=10000000 --tables=1 --mysql-db=db1 --mysql-user=msandbox --mysql-password=msandbox --db-driver=mysql --mysql-socket=/tmp/mysql_sandbox8300.sock prepare
sysbench 1.0.20 (using system LuaJIT 2.0.4)

Creating table 'sbtest1'...
Inserting 10000000 records into 'sbtest1'
Creating a secondary index on 'sbtest1'...

mysql > select @@version,@@innodb_parallel_read_threads;
+-----------+--------------------------------+
| @@version | @@innodb_parallel_read_threads |
+-----------+--------------------------------+
| 8.3.0     |                              1 |
+-----------+--------------------------------+
1 row in set (0.00 sec)

mysql > select count(*) from sbtest1;
+----------+
| count(*) |
+----------+
| 10000000 |
+----------+
1 row in set (17.09 sec)
1 row in set (17.31 sec)
1 row in set (17.30 sec)

The result tablespace size:

-rw-r-----. 1 mysql mysql 2.3G Mar 24 06:38 sbtest1.ibd

After optimizing, the count speed results are much different (~2x slower)!

mysql > optimize table sbtest1;
+-------------+----------+----------+-------------------------------------------------------------------+
| Table       | Op       | Msg_type | Msg_text                                                          |
+-------------+----------+----------+-------------------------------------------------------------------+
| db1.sbtest1 | optimize | note     | Table does not support optimize, doing recreate + analyze instead |
| db1.sbtest1 | optimize | status   | OK                                                                |
+-------------+----------+----------+-------------------------------------------------------------------+
2 rows in set (2 min 30.81 sec)

-rw-r-----. 1 mysql mysql 2.6G Mar 24 07:09 sbtest1.ibd

mysql > select count(*) from sbtest1;
+----------+
| count(*) |
+----------+
| 10000000 |
+----------+
1 row in set (36.68 sec)
1 row in set (34.51 sec)
1 row in set (36.06 sec)

Rebuilding via noop alter does not change anything:

mysql > alter table sbtest1 engine=innodb;
Query OK, 0 rows affected (2 min 32.67 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql > select count(*) from sbtest1;
+----------+
| count(*) |
+----------+
| 10000000 |
+----------+
1 row in set (38.65 sec)
1 row in set (37.78 sec)
1 row in set (38.98 sec)

Then I create a shadow copy:

mysql > create table sbtest11 like sbtest1;
Query OK, 0 rows affected (0.08 sec)

mysql > insert into sbtest11 select * from sbtest1;
Query OK, 10000000 rows affected (11 min 39.46 sec)
Records: 10000000  Duplicates: 0  Warnings: 0

-rw-r-----. 1 mysql mysql 2.4G Mar 24 07:47 sbtest11.ibd
-rw-r-----. 1 mysql mysql 2.6G Mar 24 07:24 sbtest1.ibd

And the speed of the new table is again ~2x faster:

mysql > select count(*) from sbtest11;
+----------+
| count(*) |
+----------+
| 10000000 |
+----------+
1 row in set (17.16 sec)
1 row in set (17.24 sec)
1 row in set (16.87 sec)

Surprisingly, SELECT queries using a secondary index are WAAY faster on the optimized table (explain is identical for both):

mysql > explain select avg(k) from sbtest1;
+----+-------------+---------+------------+-------+---------------+------+---------+------+---------+----------+-------------+
| id | select_type | table   | partitions | type  | possible_keys | key  | key_len | ref  | rows    | filtered | Extra       |
+----+-------------+---------+------------+-------+---------------+------+---------+------+---------+----------+-------------+
|  1 | SIMPLE      | sbtest1 | NULL       | index | NULL          | k_1  | 4       | NULL | 9864360 |   100.00 | Using index |
+----+-------------+---------+------------+-------+---------------+------+---------+------+---------+----------+-------------+
1 row in set, 1 warning (0.00 sec)

mysql > select avg(k) from sbtest1;
+--------------+
| avg(k)       |
+--------------+
| 5009290.5886 |
+--------------+
1 row in set (4.86 sec)

mysql > select avg(k) from sbtest11;
+--------------+
| avg(k)       |
+--------------+
| 5009290.5886 |
+--------------+
1 row in set (1 min 0.26 sec)

But queries using full table scan are, on the contrary - much faster on a newly populated table:

mysql > explain select avg(pad) from sbtest1;
+----+-------------+---------+------------+------+---------------+------+---------+------+---------+----------+-------+
| id | select_type | table   | partitions | type | possible_keys | key  | key_len | ref  | rows    | filtered | Extra |
+----+-------------+---------+------------+------+---------------+------+---------+------+---------+----------+-------+
|  1 | SIMPLE      | sbtest1 | NULL       | ALL  | NULL          | NULL | NULL    | NULL | 9864360 |   100.00 | NULL  |
+----+-------------+---------+------------+------+---------------+------+---------+------+---------+----------+-------+
1 row in set, 1 warning (0.00 sec)

mysql > select avg(pad) from sbtest1;
+-------------------+
| avg(pad)          |
+-------------------+
| 49995900874.02084 |
+-------------------+
1 row in set, 65535 warnings (46.87 sec)

mysql > select avg(pad) from sbtest11;
+-------------------+
| avg(pad)          |
+-------------------+
| 49995900874.02084 |
+-------------------+
1 row in set, 65535 warnings (20.94 sec)

The same happens when I force the table scan on the optimized table, it's again faster on the populated one:

mysql > select avg(k) from sbtest1 ignore key(k_1);
+--------------+
| avg(k)       |
+--------------+
| 5009290.5886 |
+--------------+
1 row in set (40.53 sec)

mysql > select avg(k) from sbtest11 ignore key(k_1);
+--------------+
| avg(k)       |
+--------------+
| 5009290.5886 |
+--------------+
1 row in set (18.56 sec)

Suggested fix:
There are substantial performance differences given the same table when data is inserted normally vs after table rebuild. If this is an expected behavior, I believe it is not well documented, but please correct me if I am wrong.
If not expected, best to make the performance consistent.

Hello Przemyslaw,

Thank you for the report and feedback.
I'm currently analyzing the issue, it would be really good if you could  please share the innodb_buffer_pool_size details if not default. Thank you.

Sincerely,
Umesh

Hello Umesh,

BP was at default 128M, so much smaller than the table size. The issue is visible better when disk reads are involved :)

Hello Przemyslaw,

Thank you for the details, I'm able to reproduce issue with provided steps.

Sincerely,
Umesh

Test results - 8.0.36

Attachment: 114468_8.0.36.results (application/octet-stream, text), 23.94 KiB.

Hello Przemyslaw,

One more question please. Are using SSD or HDD? Thank you.
Sorry for asking too much :)

Sincerely,
Umesh

No worries :)
I did test on both NVMe flash disk, as well as on a RAID array made from HDDs (spinning drives). In both cases, the impact was similar.

Hi all:

By analyzing the test result data, I think that the performance difference mainly comes from linear readahead. In addition, the Innob did not specify the expected page number for a new allocating page during rebuilding B-tree index for bulk load, which caused an interesting phenomenon. 

Add special debugging code:

```

-dberr_t Page_load::init() noexcept {
+dberr_t Page_load::init() noexcept {
   page_t *new_page;
   page_no_t new_page_no;
   buf_block_t *new_block;
@@ -393,6 +393,8 @@ dberr_t Page_load::init() noexcept {
   m_cur_rec = page_get_infimum_rec(new_page);
   ut_ad(m_is_comp == page_is_comp(new_page));
   m_free_space = page_get_free_space_of_empty(m_is_comp);
+
+  printf("level: %ld,  page_no: %u \n", m_level, m_page_no);

```

output:

...
level: 0, page_no: 116
level: 0, page_no: 128
...
level: 0, page_no: 172
level: 0, page_no: 192
...
level: 0, page_no: 64440
level: 0, page_no: 73792
...
level: 0, page_no: 160316
level: 0, page_no: 160832
... 

It is very terrible, which works out of order accessed pages and makes linear readahead ineffective. The below code caused discontinuity in page numbers, and is the reason for the increase in the physical size of the tablespace(~12.5%).

```

page_no_t fseg_alloc_page_no(fil_space_t *space, const page_size_t &page_size,
                             fseg_inode_t *seg_inode, page_no_t hint,
                             byte direction,
                             mtr_t *mtr IF_DEBUG(, bool has_done_reservation) {

...

 if (direction != FSP_NO_DIR &&
      reserved - used < reserved * (fseg_reserve_pct / 100) &&
      used >= FSEG_FRAG_LIMIT) {
    ret_descr = fseg_alloc_free_extent(seg_inode, space_id, page_size, mtr);
    if (ret_descr) {
      /* 3. We take any free extent (which was already assigned above
      ===============================================================
      in the if-condition to ret_descr) and take the lowest or
      ========================================================
      highest page in it, depending on the direction
      ==============================================*/
      ret_page = xdes_get_offset(ret_descr);

      if (direction == FSP_DOWN) {
        ret_page += FSP_EXTENT_SIZE - 1;
      } else if (xdes_get_state(ret_descr, mtr) == XDES_FSEG_FRAG) {
        ret_page += xdes_find_bit(ret_descr, XDES_FREE_BIT, true, 0, mtr);
      }

      ut_ad(!has_done_reservation || ret_page != FIL_NULL);
      /*-----------------------------------------------------------*/
    }
  }
...

}

```

The patch would specify the expected page number for a new allocating page during rebuilding B-tree index for bulk load.

Attachment: hint_page_no_for_page_load.patch (application/octet-stream, text), 2.83 KiB.

Hello huahua xu,

Thank you for your contribution, please ensure to re-send the patch via "contribution" tab. Otherwise we would not be able to accept it. Thank you!

Sincerely,
Umesh

Hello huahua xu,

This is just a follow-up message in continuation of my earlier message.
Thank you very much for your patch contribution, we appreciate it!
Please ensure to re-send the patch via "contribution" tab. Thank you!

Sincerely,
Umesh

Hello huahua xu,

Thank you very much for your patch contribution, we appreciate it!
In order for us to continue the process of reviewing your contribution to MySQL, please send us a signed copy of the Oracle Contributor Agreement (OCA) as outlined in https://oca.opensource.oracle.com
Signing an OCA needs to be done only once and it's valid for all other Oracle governed Open Source projects as well. Getting a signed/approved OCA on file will help us facilitate your contribution - this one, and others in the future.  

Please let me know, if you have any questions.

Sincerely,
Umesh