Bug #113733 Innodb page has a large proportion of garbage space in certain conditions
Submitted: 24 Jan 2024 6:36 Modified: 1 Feb 2024 6:24
Reporter: ksql- team Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: InnoDB storage engine Severity:S2 (Serious)
Version:8.0, 8.0.36 OS:Any
Assigned to: CPU Architecture:Any
Tags: BTREE, page split, space

[24 Jan 2024 6:36] ksql- team
Description:
When we used the gh-ost tool to do DDL, we found that the innodb space expanded a lot after doing it. 
We analyze source code and insertion testing, found that this problem is particularly obvious when the row length is more than 3K and the larger primary id is inserted first and then the smaller id is written in order.

Because gh-ost is a method of increment binlog events first and then copy rows from origin table. Therefore, in this scenario, there will be space expansion problems every time DDL is completed. This happened in our production environment and impacted business storage usage costs.

The reason is mainly caused by the b-tree page split strategy.

in btr_page_get_split_rec_to_right func: 

# if insert_point's next_record is supremum, will split at insert_point pos not next_record pos. This strategy will cause each page to retain one garbage record in the above record insertion scenario. 

if (page_header_get_ptr(page, PAGE_LAST_INSERT) == insert_point) {
    rec_t*  next_rec;
    next_rec = page_rec_get_next(insert_point);
    if (page_rec_is_supremum(next_rec)) {
split_at_new:
        *split_rec = NULL;
    } else {
        rec_t*  next_next_rec = page_rec_get_next(next_rec);
        if (page_rec_is_supremum(next_next_rec)) {
            # here 
            goto split_at_new;
        }
        /* If there are >= 2 user records up from the insert
        point, split all but 1 off. We want to keep one because
        then sequential inserts can use the adaptive hash
        index, as they can do the necessary checks of the right
        search position just by looking at the records on this
        page. */
        *split_rec = next_next_rec;
    }
    return(TRUE);
}

# Also, I don’t understand why next_next_rec was chosen as the split point when there are >= 2 user records up from the insert point. 

How to repeat:
create table sbtest5(id int primary key auto_increment, pad1 longtext);
insert into sbtest5(id, pad1) values(1,repeat('a',4024));
insert into sbtest5(id, pad1) values(2,repeat('a',4024));
insert into sbtest5(id, pad1) values(10001,repeat('a',4024));

# use innodb_space tool analyze page space

$ innodb_space -s ibdata1 -T sbtest/sbtest5 space-index-pages-summary
page        index   level   data    free    records
3           53      0       12147   4105    3
4           0       0       0       16384   0
5           0       0       0       16384   0

# insert smaller id rows in order

insert into sbtest5(id, pad1) values(3,repeat('a',4024));
insert into sbtest5(id, pad1) values(4,repeat('a',4024));
insert into sbtest5(id, pad1) values(5,repeat('a',4024));
insert into sbtest5(id, pad1) values(6,repeat('a',4024));
insert into sbtest5(id, pad1) values(7,repeat('a',4024));
insert into sbtest5(id, pad1) values(8,repeat('a',4024));
insert into sbtest5(id, pad1) values(9,repeat('a',4024));
insert into sbtest5(id, pad1) values(10,repeat('a',4024));

$ innodb_space -s ibdata1 -T sbtest/sbtest5 space-index-pages-summary
page        index   level   data    free    records
3           53      1       84      16168   6
4           53      0       4049    12203   1
5           53      0       8098    8154    2
6           53      0       8098    8154    2
7           53      0       8098    8154    2
8           53      0       8098    8154    2
9           53      0       8098    8154    2
10          0       0       0       16384   0

# Each page only stores 2 rows, using half of the space after page splitting.
[24 Jan 2024 12:05] MySQL Verification Team
Hi Mr. team,

Thank you for your bug report.

However, we can not repeat your test case.

First of all, MySQL 5.7 is not supported, so we do not process bug reports for that version.

Second, we use only our own tools, never third-party tools. 

We could try to process your bug report if it is constrained to 8.0 and if you use our own tools only.

Can't repeat.
[31 Jan 2024 3:47] ksql- team
Hi~
The situation above can repeat easily with mysql 8.0.32.The steps are as follows:
1.create two same table, sbtest2 and sbtest3
CREATE TABLE `sbtest3` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `k` int(11) DEFAULT NULL,
  `pad` longtext COLLATE utf8mb4_unicode_ci,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
delimiter $$
create procedure load_t_2(count int unsigned)
begin
set @c=0;
while @c<count do
insert ignore into sbtest2 values (@c, 1, repeat('a',4096));
set @c=@c+1;
end while;
end;
$$
delimiter ;
2. for sbtest2, insert data sequentially, eg: call load_t_2(260000);
3. for sbtest3, do as follows:
 i) insert data with a small id=1, eg: call load_t_3(1);
 ii) insert data with a big id=1000000, eg: insert ignore into sbtest3 values (1000000, 1000000, repeat('a',4096));
 iii) insert data into sbtest3 sequentially, eg: call load_t_3(260000);
4.check data size, sbtest3 is much bigger then sbtest2, but only have one more data.
-rw-r----- 1 mysql mysql 1.4G Jan 31 10:49 sbtest2.ibd
-rw-r----- 1 mysql mysql 2.1G Jan 31 10:51 sbtest3.ibd
The reason we analyze is about the page splite strategy, please check.
[31 Jan 2024 10:28] MySQL Verification Team
Hi,

You have supplied us this time with sufficient info on the table. The only missing detail in order to repeat your test case is a tool innodb_space. That is not our tool and we use our tools here only.

It is not our tool, so we can not use it . However, there are tables in the Performance_Schema which can supply the corresponding info, so please use one of those tables to provide info on the extra space used.

Next, if you read a chapters on InnoDB, you will notice that InnoDB pages are not packed tightly. This is made so in order to keep all rows stored in the order of the PRIMARY KEY. That is fully documented in our Reference Manual.

Hence, what you are reporting is actually expected behaviour.

We also wonder what do you mean by "garbage space". Unused space is not a garbage space. It is unused, for the reason that we explained above. 

Simply, InnoDB used clustered storage of rows , sorted in the table by PK. If you do not specify a PK, InnoDB will create it's own PK.

We do not see how can this be a bug.
[1 Feb 2024 6:24] MySQL Verification Team
Hello!

Thank you for the report and detailed steps to reproduce.
Verified as described.

Sincerely,
Umesh