Bug #67718 InnoDB drastically under-fills pages in certain conditions
Submitted: 27 Nov 2012 0:54 Modified: 14 Jul 2014 12:23
Reporter: Jeremy Cole (Basic Quality Contributor) Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: InnoDB storage engine Severity:S3 (Non-critical)
Version:All OS:Any
Assigned to:
Tags: innodb

[27 Nov 2012 0:54] Jeremy Cole
Description:
Inserting into an InnoDB table with a key that falls between the maximum key of a completely full page with direction inferred to "right" and the minimum key of the "next" page can lead to unnecessary page splits and drastically under-filled pages.

In practice this can happen in production systems when insertion order is nearly but not quite ascending order -- this can allow for rows to be inserted that create a gap in logical values in the index, which is then filled in by later inserts.

How to repeat:
# Using a really simple schema, but a few payload columns in order to limit
# the number of inserts necessary to trigger the bug.

DROP TABLE IF EXISTS test.page_split_test;
CREATE TABLE test.page_split_test
(
  id BIGINT UNSIGNED NOT NULL,
  payload1 CHAR(255) NOT NULL,
  payload2 CHAR(255) NOT NULL,
  payload3 CHAR(255) NOT NULL,
  payload4 CHAR(255) NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=INNODB;

# Fill up the root page, but don't split it.

INSERT INTO test.page_split_test (id, payload1, payload2, payload3, payload4) VALUES (1, REPEAT("A", 255), REPEAT("B", 255), REPEAT("C", 255), REPEAT("D", 255));
INSERT INTO test.page_split_test (id, payload1, payload2, payload3, payload4) VALUES (2, REPEAT("A", 255), REPEAT("B", 255), REPEAT("C", 255), REPEAT("D", 255));
INSERT INTO test.page_split_test (id, payload1, payload2, payload3, payload4) VALUES (3, REPEAT("A", 255), REPEAT("B", 255), REPEAT("C", 255), REPEAT("D", 255));
INSERT INTO test.page_split_test (id, payload1, payload2, payload3, payload4) VALUES (4, REPEAT("A", 255), REPEAT("B", 255), REPEAT("C", 255), REPEAT("D", 255));
INSERT INTO test.page_split_test (id, payload1, payload2, payload3, payload4) VALUES (5, REPEAT("A", 255), REPEAT("B", 255), REPEAT("C", 255), REPEAT("D", 255));
INSERT INTO test.page_split_test (id, payload1, payload2, payload3, payload4) VALUES (6, REPEAT("A", 255), REPEAT("B", 255), REPEAT("C", 255), REPEAT("D", 255));
INSERT INTO test.page_split_test (id, payload1, payload2, payload3, payload4) VALUES (7, REPEAT("A", 255), REPEAT("B", 255), REPEAT("C", 255), REPEAT("D", 255));
INSERT INTO test.page_split_test (id, payload1, payload2, payload3, payload4) VALUES (8, REPEAT("A", 255), REPEAT("B", 255), REPEAT("C", 255), REPEAT("D", 255));
INSERT INTO test.page_split_test (id, payload1, payload2, payload3, payload4) VALUES (9, REPEAT("A", 255), REPEAT("B", 255), REPEAT("C", 255), REPEAT("D", 255));
INSERT INTO test.page_split_test (id, payload1, payload2, payload3, payload4) VALUES (10, REPEAT("A", 255), REPEAT("B", 255), REPEAT("C", 255), REPEAT("D", 255));
INSERT INTO test.page_split_test (id, payload1, payload2, payload3, payload4) VALUES (11, REPEAT("A", 255), REPEAT("B", 255), REPEAT("C", 255), REPEAT("D", 255));
INSERT INTO test.page_split_test (id, payload1, payload2, payload3, payload4) VALUES (12, REPEAT("A", 255), REPEAT("B", 255), REPEAT("C", 255), REPEAT("D", 255));
INSERT INTO test.page_split_test (id, payload1, payload2, payload3, payload4) VALUES (13, REPEAT("A", 255), REPEAT("B", 255), REPEAT("C", 255), REPEAT("D", 255));
INSERT INTO test.page_split_test (id, payload1, payload2, payload3, payload4) VALUES (14, REPEAT("A", 255), REPEAT("B", 255), REPEAT("C", 255), REPEAT("D", 255));

# The next insert will trigger the root page to be split into a root non-leaf
# page and two leaf pages, both about half full.

INSERT INTO test.page_split_test (id, payload1, payload2, payload3, payload4) VALUES (15, REPEAT("A", 255), REPEAT("B", 255), REPEAT("C", 255), REPEAT("D", 255));

# Insert enough records to fill up the right-most leaf page, but don't split it.

INSERT INTO test.page_split_test (id, payload1, payload2, payload3, payload4) VALUES (16, REPEAT("A", 255), REPEAT("B", 255), REPEAT("C", 255), REPEAT("D", 255));
INSERT INTO test.page_split_test (id, payload1, payload2, payload3, payload4) VALUES (17, REPEAT("A", 255), REPEAT("B", 255), REPEAT("C", 255), REPEAT("D", 255));
INSERT INTO test.page_split_test (id, payload1, payload2, payload3, payload4) VALUES (18, REPEAT("A", 255), REPEAT("B", 255), REPEAT("C", 255), REPEAT("D", 255));
INSERT INTO test.page_split_test (id, payload1, payload2, payload3, payload4) VALUES (19, REPEAT("A", 255), REPEAT("B", 255), REPEAT("C", 255), REPEAT("D", 255));
INSERT INTO test.page_split_test (id, payload1, payload2, payload3, payload4) VALUES (20, REPEAT("A", 255), REPEAT("B", 255), REPEAT("C", 255), REPEAT("D", 255));
INSERT INTO test.page_split_test (id, payload1, payload2, payload3, payload4) VALUES (21, REPEAT("A", 255), REPEAT("B", 255), REPEAT("C", 255), REPEAT("D", 255));

# Increase the id by some amount, and insert a few records in descending order
# from that point. Each row inserted will split the "full" page from above, and
# create a new page for the single record.

INSERT INTO test.page_split_test (id, payload1, payload2, payload3, payload4) VALUES (30, REPEAT("A", 255), REPEAT("B", 255), REPEAT("C", 255), REPEAT("D", 255));
INSERT INTO test.page_split_test (id, payload1, payload2, payload3, payload4) VALUES (29, REPEAT("A", 255), REPEAT("B", 255), REPEAT("C", 255), REPEAT("D", 255));
INSERT INTO test.page_split_test (id, payload1, payload2, payload3, payload4) VALUES (28, REPEAT("A", 255), REPEAT("B", 255), REPEAT("C", 255), REPEAT("D", 255));

# Dump the page information about the test table.

SELECT page_number, page_type, number_records, data_size
FROM information_schema.innodb_buffer_page
WHERE table_name = "test/page_split_test" AND index_name = "PRIMARY";

# The pages for the table should look like:
#
# +-------------+-----------+----------------+-----------+
# | page_number | page_type | number_records | data_size |
# +-------------+-----------+----------------+-----------+
# |           3 | INDEX     |              5 |        85 |
# |           4 | INDEX     |              7 |      7378 |
# |           5 | INDEX     |             14 |     14756 |
# |           6 | INDEX     |              1 |      1054 |
# |           7 | INDEX     |              1 |      1054 |
# |           8 | INDEX     |              1 |      1054 |
# +-------------+-----------+----------------+-----------+
[27 Nov 2012 5:20] Davi Arnaut
A possible way to fix this is to restrict this type of split to the rightmost page, which still preserves the optimization for ascending key inserts. This seems to be the approach taken in other databases.
[27 Nov 2012 16:36] Sinisa Milivojevic
Jeremy,

Is it same for all ROW_FORMATs ???

Just let me know whether you have tested it or have time to test it.

I guess it is, but I am not sure for the COMPRESSED format ....
[27 Nov 2012 17:18] Jeremy Cole
Sinisa: It should apply to all table formats, since the logic for this decision is completely independent of the table format, but I have not tested it because the exact number of rows required to exactly fill a page without splitting it will differ (so the test case will differ a bit).
[27 Nov 2012 17:20] Jeremy Cole
Additionally, I had a working theory that this problem also applies to non-leaf pages, and although I have not produced a test case to duplicate it, I found instances of its "pattern" in our production tables. It would require quite a lot of inserts to produce a test case for that, since the B-tree needs to grow to a minimum of 3 levels deep.
[30 Nov 2012 12:16] Valerii Kravchuk
Sinisa,

I wonder if you had used Oracle's 5.5.20 to verify and, if you had, how exactly you verified the bug? The test case used here refers to I_S.INNODB_BUFFER_PAGE to prove the problem, while the table does not exist in Oracle's 5.5.27 (and even less in 5.5.20):

mysql> show variables like 'ver%';
+-------------------------+------------------------------+
| Variable_name           | Value                        |
+-------------------------+------------------------------+
| version                 | 5.5.27-log                   |
| version_comment         | MySQL Community Server (GPL) |
| version_compile_machine | x86                          |
| version_compile_os      | Win32                        |
+-------------------------+------------------------------+
4 rows in set (0.00 sec)

mysql> use information_schema
Database changed
mysql> show tables like 'innodb_b%';
Empty set (0.34 sec)

mysql> show tables like 'innodb%';
+----------------------------------------+
| Tables_in_information_schema (innodb%) |
+----------------------------------------+
| INNODB_CMP_RESET                       |
| INNODB_TRX                             |
| INNODB_CMPMEM_RESET                    |
| INNODB_LOCK_WAITS                      |
| INNODB_CMPMEM                          |
| INNODB_CMP                             |
| INNODB_LOCKS                           |
+----------------------------------------+
7 rows in set (0.05 sec)

Surely, MySQL manual says the table is there in 5.5, see http://dev.mysql.com/doc/refman/5.5/en/innodb-i_s-tables.html.

Do I miss anything, like a good way to study buffer pool content (or data pages by type on disk) in Oracle's MySQL 5.5.x?
[30 Nov 2012 14:33] Sinisa Milivojevic
Valeriy,

You are correct. I made mistake. I verified two bugs , one with 5.5.20 and this one with 5.6.7 and made a typo.

So, verified with 5.6.7-rc. Not with 5.5.20.
[30 Nov 2012 18:44] Jeremy Cole
Valeriy: The innodb_buffer_page table was introduced in MySQL 5.5.28.
[30 Nov 2012 18:44] Sinisa Milivojevic
There are no plans yet on when will this behavior be changed. It is definitely not on the current roadmap, but it will be re-considered in the future.
[1 Dec 2012 23:51] Jeremy Cole
Valeriy:

If you have a working Ruby installation you should be able to install my innodb_ruby gem (gem install innodb_ruby), which will allow you to print the tablespace contents for any "compact" table (the old "redundant" format is not yet supported).

$ sudo innodb_space -f /usr/local/mysql/data/test/page_split_test.ibd space-index-pages-summary
page        index   level   data    free    records 
3           157     1       85      16167   5       
4           157     0       7378    8872    7       
5           157     0       14756   1492    14      
6           157     0       1054    15198   1       
7           157     0       1054    15198   1       
8           157     0       1054    15198   1       
9           0       0       0       16384   0       

See: http://rubygems.org/gems/innodb_ruby

See: https://github.com/jeremycole/innodb_ruby
[22 Jan 2013 19:27] Davi Arnaut
Link to fix: https://github.com/twitter/mysql/commit/48996cad34e0f2e11068f92f86e1eb7eda1d87fe

Apparently, "contributions can be accepted to Open bugs only".
[23 Jan 2013 13:41] Sinisa Milivojevic
Davi,

Thank you for your contribution.

It will be very seriously analyzed and considered.
[24 Jan 2013 15:09] Calvin Sun
The cause of the problem is from this change: http://lists.mysql.com/internals/12980?f=plain.
[30 Jan 2013 1:52] Calvin Sun
Heikki proposed the following changes to address the issue:

A better heuristic: if the number of records on the page is "big" (say, there are > 10 records), then we can require PAGE_N_DIRECTION > 5 to believe this is a sequential insert pattern.

Something like this for btr_page_get_split_rec_to_left():

    if ((page_header_get_ptr(page, PAGE_LAST_INSERT)
         == page_rec_get_next(insert_point))
            && ((page_get_n_recs(page) < 10)
                || ((page_header_get_field(page, PAGE_DIRECTION) == PAGE_LEFT)
                     && (page_header_get_field(page, PAGE_N_DIRECTION)
                         >= 5)
                   )
               )
           ) { /* Ok, it is fairly sure this really is consecutive inserts */

This change would make the split algorithm more conservative. Can this cause bad space utilization under some other workloads? Unfortunately, yes.
[21 Apr 2013 3:57] Davi Arnaut
Here is a little bit simpler variation that exploits the way the root page is split:

CREATE TABLE t1 (a BIGINT PRIMARY KEY, b VARCHAR(4096)) ENGINE=InnoDB;

# Build b-tree so that the leftmost leaf page has records 0, 1 and 2.
INSERT INTO t1 VALUES (0, REPEAT('a', 4096));
INSERT INTO t1 VALUES (1000, REPEAT('a', 4096));
INSERT INTO t1 VALUES (1001, REPEAT('a', 4096));
INSERT INTO t1 VALUES (1002, REPEAT('a', 4096));
INSERT INTO t1 VALUES (1, REPEAT('a', 4096));
INSERT INTO t1 VALUES (2, REPEAT('a', 4096));

# Each insert/record down from 999 will end up on its on page.
INSERT INTO t1 VALUES (999, REPEAT('a', 4096));
INSERT INTO t1 VALUES (998, REPEAT('a', 4096));
[22 Apr 2013 13:46] Sinisa Milivojevic
Davi,

Thanks for your contribution.
[28 May 2014 9:51] Yasufumi Kinoshita
Basically, InnoDB always attempts to insert record to the end of page rather
than the first of the next page, if the insert record is between the pages.

I think the better solution is...
"If the insert point is the last of the page, try to insert to the next page
before split the page."

The changing splitting strategy affects to the other cases and might cause
bad effect.
But this fix doesn't affect to the other cases.
[23 Jun 2014 14:41] Daniel Price
Fixed as of the upcoming 5.6.20, 5.7.5 release, and here's the changelog entry:

Inserting a record into an "InnoDB" table with a key that falls between
the maximum key of a full page and the minimum key of the "next" page
could result in unnecessary page splits and under-filled pages. If the
insert point is at the end of a page, "InnoDB" now attempts to insert to
the next page before splitting the page. 

Thank you for the bug report.
[2 Jul 2014 11:24] Daniel Price
The fix for "Bug #67718 InnoDB drastically under-fills pages in certain conditions" has been reverted due to a recently encountered regression. The fix does not appear in 5.6.20, as previously documented.
[2 Jul 2014 16:41] Jeremy Cole
Ouch. Thanks for the try anyway. Looking forward to seeing the patch in the future. :)
[14 Jul 2014 12:23] Daniel Price
The issue with the previous patch has been addressed. 

Fixed as of the upcoming 5.6.21, 5.7.5 release, and here's the changelog
entry:

Inserting a record into an "InnoDB" table with a key that falls between
the maximum key of a full page and the minimum key of the "next" page
could result in unnecessary page splits and under-filled pages. If the
insert point is at the end of a page, "InnoDB" now attempts to insert to
the next page before splitting the page.

Thank you for the bug report.
[7 Aug 2014 6:46] Laurynas Biveinis
$ bzr log -n0 -r 6008
------------------------------------------------------------
revno: 6008
committer: Yasufumi Kinoshita <yasufumi.kinoshita@oracle.com>
branch nick: mysql-5.6
timestamp: Fri 2014-06-20 08:48:48 +0200
message:
  Bug#15923864 : Bug#67718 : INNODB DRASTICALLY UNDER-FILLS PAGES IN CERTAIN CONDITIONS
  
  InnoDB should try to insert to the next page before split the page, if the insert record for split_and_insert is last of the page.
  
  Otherwise, the (reverse) sequential inserts to the middle of the index might cause 1 record per 1 page.
  
  Approved by Marko in rb#5563
[7 Aug 2014 6:58] Laurynas Biveinis
$ bzr log -n0 -r 6050
------------------------------------------------------------
revno: 6050
committer: Marko M?kel? <marko.makela@oracle.com>
branch nick: mysql-5.6.20-release
timestamp: Tue 2014-07-01 11:41:45 +0300
message:
  Revert Bug#15923864 fix (revno 6008) from MySQL 5.6.20.
[24 Sep 2014 17:15] Daniel Price
bzr log -n0 -r 6073
------------------------------------------------------------
revno: 6073
committer: Yasufumi Kinoshita <yasufumi.kinoshita@oracle.com>
branch nick: mysql-5.6
timestamp: Mon 2014-07-14 15:06:43 +0900
message:
  This is the second attempt to fix the bug
  
  Bug#15923864 : Bug#67718 : INNODB DRASTICALLY UNDER-FILLS PAGES IN CERTAIN CONDITIONS
  
  InnoDB should try to insert to the next page before split the page, if the insert record for split_and_insert is last of the page.
  
  Otherwise, the (reverse) sequential inserts to the middle of the index might cause 1 record per 1 page.
  
  -------
  The previous attempt had the bug and fixed already at this push.
  
  Bug#19045042 : ASSERT MACH_READ_FROM_8()==INDEX->ID RECV_RECOVERY_IS_ON() IS_INSIDE_IBUF()
  The new function btr_insert_into_right_sibling() misses to treat ibuf bitmap for the page which the record inserted.