MySQL Bugs: #68546: InnoDB stores PRIMARY KEY fields in unique secondary index non-leaf pages

Bug #68546	InnoDB stores PRIMARY KEY fields in unique secondary index non-leaf pages
Submitted:	1 Mar 2013 19:58	Modified:	19 Aug 2013 17:15
Reporter:	Davi Arnaut (OCA)	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: InnoDB storage engine	Severity:	S3 (Non-critical)
Version:	5.5	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
Currently InnoDB stores "unnecessary" PKV fields into unique secondary index node pages, inflating their size. For a table with a key PRIMARY KEY (a, b, c) and UNIQUE KEY (a, d), InnoDB will store the fields b and c in the node pages of the unique key on (a, d) where they are not necessary. The size of the node pages could be reduced if these fields were not stored allowing the number of node pages required to be reduced substantially for small SK key fields.

How to repeat:
Examine the node page of an unique secondary key index.

Thank you for the bug report.

First of all, UNIQUE indexes do allow NULL values. Let me expand the example from the Description:

CREATE TABLE t1 (a INT, b INT, c INT, d INT,
 PRIMARY KEY(a,b,c),
 UNIQUE KEY (a,d)) ENGINE=InnoDB;
INSERT INTO t1 VALUES(1,1,1,NULL),(1,1,2,NULL),...,(1,1,99999,NULL);

This will create two entries in the unique secondary index (a,d), whose fields would be
(a,d,b,c)=(1,NULL,1,1),(1,NULL,1,2),...

Multiple 'duplicate' values with NULL values in the keys are allowed, because NULL is not equal to NULL in SQL semantics. (Note that because the PRIMARY KEY does not allow NULL values, adding the PRIMARY KEY fields will make the row unique.)

This does not contradict your request yet. While we do need the PRIMARY KEY fields in the leaf page records, we could theoretically slim down the node pointer records (non-leaf page records). If we assume that each leaf page contains roughly 100 records, the node pointer records would currently be of this form:

(1,NULL,1,1,page_no1),(1,NULL,1,101,page_no2),(1,NULL,1,201,page_no3),...

Now, if we are searching for (a,d,c)=(1,NULL,430) we could theoretically prune the leaf pages for which there is not going to be any match. Unfortunately, Index Condition Pushdown currently only works on the leaf page records, so there is no saving here. (Look for row_search_idx_cond_check().)

A more important reason is cursor positioning. Let us assume that btr_pcur_t::old_rec points to (1,NULL,1,330). With the current scheme, btr_pcur_restore_position() can simply dereference the node pointer of (1,NULL,1,301,page_no) and then descend to the page_no. If the node pointer records only stored (1,NULL,page_no), then the cursor restore would have to scan all leaf pages that have the prefix (1,NULL).

Now, let us assume that you ask for this optimization only when the unique index does not allow NULL values.

Another problematic and maybe improbable case would be the update of the PRIMARY KEY. Let us assume the following:

CREATE TABLE t2 (a INT, b INT, c INT, d INT NOT NULL,
 PRIMARY KEY(a,b,c),
 UNIQUE KEY (a,d)) ENGINE=InnoDB;
INSERT INTO t2 VALUES(1,1,1,1),(1,1,2,2),...(1,1,99999,99999);
BEGIN;
UPDATE t2 SET c=c+100000;
UPDATE t2 SET c=c+100000;
UPDATE t2 SET c=c+100000;
...
UPDATE t2 SET c=c+100000; -- nth update

See https://blogs.oracle.com/mysqlinnodb/entry/mysql_5_5_innodb_change for some background on multi-versioning.

The INSERT would create the following leaf page records in the unique index of t2:

(a,d,b,c)=(1,1,1,1),(1,2,1,2),...,(1,99999,1,99999)

Let us assume that purge does not run immediately after the UPDATEs. The first UPDATE would delete-mark all of the above entries and insert new ones:

(a,d,b,c)=(1,1,1,100001),(1,2,1,100002),...,(1,99999,1,199999)

Likewise, each new update would delete-mark the existing rows and insert new ones. Finally, we would have the following delete-unmarked rows in the unique secondary index, with m=100000*n:

(a,d,b,c)=(1,1,1,m+1),(1,2,1,m+2),...,(1,99999,1,m+99999)

In addition to this, we will have a number of delete-marked rows:
for k = 0 to m-1 step 100000:
 (a,d,b,c)=(1,1,1,k+1),(1,2,1,k+2),...,(1,99999,1,k+99999)

Now, let us assume that we omit the 'unnecessary' PRIMARY KEY columns
from the node pointer records and have the following only:

(a,d,child_page_no)=(1,1,page_no),(1,2,page_no),...

It could be that we need multiple (1,1,page_no) node pointers, because
in our example, in addition to the row (a,d,b,c)=(1,1,1,m+1) we would
have n delete-marked records identified by (a,d)=(1,1):
(a,d,b,c)=(1,1,1,k+1). The n could be so large that we have several
leaf pages of these records. Again, btr_pcur_restore_position() could
degenerate to a sequential scan, while the current scheme allows it to
uniquely identify the current position.

NOTE 1: In this example, the UPDATE statements have not been committed
yet. So, a concurrently running transaction may need to see the
version history of the rows.

NOTE 2: The node pointer record may point to a record that is later
purged. For MVCC and queries, only the leaf page records matter. (But,
as noted, we could theoretically exploit the node pointer records in
index condition pushdown or for detecting the end of a range scan.)

NOTE 3: The delete-mark flag in InnoDB node pointer records is a waste
of space. Nobody cares about its value.

Thanks Marko. Like we had previously discussed, I should have noted that this only applies if NOT NULL fields are being used.

Davi, my example of updating the PRIMARY KEY columns applies even to NOT NULL UNIQUE indexes.

In theory, we could dynamically truncate the node pointers. Let us assume that we have index records as in the example:

(a,d,b,c)=(1,1,1,1),(1,2,1,2),...,(1,99999,1,99999)

It could suffice to have node pointers
(a=1,d=1,page_no),(a=1,d=101,page_no),(a=1,d=201,page_no)
and so on, instead of having
(a=1,d=1,b=1,c=1,page_no),(1,101,1,101,page_no).

Similarly, if the last column in the remaining node pointer is a string, we could truncate the string to a unique prefix.

Only if we got lots of different values of "b" for a given prefix of (a,d) we would need to expand the node pointer by that field.

Likewise, if we inserted a bunch of records with a=2, it could be sufficient to have a node pointer with just (a=2,page_no) to distinguish it from all the records that start with a=1.

This would require some changes to the B-tree code and a file format change. This optimization would benefit all node pointers, not just those in unique indexes.

The degenerate case you describe are not really that meaningful. If the use of the index degenerates into these kind of problems, there's many other issues that may arise and need to necessarily be addressed by InnoDB. The point remains that for the generalized case, the PKV fields could be eliminated.