MySQL Bugs: #19548: JOIN and composite indexes (not using full length)

Bug #19548	JOIN and composite indexes (not using full length)
Submitted:	4 May 2006 22:09	Modified:	8 Apr 2010 16:16
Reporter:	Mathew Johnston	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: Optimizer	Severity:	S4 (Feature request)
Version:	5.0.18-pro, 5.1.40sp1	OS:	Linux (Redhat Linux ES4)
Assigned to:		CPU Architecture:	Any

Description:
This is a re-report of Bug #12284. 

I have found that when joining between tables using criteria on multiple columns which are indexed (in a composite index), only the first column included in the index is used. This results in far lower performance on large tables than is possible.

Of course on such a small table the impact is not significant, but in my real-world situation my test_main table has billions of records and test_fk has about 80,000 (and slowly grows). 

Thank you,
Mathew Johnston

How to repeat:

-- TEST DATA CREATION

create database jointest;
use jointest;

create table test_fk(
	id int unsigned auto_increment primary key, 
	name varchar(255) not null, 
	key(name)
);
insert into test_fk (name) values ('a1');
insert into test_fk (name) values ('a2');
insert into test_fk (name) values ('b1');
insert into test_fk (name) values ('b2');
insert into test_fk (name) values ('b3');
insert into test_fk (name) values ('c1');
insert into test_fk (name) values ('d1');

create table test_main(
	id int unsigned auto_increment primary key, 
	fk_id int unsigned not null, 
	key fk_id_id (fk_id, id), 
	key id_fk_id (id, fk_id),
	constraint `test_fk` foreign key (fk_id) references test_fk (id)
);
insert into test_main (fk_id) values (1);
insert into test_main (fk_id) values (2);
insert into test_main (fk_id) values (3);
insert into test_main (fk_id) values (4);
insert into test_main (fk_id) values (1);
insert into test_main (fk_id) values (1);
insert into test_main (fk_id) values (1);
insert into test_main (fk_id) values (3);
insert into test_main (fk_id) values (3);
insert into test_main (fk_id) values (3);
insert into test_main (fk_id) values (3);
insert into test_main (fk_id) values (5);
insert into test_main (fk_id) values (6);
insert into test_main (fk_id) values (7);
insert into test_main (fk_id) values (3);
insert into test_main (fk_id) values (3);

-- DONE TEST DATA CREATION

mysql> explain select count(*) from test_main where test_main.id > 3 and test_main.id < 10 and fk_id IN (3,4,5);
+----+-------------+-----------+-------+---------------------------+----------+---------+------+------+--------------------------+
| id | select_type | table     | type  | possible_keys             | key      | key_len | ref  | rows | Extra                    |
+----+-------------+-----------+-------+---------------------------+----------+---------+------+------+--------------------------+
|  1 | SIMPLE      | test_main | range | PRIMARY,fk_id_id,id_fk_id | fk_id_id | 8       | NULL |    3 | Using where; Using index |
+----+-------------+-----------+-------+---------------------------+----------+---------+------+------+--------------------------+
1 row in set (0.02 sec)

Notice that in this query the fk_id_id index use is to key_len 8 (both ID ranges are resolved using the index).

In the following query, my expectation is that 8 bytes of an index will also be used.

mysql> explain select count(*) from test_main join test_fk on test_main.fk_id=test_fk.id where test_main.id > 3 and test_main.id < 10 and test_fk.name like 'b%';
+----+-------------+-----------+-------+---------------------------+----------+---------+---------------------+------+--------------------------+
| id | select_type | table     | type  | possible_keys             | key      | key_len | ref                 | rows | Extra                    |
+----+-------------+-----------+-------+---------------------------+----------+---------+---------------------+------+--------------------------+
|  1 | SIMPLE      | test_fk   | range | PRIMARY,name              | name     | 257     | NULL                |    2 | Using where              |
|  1 | SIMPLE      | test_main | ref   | PRIMARY,fk_id_id,id_fk_id | fk_id_id | 4       | jointest.test_fk.id |    2 | Using where; Using index |
+----+-------------+-----------+-------+---------------------------+----------+---------+---------------------+------+--------------------------+
2 rows in set (0.00 sec)

Notice that only 4 bytes are in fact used - when joining, the index is only used to resolve all records with a given test_fk.id value instead of all records with a given test_fk.id within the given range of test_main.id values.

Further, I would expect that the id_fk_id index would also be usable, although less suitable since test_main.id has many more values than test_fk.id (a larger number of smaller value ranges would need to be scanned). 

Suggested fix:
I don't have enough knowledge of the internal architecture of the optimizer and query engine to recommend a specific fix. It seems as if MySQL only considers criteria comparing values from both tables in index choice/use. I'm asking that other criteria against the two tables being joined also be evaluated against indexes used to support the actual join operation.

Further, I would expect that the id_fk_id index would also be 
     usable, although less suitable since test_main.id has many 
     more values than test_fk.id (a larger number of smaller value 
     ranges would need to be scanned). 

Actually, I wouldn't expect this particular index to be quite so usable now that I've re-read the range optimization documentation... however this was only a side note - not the main point of my report. Please don't take this note as a retraction of my bug - it is certainly not.

I should point out that it appears to be the range criteria against the "main_test.id" column that is the issue. Equality criteria with a constant against test_main seems to leverage the index as I'm hoping. For example:

alter table test_main add column x varchar(255);
update test_main set x='hello';
update test_main set x='goodbye' where id=9;
create index fk_id_x on test_main(fk_id, x);

mysql> explain select count(*) from test_main join test_fk on test_fk.id=fk_id where test_main.id > 3 and test_main.id < 10 and test_fk.name like 'b%' and x='goodbye';
+----+-------------+-----------+-------+-----------------------------------+---------+---------+---------------------------+------+-------------+
| id | select_type | table     | type  | possible_keys                     | key     | key_len | ref                       | rows | Extra       |
+----+-------------+-----------+-------+-----------------------------------+---------+---------+---------------------------+------+-------------+
|  1 | SIMPLE      | test_fk   | range | PRIMARY,name                      | name    | 257     | NULL                      |    2 | Using where |
|  1 | SIMPLE      | test_main | ref   | PRIMARY,fk_id_id,id_fk_id,fk_id_x | fk_id_x | 262     | jointest.test_fk.id,const |    2 | Using where |
+----+-------------+-----------+-------+-----------------------------------+---------+---------+---------------------------+------+-------------+
2 rows in set (0.00 sec)

So, it really appears that the optimizer can use extra criteria against one of the tables doing a join, but seemingly not using range optimizations?

Fixed in 6.0 by the work done for WL#2475 "Batched range read functions for
MyISAM/InnoDB"

With WL#2475 "Batched range read functions for MyISAM/InnoDB" merged into mysql-trunk (5.6.2) the query plan/explain for the initial JOIN in this report has changed slightly:

explain 
select count(*) from test_main join test_fk on test_main.fk_id=test_fk.id
where test_main.id > 3 and test_main.id < 10 and test_fk.name like 'b%';
id      select_type     table   type    possible_keys   key     key_len ref     rows    Extra
1       SIMPLE  test_fk range   PRIMARY,name    name    257     NULL    2       Using index condition
1       SIMPLE  test_main       ref     PRIMARY,fk_id_id,id_fk_id       fk_id_id        4       test.test_fk.id 2       Using where; Using index

This shows that Index Condition Pushdown (ICP) implemented by WL#2475 is used for reading data from the first table but not for the second table. This will have very little impact on the performance of the execution of the query as the JOIN will done similarly to the how it was done before WL#2475. Only if ICP had been used during access of the second table ("test_main") there would be a real effect on the performance.

This request was for MySQL to utilize more of the index on the second table to be used during the join. If I understand correctly the implementation of WL#2475 does not change this for the example given in this bug report.