MySQL Bugs: #38550: multiple optimizer bugs for queries with references from preceding table

Bug #38550	multiple optimizer bugs for queries with references from preceding table
Submitted:	4 Aug 2008 21:47	Modified:	16 Oct 2008 3:53
Reporter:	Ruslan Zakirov	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: Optimizer	Severity:	S5 (Performance)
Version:	5.0.60, 5.1	OS:	Any
Assigned to:		CPU Architecture:	Any
Tags:	INDEX, multiple-columns index, preceding table, range, ref

Description:
Hi,

I reported bug #38049 and this bug report is related and goes further and closer to queries we're using in our product. Data is the same when query is different:

SELECT STRAIGHT_JOIN DISTINCT g.id
FROM acl a, groups g
WHERE (g.domain = 'queue' OR g.domain = 'system')
      AND g.type = a.type

It's a range condition on g.domain and ref on g.type, when in previous we have references for both columns.

Both InnoDB and MyISAM engines have bugs, let's start from MyISAM

CREATE TABLE `groups` (
  `id` int(11) NOT NULL,
  `domain` varchar(32) default NULL,
  `type` varchar(32) default NULL,
  PRIMARY KEY  (`id`),
  KEY `groups_d` (`domain`),
  KEY `groups_t` (`type`),
  KEY `groups_dt` (`domain`,`type`),
  KEY `groups_td` (`type`,`domain`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1

Test results:

EXPLAIN SELECT STRAIGHT_JOIN DISTINCT g.id
        FROM acl a, groups g USE INDEX (groups_t)
        WHERE (g.domain = 'queue' OR g.domain = 'system') AND g.type = a.type
$VAR1 = {
          'type' => 'ref'
          'rows' => '1004',
          'key' => 'groups_t',
          'key_len' => '35',
          'Extra' => 'Using where',
        };
Excellent estimations, everything is correct even if it's slowest candidate.

EXPLAIN SELECT STRAIGHT_JOIN DISTINCT g.id
        FROM acl a, groups g USE INDEX (groups_d)
        WHERE (g.domain = 'queue' OR g.domain = 'system') AND g.type = a.type
$VAR1 = {
          'type' => 'range'
          'rows' => '15',
          'key' => 'groups_d',
          'key_len' => '35',
          'Extra' => 'Using where',
        };
Also good. It's mysql's choice.

EXPLAIN SELECT STRAIGHT_JOIN DISTINCT g.id
        FROM acl a, groups g USE INDEX (groups_dt)
        WHERE (g.domain = 'queue' OR g.domain = 'system') AND g.type = a.type
$VAR1 = {
          'type' => 'range'
          'rows' => '17',
          'key' => 'groups_dt',
          'key_len' => '35',
          'Extra' => 'Using where',
        };
Here we have first problems:
1) key_len is still 35, so mysql will use only first part of the index when I think it can use both conditions to do better range scan
2) Even if 1) is not possible then extra should have 'using index' as the index can be used to filter records. Benchmarks below show that InnoDB benefits from it a little.
3) Slight shift in row estimations, worth investigation, but it's up to you to decide if it's bug or not.

Desired output: range type with 4 rows and key_len is 70.

EXPLAIN SELECT STRAIGHT_JOIN DISTINCT g.id
        FROM acl a, groups g USE INDEX (groups_td)
        WHERE (g.domain = 'queue' OR g.domain = 'system') AND g.type = a.type
$VAR1 = {
          'type' => 'ref'
          'rows' => '1004',
          'key' => 'groups_td',
          'key_len' => '35',
          'Extra' => 'Using where',
        };
This is totally wrong. It should be range access using both columns, most probably mysql estimate possibility of range access by rows estimations are wrong (close to 1004 or even bigger) so ref access wins.

Ok let's consider that ref access is correct in this case then mysql should use index's data to filter records after scan at least, see tests for innodb below.

Let's switch to InnoDB.

EXPLAIN SELECT STRAIGHT_JOIN DISTINCT g.id
        FROM acl a, groups g USE INDEX (groups_t)
        WHERE (g.domain = 'queue' OR g.domain = 'system') AND g.type = a.type
$VAR1 = {
          'type' => 'ref'
          'rows' => '2085',
          'key' => 'groups_t',
          'key_len' => '35',
          'Extra' => 'Using where',
        };

Rows estimation is twice bigger than it should be, not an order, but anyway.

EXPLAIN SELECT STRAIGHT_JOIN DISTINCT g.id
        FROM acl a, groups g USE INDEX (groups_d)
        WHERE (g.domain = 'queue' OR g.domain = 'system') AND g.type = a.type
$VAR1 = {
          'type' => 'range'
          'rows' => '16',
          'key' => 'groups_d',
          'key_len' => '35',
          'Extra' => 'Using where',
        };
Good results.

EXPLAIN SELECT STRAIGHT_JOIN DISTINCT g.id
        FROM acl a, groups g USE INDEX (groups_dt)
        WHERE (g.domain = 'queue' OR g.domain = 'system') AND g.type = a.type
$VAR1 = {
          'type' => 'range'
          'rows' => '16',
          'key' => 'groups_dt',
          'key_len' => '35',
          'Extra' => 'Using where; Using index',
        };
Again partial using of index for range scan, however:
1) uses index to filter out records. Benchmark shows a little speedup for InnoDB over single-column index on domain column, when myisam is slower here and faster with the previous query.
2) Rows estimation is better than for myisam considering that only first part of the index is used for range scan.

EXPLAIN SELECT STRAIGHT_JOIN DISTINCT g.id
        FROM acl a, groups g USE INDEX (groups_td)
        WHERE (g.domain = 'queue' OR g.domain = 'system') AND g.type = a.type
$VAR1 = {
          'type' => 'ref'
          'rows' => '695',
          'key' => 'groups_td',
          'key_len' => '35',
          'Extra' => 'Using where; Using index',
        };
This index sucks totally:
1) Again ref access instead of range
2) Rows estimation is not even close to estimation based on single-column index on type column and this is considering the fact that only this first part of the index will be used.
3) The only benefit here is using index what gives twice speed up in benchmark.

Benchmarks:

          myisam         innodb
groups_t  5.487526      6.609085
groups_d  0.618371      0.567414
groups_dt 0.700012      0.502292
groups_td 5.788753      2.556196

How to repeat:
In the attachment you can find scripts to create data (create_data.pl), generate test output (test.pl) and simple benchmark (bench.pl) which shows differences in execution plans and correlations between EXPLAIN and actual execution.

Will attach archive with files.

Suggested fix:
Don't know.

perl scripts to create, test and banch

Attachment: range_vs_ref_scan_with_preceding_table.tar.gz (application/gzip, text), 1.42 KiB.

CREATE TABLE acl(id INT PRIMARY KEY, type varchar(32));

CREATE TABLE groups(id int(11) NOT NULL, domain varchar(32) default NULL, type varchar(32) default NULL, PRIMARY KEY(id)) ENGINE=MyISAM;

INSERT INTO acl VALUES(1, 'owner'), (2, 'admincc');

DELIMITER //
create procedure fill_data()
begin
declare i integer default 1; 
declare j integer default 2;
declare k integer default 3;
declare l integer default 4;
declare t integer default 3;
insert into groups values(i,'system','owner'),(j,'system','requester'),(k,'system','admincc'),(l,'system','cc');
set i= i+4;
set j= j+4;
set k= k+4;
set l= l+4;
while t > 0 do
insert into groups values(i,'queue','owner'),(j,'queue','requester'),(k,'queue','admincc'),(l,'queue','cc');
set i= i+4;
set j= j+4;
set k= k+4;
set l= l+4;
set t= t-1;
end while;
set t:= 1000;
while t > 0 do
insert into groups values(i,'ticket','owner'),(j,'ticket','requester'),(k,'ticket','admincc'),(l,'ticket','cc');
set i= i+4;
set j= j+4;
set k= k+4;
set l= l+4;
set t= t-1;
end while;
end//

DELIMITER ;
call fill_data();

CREATE INDEX groups_d ON groups(domain);
CREATE INDEX groups_t ON groups(type);
CREATE INDEX groups_dt ON groups(domain, type);
CREATE INDEX groups_td ON groups(type, domain);

EXPLAIN SELECT STRAIGHT_JOIN DISTINCT g.id FROM acl a, groups g USE INDEX(groups_t) WHERE (g.domain='queue' OR g.domain='system') AND g.type=a.type\G

EXPLAIN SELECT STRAIGHT_JOIN DISTINCT g.id FROM acl a, groups g USE INDEX (groups_d) WHERE (g.domain='queue' OR g.domain='system') AND g.type=a.type\G

EXPLAIN SELECT STRAIGHT_JOIN DISTINCT g.id FROM acl a, groups g USE INDEX(groups_dt) WHERE (g.domain='queue' OR g.domain='system') AND g.type=a.type\G

EXPLAIN SELECT STRAIGHT_JOIN DISTINCT g.id FROM acl a, groups g USE INDEX(groups_td) WHERE (g.domain='queue' OR g.domain='system') AND g.type=a.type\G

truncate acl;
truncate groups;

drop index groups_t on groups;
drop index groups_d on groups;
drop index groups_dt on groups;
drop index groups_td on groups;

alter table acl engine='innodb';
alter table groups engine='innodb';

call fill_data();

CREATE INDEX groups_d ON groups(domain);
CREATE INDEX groups_t ON groups(type);
CREATE INDEX groups_dt ON groups(domain, type);
CREATE INDEX groups_td ON groups(type, domain);

EXPLAIN SELECT STRAIGHT_JOIN DISTINCT g.id FROM acl a, groups g USE INDEX(groups_t) WHERE (g.domain='queue' OR g.domain='system') AND g.type=a.type\G

EXPLAIN SELECT STRAIGHT_JOIN DISTINCT g.id FROM acl a, groups g USE INDEX (groups_d) WHERE (g.domain='queue' OR g.domain='system') AND g.type=a.type\G

EXPLAIN SELECT STRAIGHT_JOIN DISTINCT g.id FROM acl a, groups g USE INDEX(groups_dt) WHERE (g.domain='queue' OR g.domain='system') AND g.type=a.type\G

EXPLAIN SELECT STRAIGHT_JOIN DISTINCT g.id FROM acl a, groups g USE INDEX(groups_td) WHERE (g.domain='queue' OR g.domain='system') AND g.type=a.type\G

I can reproduce this with my tests on MySQL bzr tree for 5.0 and 5.1

> Here we have first problems:
> 1) key_len is still 35, so mysql will use only first part of the index when I think it can
> use both conditions to do better range scan

No, at the moment range scan may not depend on another table (with exception
of references to constant tables which are constidered constants). It seems
you're asking about combining ref and range access together. We have this on
somewhere our todo but I can't tell when this will become available.

In MySQL-6.0 we have IndexConditionPushdown optimization which is expected
to rectify the situation a bit. With IndexConditionPushdown, the number of
index entries will be still the same but full table records will be
retrieved only if the whole condition is satisfied. 

> 2) Even if 1) is not possible then extra should have 'using index' as the index can be
> used to filter records. Benchmarks below show that InnoDB benefits from it a little.

It can't have "Using index" because g.id is not covered by the index. InnoDB
implicitly includes all primary key columns at the end of every index, MyISAM 
doesn't do that.

> 3) Slight shift in row estimations, worth investigation, but it's up to you to decide if
> it's bug or not.

It's not a bug, the number is an estimate from the range optimizer estimate and some variety 
in estimation is expected.

Desired output: range type with 4 rows and key_len is 70.

EXPLAIN SELECT STRAIGHT_JOIN DISTINCT g.id
        FROM acl a, groups g USE INDEX (groups_td)
        WHERE (g.domain = 'queue' OR g.domain = 'system') AND g.type = a.type
$VAR1 = {
          'type' => 'ref'
          'rows' => '1004',
          'key' => 'groups_td',
          'key_len' => '35',
          'Extra' => 'Using where',
        };
> This is totally wrong. It should be range access using both columns, most probably mysql
> estimate possibility of range access by rows estimations are wrong (close to 1004 or even
> bigger) so ref access wins.

See the above note about range access. Only ref access can be used for this query. Its estimates are based on index cardinalities. Do you have a highly-skewed data distribution?

Ok let's consider that ref access is correct in this case then mysql should use index's
data to filter records after scan at least, see tests for innodb below.

> Let's switch to InnoDB.

EXPLAIN SELECT STRAIGHT_JOIN DISTINCT g.id
        FROM acl a, groups g USE INDEX (groups_t)
        WHERE (g.domain = 'queue' OR g.domain = 'system') AND g.type = a.type
$VAR1 = {
          'type' => 'ref'
          'rows' => '2085',
          'key' => 'groups_t',
          'key_len' => '35',
          'Extra' => 'Using where',
        };

> Rows estimation is twice bigger than it should be, not an order, but anyway.

It's a known property of InnoDB. It shouldn't reach order-of-magnitude differences.

The issues raised in this bug are valid, but none of them is a "bug" in a sense that they do not describe something which was expected to be implemented in released versions but wasn't.

Changing status to ToBeFixedLater. 

Ruslan, thanks for the input.

Was about to file a bug for this, but found this by searching...the notes I typed up and simple use case may be useful in the fix:

When referenced in a join (by the first column), a compound index is not able to be used for (col1 equality,col2 range), despite having the data in the key structured to easily do so...and being able to use it for (col1 equality,col2 equality)...or use it on a single table query.  See example below

Simple example case...a table of server farms, and a log of when an arbitrary event happens on one of them.  A report query (the one in question) wants to find the counts of log events per farm for a given time range.  Note that you'll have to replace the times I used with whatever you inserted, or just manually specify on the inserts.

CREATE TABLE farm (
  id tinyint unsigned NOT NULL AUTO_INCREMENT,
  name varchar(50) NOT NULL,
  PRIMARY KEY (id),
  UNIQUE KEY name (name)
) ENGINE=InnoDB;

CREATE TABLE log (
  id int unsigned NOT NULL AUTO_INCREMENT,
  farm_id tinyint unsigned NOT NULL,
  log_time TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (id),
  KEY (farm_id,log_time)
) ENGINE=InnoDB;

insert into farm values (1,'test farm 1'),(2,'test farm 2');
insert into log (farm_id) select id from farm;
select sleep(1);
insert into log (farm_id) select farm_id from log;
select sleep(1);
insert into log (farm_id) select farm_id from log;
select sleep(1);
insert into log (farm_id) select farm_id from log;
select sleep(1);
insert into log (farm_id) select farm_id from log;
select sleep(1);
insert into log (farm_id) select farm_id from log;
select sleep(1);
insert into log (farm_id) select farm_id from log;
select sleep(1);
insert into log (farm_id) select farm_id from log;
select sleep(1);
insert into log (farm_id) select farm_id from log;
select sleep(1);
insert into log (farm_id) select farm_id from log;
select sleep(1);
insert into log (farm_id) select farm_id from log;
select sleep(1);
insert into log (farm_id) select farm_id from log;
select sleep(1);
insert into log (farm_id) select farm_id from log;
select sleep(1);
insert into log (farm_id) select farm_id from log;

select log.farm_id,count(*)
from log
where log.farm_id = 1 and log.log_time between '2012-07-18 16:58:17' and '2012-07-18 16:58:18'
group by log.farm_id;
1	SIMPLE	log	range	farm_id	farm_id	5	NULL	1	Using where; Using index
-- expected behavior, uses (equal,range) when using single table

select farm.id,count(*)
from farm
left join log on log.farm_id = farm.id and log.log_time = '2012-07-18 16:58:17'
where farm.id=1
group by farm.id;
1	SIMPLE	farm	const	PRIMARY	PRIMARY	1	const	1	Using index
1	SIMPLE	log	ref	farm_id	farm_id	5	const,const	256	Using index
-- expected behavior, dual const (equal,equal) lookup on compound key

select farm.id,count(*)
from farm
left join log on log.farm_id = farm.id and log.log_time between '2012-07-18 16:58:17' and '2012-07-18 16:58:18'
where farm.id=1
group by farm.id;
1	SIMPLE	farm	const	PRIMARY	PRIMARY	1	const	1	Using index
1	SIMPLE	log	range	farm_id	farm_id	5	NULL	767	Using where; Using index
-- expected behavior, when farm_id is provided it uses the range correctly as (equal,range), so it knows the join is on that equality

select farm.id,count(*)
from farm
left join log on log.farm_id = farm.id and log.log_time = '2012-07-18 16:58:17'
group by farm.id;
1	SIMPLE	farm	index	NULL	PRIMARY	1	NULL	1	Using index
1	SIMPLE	log	ref	farm_id	farm_id	5	test.farm.id,const	428	Using index
-- expected behavior, see how it uses that compound key to do an equality on ref'd farm_id and then a const for the time provided, (equal,equal) in index order

select farm.id,count(*)
from farm
left join log on log.farm_id = farm.id and log.log_time between '2012-07-18 16:58:17' and '2012-07-18 16:58:18'
group by farm.id;
1	SIMPLE	farm	index	NULL	PRIMARY	1	NULL	1	Using index
1	SIMPLE	log	ref	farm_id	farm_id	1	test.farm.id	8572	Using index
-- unexpected!  where in the previous query it was able to consider the ref'd column an equality and use columns further down the index for equalities like the (equal,range) scans above, in this case it does not consider it so - usually you can have a key on (col1,col2) and do an equality on col1 and a range on col2, and let it use the index for it...and the above query would imply that mysql knows this...but it throws it away with a range on the second column and only scans by the first ('rows' is the same when the time condition is removed, key_len=1 instead of 5, and performance/rows examined proves this in my much larger use case)

Note that this might also be able to optimize queries with a key on (col1,col2) that do "select foo from bar where col2>range_condition group by col1" to use the key - the data is there, structured in order.

How feasible is this?  I can see great use cases for it.  In reference to my above example, something like this:
select farm_id,count(*)
from log
where log_time between '2012-07-18 16:58:17' and '2012-07-18 16:58:18'
group by farm_id;
1	SIMPLE	log	index	NULL	farm_id	5	NULL	17144	Using where; Using index
-- does a full index scan and doesn't use the range condition while it of course could as the group by is being optimized by the key

Also, to note, still exists in 5.5.17