Bug #38550 multiple optimizer bugs for queries with references from preceding table
Submitted: 4 Aug 2008 23:47 Modified: 16 Oct 2008 5:53
Reporter: Ruslan Zakirov
Status: Verified
Category:Server: Optimizer Severity:S5 (Performance)
Version:5.0.60, 5.1 OS:Any
Assigned to: Bugs System Target Version:
Tags: multiple-columns index, INDEX, ref, range, preceding table
Triage: D4 (Minor)

[4 Aug 2008 23:47] Ruslan Zakirov
Description:
Hi,

I reported bug #38049 and this bug report is related and goes further and closer to
queries we're using in our product. Data is the same when query is different:

SELECT STRAIGHT_JOIN DISTINCT g.id
FROM acl a, groups g
WHERE (g.domain = 'queue' OR g.domain = 'system')
      AND g.type = a.type

It's a range condition on g.domain and ref on g.type, when in previous we have references
for both columns.

Both InnoDB and MyISAM engines have bugs, let's start from MyISAM

CREATE TABLE `groups` (
  `id` int(11) NOT NULL,
  `domain` varchar(32) default NULL,
  `type` varchar(32) default NULL,
  PRIMARY KEY  (`id`),
  KEY `groups_d` (`domain`),
  KEY `groups_t` (`type`),
  KEY `groups_dt` (`domain`,`type`),
  KEY `groups_td` (`type`,`domain`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1

Test results:

EXPLAIN SELECT STRAIGHT_JOIN DISTINCT g.id
        FROM acl a, groups g USE INDEX (groups_t)
        WHERE (g.domain = 'queue' OR g.domain = 'system') AND g.type = a.type
$VAR1 = {
          'type' => 'ref'
          'rows' => '1004',
          'key' => 'groups_t',
          'key_len' => '35',
          'Extra' => 'Using where',
        };
Excellent estimations, everything is correct even if it's slowest candidate.

EXPLAIN SELECT STRAIGHT_JOIN DISTINCT g.id
        FROM acl a, groups g USE INDEX (groups_d)
        WHERE (g.domain = 'queue' OR g.domain = 'system') AND g.type = a.type
$VAR1 = {
          'type' => 'range'
          'rows' => '15',
          'key' => 'groups_d',
          'key_len' => '35',
          'Extra' => 'Using where',
        };
Also good. It's mysql's choice.

EXPLAIN SELECT STRAIGHT_JOIN DISTINCT g.id
        FROM acl a, groups g USE INDEX (groups_dt)
        WHERE (g.domain = 'queue' OR g.domain = 'system') AND g.type = a.type
$VAR1 = {
          'type' => 'range'
          'rows' => '17',
          'key' => 'groups_dt',
          'key_len' => '35',
          'Extra' => 'Using where',
        };
Here we have first problems:
1) key_len is still 35, so mysql will use only first part of the index when I think it
can use both conditions to do better range scan
2) Even if 1) is not possible then extra should have 'using index' as the index can be
used to filter records. Benchmarks below show that InnoDB benefits from it a little.
3) Slight shift in row estimations, worth investigation, but it's up to you to decide if
it's bug or not.

Desired output: range type with 4 rows and key_len is 70.

EXPLAIN SELECT STRAIGHT_JOIN DISTINCT g.id
        FROM acl a, groups g USE INDEX (groups_td)
        WHERE (g.domain = 'queue' OR g.domain = 'system') AND g.type = a.type
$VAR1 = {
          'type' => 'ref'
          'rows' => '1004',
          'key' => 'groups_td',
          'key_len' => '35',
          'Extra' => 'Using where',
        };
This is totally wrong. It should be range access using both columns, most probably mysql
estimate possibility of range access by rows estimations are wrong (close to 1004 or even
bigger) so ref access wins.

Ok let's consider that ref access is correct in this case then mysql should use index's
data to filter records after scan at least, see tests for innodb below.

Let's switch to InnoDB.

EXPLAIN SELECT STRAIGHT_JOIN DISTINCT g.id
        FROM acl a, groups g USE INDEX (groups_t)
        WHERE (g.domain = 'queue' OR g.domain = 'system') AND g.type = a.type
$VAR1 = {
          'type' => 'ref'
          'rows' => '2085',
          'key' => 'groups_t',
          'key_len' => '35',
          'Extra' => 'Using where',
        };

Rows estimation is twice bigger than it should be, not an order, but anyway.

EXPLAIN SELECT STRAIGHT_JOIN DISTINCT g.id
        FROM acl a, groups g USE INDEX (groups_d)
        WHERE (g.domain = 'queue' OR g.domain = 'system') AND g.type = a.type
$VAR1 = {
          'type' => 'range'
          'rows' => '16',
          'key' => 'groups_d',
          'key_len' => '35',
          'Extra' => 'Using where',
        };
Good results.

EXPLAIN SELECT STRAIGHT_JOIN DISTINCT g.id
        FROM acl a, groups g USE INDEX (groups_dt)
        WHERE (g.domain = 'queue' OR g.domain = 'system') AND g.type = a.type
$VAR1 = {
          'type' => 'range'
          'rows' => '16',
          'key' => 'groups_dt',
          'key_len' => '35',
          'Extra' => 'Using where; Using index',
        };
Again partial using of index for range scan, however:
1) uses index to filter out records. Benchmark shows a little speedup for InnoDB over
single-column index on domain column, when myisam is slower here and faster with the
previous query.
2) Rows estimation is better than for myisam considering that only first part of the
index is used for range scan.

EXPLAIN SELECT STRAIGHT_JOIN DISTINCT g.id
        FROM acl a, groups g USE INDEX (groups_td)
        WHERE (g.domain = 'queue' OR g.domain = 'system') AND g.type = a.type
$VAR1 = {
          'type' => 'ref'
          'rows' => '695',
          'key' => 'groups_td',
          'key_len' => '35',
          'Extra' => 'Using where; Using index',
        };
This index sucks totally:
1) Again ref access instead of range
2) Rows estimation is not even close to estimation based on single-column index on type
column and this is considering the fact that only this first part of the index will be
used.
3) The only benefit here is using index what gives twice speed up in benchmark.

Benchmarks:

          myisam         innodb
groups_t  5.487526      6.609085
groups_d  0.618371      0.567414
groups_dt 0.700012      0.502292
groups_td 5.788753      2.556196

How to repeat:
In the attachment you can find scripts to create data (create_data.pl), generate test
output (test.pl) and simple benchmark (bench.pl) which shows differences in execution
plans and correlations between EXPLAIN and actual execution.

Will attach archive with files.

Suggested fix:
Don't know.
[4 Aug 2008 23:49] Ruslan Zakirov
perl scripts to create, test and banch

Attachment: range_vs_ref_scan_with_preceding_table.tar.gz (application/gzip, text), 1.42 KiB.

[5 Aug 2008 10:08] Susanne Ebrecht
CREATE TABLE acl(id INT PRIMARY KEY, type varchar(32));

CREATE TABLE groups(id int(11) NOT NULL, domain varchar(32) default NULL, type
varchar(32) default NULL, PRIMARY KEY(id)) ENGINE=MyISAM;

INSERT INTO acl VALUES(1, 'owner'), (2, 'admincc');

DELIMITER //
create procedure fill_data()
begin
declare i integer default 1; 
declare j integer default 2;
declare k integer default 3;
declare l integer default 4;
declare t integer default 3;
insert into groups
values(i,'system','owner'),(j,'system','requester'),(k,'system','admincc'),(l,'system','cc');
set i= i+4;
set j= j+4;
set k= k+4;
set l= l+4;
while t > 0 do
insert into groups
values(i,'queue','owner'),(j,'queue','requester'),(k,'queue','admincc'),(l,'queue','cc');
set i= i+4;
set j= j+4;
set k= k+4;
set l= l+4;
set t= t-1;
end while;
set t:= 1000;
while t > 0 do
insert into groups
values(i,'ticket','owner'),(j,'ticket','requester'),(k,'ticket','admincc'),(l,'ticket','cc');
set i= i+4;
set j= j+4;
set k= k+4;
set l= l+4;
set t= t-1;
end while;
end//

DELIMITER ;
call fill_data();

CREATE INDEX groups_d ON groups(domain);
CREATE INDEX groups_t ON groups(type);
CREATE INDEX groups_dt ON groups(domain, type);
CREATE INDEX groups_td ON groups(type, domain);

EXPLAIN SELECT STRAIGHT_JOIN DISTINCT g.id FROM acl a, groups g USE INDEX(groups_t) WHERE
(g.domain='queue' OR g.domain='system') AND g.type=a.type\G

EXPLAIN SELECT STRAIGHT_JOIN DISTINCT g.id FROM acl a, groups g USE INDEX (groups_d)
WHERE (g.domain='queue' OR g.domain='system') AND g.type=a.type\G

EXPLAIN SELECT STRAIGHT_JOIN DISTINCT g.id FROM acl a, groups g USE INDEX(groups_dt)
WHERE (g.domain='queue' OR g.domain='system') AND g.type=a.type\G

EXPLAIN SELECT STRAIGHT_JOIN DISTINCT g.id FROM acl a, groups g USE INDEX(groups_td)
WHERE (g.domain='queue' OR g.domain='system') AND g.type=a.type\G

truncate acl;
truncate groups;

drop index groups_t on groups;
drop index groups_d on groups;
drop index groups_dt on groups;
drop index groups_td on groups;

alter table acl engine='innodb';
alter table groups engine='innodb';

call fill_data();

CREATE INDEX groups_d ON groups(domain);
CREATE INDEX groups_t ON groups(type);
CREATE INDEX groups_dt ON groups(domain, type);
CREATE INDEX groups_td ON groups(type, domain);

EXPLAIN SELECT STRAIGHT_JOIN DISTINCT g.id FROM acl a, groups g USE INDEX(groups_t) WHERE
(g.domain='queue' OR g.domain='system') AND g.type=a.type\G

EXPLAIN SELECT STRAIGHT_JOIN DISTINCT g.id FROM acl a, groups g USE INDEX (groups_d)
WHERE (g.domain='queue' OR g.domain='system') AND g.type=a.type\G

EXPLAIN SELECT STRAIGHT_JOIN DISTINCT g.id FROM acl a, groups g USE INDEX(groups_dt)
WHERE (g.domain='queue' OR g.domain='system') AND g.type=a.type\G

EXPLAIN SELECT STRAIGHT_JOIN DISTINCT g.id FROM acl a, groups g USE INDEX(groups_td)
WHERE (g.domain='queue' OR g.domain='system') AND g.type=a.type\G
[5 Aug 2008 10:10] Susanne Ebrecht
I can reproduce this with my tests on MySQL bzr tree for 5.0 and 5.1
[11 Sep 2008 18:13] Sergey Petrunya
> Here we have first problems:
> 1) key_len is still 35, so mysql will use only first part of the index when I think it
can
> use both conditions to do better range scan

No, at the moment range scan may not depend on another table (with exception
of references to constant tables which are constidered constants). It seems
you're asking about combining ref and range access together. We have this on
somewhere our todo but I can't tell when this will become available.

In MySQL-6.0 we have IndexConditionPushdown optimization which is expected
to rectify the situation a bit. With IndexConditionPushdown, the number of
index entries will be still the same but full table records will be
retrieved only if the whole condition is satisfied. 

> 2) Even if 1) is not possible then extra should have 'using index' as the index can be
> used to filter records. Benchmarks below show that InnoDB benefits from it a little.

It can't have "Using index" because g.id is not covered by the index. InnoDB
implicitly includes all primary key columns at the end of every index, MyISAM 
doesn't do that.

> 3) Slight shift in row estimations, worth investigation, but it's up to you to decide
if
> it's bug or not.

It's not a bug, the number is an estimate from the range optimizer estimate and some
variety 
in estimation is expected.

Desired output: range type with 4 rows and key_len is 70.

EXPLAIN SELECT STRAIGHT_JOIN DISTINCT g.id
        FROM acl a, groups g USE INDEX (groups_td)
        WHERE (g.domain = 'queue' OR g.domain = 'system') AND g.type = a.type
$VAR1 = {
          'type' => 'ref'
          'rows' => '1004',
          'key' => 'groups_td',
          'key_len' => '35',
          'Extra' => 'Using where',
        };
> This is totally wrong. It should be range access using both columns, most probably
mysql
> estimate possibility of range access by rows estimations are wrong (close to 1004 or
even
> bigger) so ref access wins.

See the above note about range access. Only ref access can be used for this query. Its
estimates are based on index cardinalities. Do you have a highly-skewed data
distribution?

Ok let's consider that ref access is correct in this case then mysql should use index's
data to filter records after scan at least, see tests for innodb below.

> Let's switch to InnoDB.

EXPLAIN SELECT STRAIGHT_JOIN DISTINCT g.id
        FROM acl a, groups g USE INDEX (groups_t)
        WHERE (g.domain = 'queue' OR g.domain = 'system') AND g.type = a.type
$VAR1 = {
          'type' => 'ref'
          'rows' => '2085',
          'key' => 'groups_t',
          'key_len' => '35',
          'Extra' => 'Using where',
        };

> Rows estimation is twice bigger than it should be, not an order, but anyway.

It's a known property of InnoDB. It shouldn't reach order-of-magnitude differences.
[11 Sep 2008 22:46] Sergey Petrunya
The issues raised in this bug are valid, but none of them is a "bug" in a sense that they
do not describe something which was expected to be implemented in released versions but
wasn't.

Changing status to ToBeFixedLater. 

Ruslan, thanks for the input.