MySQL Bugs: #24251: Non-optimal choice of indexes for queries

Bug #24251	Non-optimal choice of indexes for queries
Submitted:	13 Nov 2006 10:58	Modified:	23 Nov 2009 9:10
Reporter:	Gisbert Selke (Basic Quality Contributor)	Email Updates:
Status:	In progress	Impact on me:	None
Category:	MySQL Server: Optimizer	Severity:	S5 (Performance)
Version:	5.0.30-BK, 5.1.12	OS:	Linux (Linux, Win XP SP2)
Assigned to:	Assigned Account	CPU Architecture:	Any
Tags:	qc

Description:
I have a MyISAM table of one million rows with several indexes. The
table has been ANALYZEd so the optimizer knows everything about the
current key distribution. I want to run a query on this table with a
WHERE clause that could use either of two indexes (details below).
According to the manual, the optimizer should "normally" use the index
that leads to the lesser number of rows needing to be looked at.
However, it consistently chooses the index with the substantially
*larger* number of rows.

This is the table:
drop table if exists tracking_noid_myisam;
create table tracking_noid_myisam (
  day date not null,
  ad int not null,
  clicks int not null,
  impressions int not null,
  client int not null
) engine=myisam;
alter table tracking_noid_myisam add primary key  (day, ad);
alter table tracking_noid_myisam add index ad     (ad);
alter table tracking_noid_myisam add index client (client);
-- Notes on how the table is filled: see below.

This is the query:
select day, ad, sum(clicks), sum(impressions)
  from tracking_noid_myisam
  where client=11 and day between '2007-01-01' and '2007-01-31'
  group by day;

EXPLAIN tells us that the query will use index client, and that the
number of rows will be 36286.
If I add a "USE INDEX (PRIMARY)", it will accept this hint and tells me
that the number of rows to be looked at will be 24880. Since this is
substantially less than 36286, I would expect this index to be taken
automatically, but obviously it isn't.

Further experimentation shows that for a WHERE clause like
  where client=11 and day between '2007-01-01' and '2007-01-21'
(note: reduced date range)  the row count with "USE INDEX(PRIMARY)" will
be down to 15954, but without the USE INDEX it will still not be used,
although the row count from the index on "client" will, of course, still
be 36286.

Only when the clause looks like
  where client=11 and day between '2007-01-01' and '2007-01-20'
the optimizer will voluntarily choose the primary key, telling me that
the row count will be 15305.

To sum up: the optimizer does think that 15305<36286, but it seems to
think that 15954>=36286.

(Your actual figures may vary somewhat, because the table filling
process has a small degree of randomness.)

How to repeat:
use test;
create table numbers (i int not null primary key);

drop procedure if exists numberfiller;
delimiter //
create procedure numberfiller(in n int)
begin
  declare i int;
  set i=1;
  truncate numbers;
  while i<=n do
    insert into test.numbers values(i);
    set i=i+1;
  end while;
end;
//
delimiter ;

call numberfiller(1000);

drop table if exists tracking_noid_myisam;
create table tracking_noid_myisam (
  day date not null,
  ad int not null,
  clicks int not null,
  impressions int not null,
  client int not null
) engine=myisam;

set @maxn = 1000;
insert into tracking_noid_myisam(day,ad,clicks,impressions,client)
  select date_add('2006-01-01', interval a.i-1 day),
         b.i,
         floor(rand()*100),
         floor(rand()*100),
         floor(rand()*20)
    from numbers as a join numbers as b
    where a.i<=@maxn and b.i<=@maxn;

alter table tracking_noid_myisam add primary key  (day, ad);
alter table tracking_noid_myisam add index ad     (ad);
alter table tracking_noid_myisam add index client (client);

analyze table tracking_noid_myisam;

explain extended
select day, ad, sum(clicks), sum(impressions)
  from tracking_noid_myisam
  where client=11 and day between '2007-01-01' and '2007-01-31'
  group by day\G

explain extended
select day, ad, sum(clicks), sum(impressions)
  from tracking_noid_myisam use index(primary)
  where client=11 and day between '2007-01-01' and '2007-01-31'
  group by day\G

Thank you for a bug report. Verified just as described, also - with 5.0.30-BK on Linux. The problem is not only in "incorrect" estimation of cost based on of rows, but plan with USE INDEX(primary) runs faster by wall-clock time. (.19-.21 sec vs .28-.29 sec without hint, in my case).

In general a plan that uses 'ref' access plan selecting N rows can be cheaper than
a plan using 'range' access and fetching less than N rows.
The fact is that at the 'ref' access, if it uses all index components, rows are fetched in the order of their ROWIDs. Such fetching is cheaper than random access of rows by range index scan.

However some real problem can be demonstrated with the reported case: the impact of sorting is not always taken into account correctly by the optimizer. 
This will be fixed in 5.1 because the fix affects the choice of execution plans.

Well, I cannot comment on ref vs. range accesses, because I lack the deep knowledge. However, note Valeriy's comment of 14 Nov that with the hint USE INDEX(primary) the actual running time is only about 2/3 of the time that would follow from the plan chosen by the optimizer.