Description:
I have a MyISAM table of one million rows with several indexes. The
table has been ANALYZEd so the optimizer knows everything about the
current key distribution. I want to run a query on this table with a
WHERE clause that could use either of two indexes (details below).
According to the manual, the optimizer should "normally" use the index
that leads to the lesser number of rows needing to be looked at.
However, it consistently chooses the index with the substantially
*larger* number of rows.
This is the table:
drop table if exists tracking_noid_myisam;
create table tracking_noid_myisam (
day date not null,
ad int not null,
clicks int not null,
impressions int not null,
client int not null
) engine=myisam;
alter table tracking_noid_myisam add primary key (day, ad);
alter table tracking_noid_myisam add index ad (ad);
alter table tracking_noid_myisam add index client (client);
-- Notes on how the table is filled: see below.
This is the query:
select day, ad, sum(clicks), sum(impressions)
from tracking_noid_myisam
where client=11 and day between '2007-01-01' and '2007-01-31'
group by day;
EXPLAIN tells us that the query will use index client, and that the
number of rows will be 36286.
If I add a "USE INDEX (PRIMARY)", it will accept this hint and tells me
that the number of rows to be looked at will be 24880. Since this is
substantially less than 36286, I would expect this index to be taken
automatically, but obviously it isn't.
Further experimentation shows that for a WHERE clause like
where client=11 and day between '2007-01-01' and '2007-01-21'
(note: reduced date range) the row count with "USE INDEX(PRIMARY)" will
be down to 15954, but without the USE INDEX it will still not be used,
although the row count from the index on "client" will, of course, still
be 36286.
Only when the clause looks like
where client=11 and day between '2007-01-01' and '2007-01-20'
the optimizer will voluntarily choose the primary key, telling me that
the row count will be 15305.
To sum up: the optimizer does think that 15305<36286, but it seems to
think that 15954>=36286.
(Your actual figures may vary somewhat, because the table filling
process has a small degree of randomness.)
How to repeat:
use test;
create table numbers (i int not null primary key);
drop procedure if exists numberfiller;
delimiter //
create procedure numberfiller(in n int)
begin
declare i int;
set i=1;
truncate numbers;
while i<=n do
insert into test.numbers values(i);
set i=i+1;
end while;
end;
//
delimiter ;
call numberfiller(1000);
drop table if exists tracking_noid_myisam;
create table tracking_noid_myisam (
day date not null,
ad int not null,
clicks int not null,
impressions int not null,
client int not null
) engine=myisam;
set @maxn = 1000;
insert into tracking_noid_myisam(day,ad,clicks,impressions,client)
select date_add('2006-01-01', interval a.i-1 day),
b.i,
floor(rand()*100),
floor(rand()*100),
floor(rand()*20)
from numbers as a join numbers as b
where a.i<=@maxn and b.i<=@maxn;
alter table tracking_noid_myisam add primary key (day, ad);
alter table tracking_noid_myisam add index ad (ad);
alter table tracking_noid_myisam add index client (client);
analyze table tracking_noid_myisam;
explain extended
select day, ad, sum(clicks), sum(impressions)
from tracking_noid_myisam
where client=11 and day between '2007-01-01' and '2007-01-31'
group by day\G
explain extended
select day, ad, sum(clicks), sum(impressions)
from tracking_noid_myisam use index(primary)
where client=11 and day between '2007-01-01' and '2007-01-31'
group by day\G