Bug #6274 SELECT DISTINCT on a column with few distinct values isn't optimized
Submitted: 27 Oct 2004 3:49 Modified: 8 Feb 2006 12:41
Reporter: Matthias Urlichs Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server Severity:S4 (Feature request)
Version:- OS:-
Assigned to: CPU Architecture:Any

[27 Oct 2004 3:49] Matthias Urlichs
Description:
SELECT DISTINCT(foo) on an indexed column is very slow if the column contains many rows but only a few distinct values.

This is a real-world problem. We noticed it because analyzing our SNMP trap logging table, which is indexed by the source IP address and a few other fields, became very slow.

How to repeat:
create table test(foo char(20) not null, key(foo));

repeat 100000 times:
    insert into test values('one'),('two'),('three'),('four'),('five'),('six'),('seven'),('eight'),('nine');

SELECT DISTINCT(foo) FROM test;

The latter select takes a long time to complete. A loop in the client, however,

$val = SELECT MIN(foo) FROM test
while $val:
    SELECT MIN(foo) FROM test WHERE foo > $val

completes instantly.

Suggested fix:
Teach the index scan code to recognize this situation and switch algorithms.

Ideally, it should be able to do the same thing with this test case, where only the leading part of the index is static:

create table test(foo char(20) not null, bar int not null, baz int not null, key(foo,bar,baz));
repeat 100000 times:
    insert into test values('one',RAND(),RAND()),('two',RAND(),RAND()),('three',RAND(),RAND()),('four',RAND(),RAND()),('five',RAND(),RAND()),('six',RAND(),RAND()),('seven',RAND(),RAND()),('eight',RAND(),RAND()),('nine',RAND(),RAND());
select distinct(foo) from test;

though requiring a seperate index on just foo would work for us.
[8 Feb 2006 12:41] Valeriy Kravchuk
Looks like the appropriate optimization is already implemented in 5.0.x:

mysql> explain select distinct(foo) from test\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: test
         type: range
possible_keys: NULL
          key: foo
      key_len: 20
          ref: NULL
         rows: 10
        Extra: Using index for group-by
1 row in set (0.01 sec)

Please, read the manual (http://dev.mysql.com/doc/refman/5.0/en/explain.html):

"- Using index for group-by

Similar to the Using index way of accessing a table, Using index for group-by indicates that MySQL found an index that can be used to retrieve all columns of a GROUP BY or DISTINCT query without any extra disk access to the actual table. Additionally, the index is used in the most efficient way so that for each group, only a few index entries are read. For details, see Section 7.2.13, “GROUP BY Optimization”."