MySQL Bugs: #21849: select count distinct has rotten performance on at least versions 4&5 of mysql.

Bug #21849	select count distinct has rotten performance on at least versions 4&5 of mysql.
Submitted:	26 Aug 2006 10:07	Modified:	7 Apr 2010 14:54
Reporter:	Galt Barber	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Optimizer	Severity:	S5 (Performance)
Version:	5.0.32-BK, 4 and 5, 5.1.42-bzr	OS:	Linux (linux)
Assigned to:	Georgi Kodinov	CPU Architecture:	Any
Tags:	count, distinct, SELECT

Description:

This is very slow in at least versions 4 and 5 of mysql.

Select count(distinct somefield) from sometable 

But this is quite speedy:

select count(*) from (select distinct somefield from sometable group by somefield) as somelabel;

I don't know why count distinct is so slow when distinct by itself is not slow.

It's sad when you see users having to resort to this:
mysql -BN -e 'select field from table' | sort -u | wc -l

Frankly, if you just did that it would make mysql
look smart instead of dumb and slow.
It took 15 minutes for select count(distinct field) on one table we have versus 30 seconds using the simple external unix commandline tools.
If I use the above workaround with subquery on mysql5,
it's only about 15 seconds.

How to repeat:
Create a table with a few million rows and a field called name varchar(255).
Nearly all the values in the name field are less than 20 in length.
Index the name field.

Suggested fix:
For at least the basic case of 

select count(distinct somefield) from sometable;

you could simply rewrite it with this sql code:

select count(*) from (select distinct somefield from sometable group by somefield) as somelabel;

Thank you for a problem report. Please, send SHOW CREATE TABLE and SHOW TABLE STATUS results for sometable that demonstrates the problem described.

I just tested with my chess database with + 4 mio. rows on 'localhost' (Windows) with a 1 Ghz PC and MySQL 5.0.24.

/*four queries below that return the same value: '4150229'*/

/* 1 */ select count(*) from game as no_s;
/* 2 */ select count(id) from game as no_s;
/*these two are both instantaneous*/

/*however using the DISTINCT clause slows down things a lot - approx 30 seconds for this one: */
/* 3 */ select count(distinct id) from game as no_s;

/*using a subquery is better here too for some reason */
/* 4 */ select count(*) from (select distinct id from game) as no_s;
/*takes about 10 seconds*/

BTW: it is a little silly here that /* 3 */ takes that amount of time.  As 'id' is the PK the DISTINCT clause could safely be ignored by the optimizer!  That consideration would also apply to Unique Keys.

The Schema in this case:

CREATE TABLE `Game` (                             
          `Id` int(11) NOT NULL,                          
          `CId` int(11) NOT NULL,                         
          `Idx` int(11) NOT NULL,                         
          `Attributes` smallint(6) NOT NULL default '0',  
          `OCId` int(11) default NULL,                    
          `OIdx` int(11) default NULL,                    
          `PlyCount` int(11) NOT NULL default '0',        
          `Result` tinyint(4) NOT NULL default '0',       
          `WhiteId` int(11) NOT NULL,                     
          `BlackId` int(11) NOT NULL,                     
          `WhiteELO` int(11) default NULL,                
          `BlackELO` int(11) default NULL,                
          `EventId` int(11) NOT NULL,                     
          `SiteId` int(11) NOT NULL,                      
          `GameDate` date default NULL,                   
          `EventDate` date default NULL,                  
          `DateFlags` smallint(6) NOT NULL default '0',   
          `OpeningId` int(11) NOT NULL,                   
          `ECO` char(3) default NULL,                     
          `AnnotatorId` int(11) NOT NULL,                 
          PRIMARY KEY  (`Id`),                            
          KEY `Game_1` (`Idx`),                           
          KEY `Game_2` (`Attributes`),                    
          KEY `Game_3` (`Result`),                        
          KEY `Game_4` (`WhiteId`),                       
          KEY `Game_5` (`BlackId`),                       
          KEY `Game_6` (`WhiteELO`),                      
          KEY `Game_7` (`BlackELO`),                      
          KEY `Game_8` (`EventId`),                       
          KEY `Game_9` (`SiteId`),                        
          KEY `Game_10` (`GameDate`),                     
          KEY `Game_11` (`EventDate`),                    
          KEY `Game_12` (`OpeningId`),                    
          KEY `Game_13` (`ECO`),                          
          KEY `Game_14` (`AnnotatorId`),                  
          KEY `Game_15` (`CId`,`Id`),                     
          KEY `Game_16` (`CId`,`Idx`,`Id`)                
        ) ENGINE=MyISAM DEFAULT CHARSET=utf8              

.. it is a table from the 'jose' chess program available from MySQLForge.  This program skips with a database of about 1.8 mio rows/games.

It 'ships' - not 'skips' ....

CREATE TABLE `snp` (
  `bin` smallint(5) unsigned NOT NULL default '0',
  `chrom` enum('unknown','chr','chr1','chr1_random','chr2','chr2_random','chr3','chr3_random','chr4','chr4_random','chr5','chr5_random','chr6','chr6_random','chr7','chr7_random','chr8','chr8_random','chr9','chr9_random','chr10','chr10_random','chr11','chr11_random','chr12','chr12_random','chr13','chr13_random','chr14','chr14_random','chr15','chr15_random','chr16','chr16_random','chr17','chr17_random','chr18','chr18_random','chr19','chr19_random','chr20','chr20_random','chr21','chr21_random','chr22','chr22_random','chr23','chr23_random','chr24','chr24_random','chr25','chr25_random','chr26','chr26_random','chr27','chr27_random','chr28','chr28_random','chr29','chr29_random','chr30','chr30_random','chr31','chr31_random','chr32','chr32_random','chr33','chr33_random','chr34','chr34_random','chr35','chr35_random','chr36','chr36_random','chr37','chr37_random','chr38','chr38_random','chr6_hla_hap1','chr6_hla_hap2','chrFinished','chr2L','chr2L_random','chr2R','chr2R_random','chr2h','chr2h_random','chr3L','chr3L_random','chr3R','chr3R_random','chr3h','chr3h_random','chrI','chrII','chrIII','chrIV','chrM','chrM_random','chrNA','chrNA_random','chrU','chrU_random','chrUn','chrUn_random','chrV','chrV_random','chrW','chrW_random','chrX','chrX_random','chrXh','chrXh_random','chrY','chrY_random','chrYh','chrYh_random','chrZ','chrZ_random') NOT NULL default 'unknown',
  `chromStart` int(10) unsigned NOT NULL default '0',
  `chromEnd` int(10) unsigned NOT NULL default '0',
  `name` varchar(255) NOT NULL default '',
  `score` int(10) unsigned NOT NULL default '0',
  `strand` enum('?','+','-') NOT NULL default '?',
  `observed` varchar(255) NOT NULL default '',
  `molType` enum('unknown','genomic','cDNA','mito','chloro') NOT NULL default 'unknown',
  `class` enum('unknown','snp','in-del','het','microsat','named','no-variation','mixed','mnp') NOT NULL default 'unknown',
  `valid` set('unknown','other-pop','by-frequency','by-cluster','by-2hit-2allele','by-hapmap','genotype') NOT NULL default 'unknown',
  `avHet` float NOT NULL default '0',
  `avHetSE` float NOT NULL default '0',
  `func` set('unknown','locus-region','coding','coding-synon','coding-nonsynon','mrna-utr','intron','splice-site','reference','exception') NOT NULL default 'unknown',
  `locType` enum('unknown','range','exact','between') NOT NULL default 'unknown',
  `source` enum('unknown','dbSnp','Affy10K','Affy10Kv2','Affy50K_HindIII','Affy50K_XbaI') NOT NULL default 'unknown',
  `exception` set('1','2','3','4','5','6','7','8','9','10','11','12','13','14','15','16','17','18','19','20','21','22','23','24') NOT NULL default '',  KEY `chrom` (`chrom`,`bin`),
  KEY `chromStart` (`chrom`,`chromStart`),
  KEY `chromEnd` (`chrom`,`chromEnd`),
  KEY `name` (`name`)
) TYPE=MyISAM;

----------------

mysql> show table status like 'snp'\G
*************************** 1. row ***************************
           Name: snp
           Type: MyISAM
     Row_format: Dynamic
           Rows: 8859645
 Avg_row_length: 42
    Data_length: 403807920
Max_data_length: 4294967295
   Index_length: 396961792
      Data_free: 23422220
 Auto_increment: NULL
    Create_time: 2005-02-24 23:59:23
    Update_time: 2005-03-07 11:18:58
     Check_time: 2005-02-25 09:40:41
 Create_options:
        Comment:
1 row in set (0.01 sec)

------------------------------

p.s. you can probably even use this even simpler substition
for select count distinct somefield:

select count(0) from (select 0 from sometable group by somefield) as x;
select count(0) from (select 0 from snp group by name) as x;

mysql> select count(0) from (select 0 from snp group by name) as x;
+----------+
| count(0) |
+----------+
|  8485546 |
+----------+
1 row in set (8.83 sec)

mysql> select count(*) from snp;
+----------+
| count(*) |
+----------+
|  8859645 |
+----------+
1 row in set (0.00 sec)

------------------------------------

I found that this bug was reported clear over a year ago in 2005.
It's long overdue for a fix now.

-Galt

I added the info you wanted as a comment.

Sorry, but I was not able to repeat the behaviour described with your table and approx. 1 million of (random) rows, on 5.0.26-BK running on Linux:

openxs@suse:~/dbs/5.0> bin/mysql -uroot test
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 6 to server version: 5.0.26-debug

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

mysql> CREATE TABLE `snp` (
    ->   `bin` smallint(5) unsigned NOT NULL default '0',
    ->   `chrom`
    -> enum('unknown','chr','chr1','chr1_random','chr2','chr2_random','chr3','c
hr3_rand
    '> om','chr4','chr4_random','chr5','chr5_random','chr6','chr6_random','chr7
','chr7_
    '> random','chr8','chr8_random','chr9','chr9_random','chr10','chr10_random'
,'chr11'
    -> ,'chr11_random','chr12','chr12_random','chr13','chr13_random','chr14','c
hr14_ran
    '> dom','chr15','chr15_random','chr16','chr16_random','chr17','chr17_random
','chr18
    '> ','chr18_random','chr19','chr19_random','chr20','chr20_random','chr21','
chr21_ra
    '> ndom','chr22','chr22_random','chr23','chr23_random','chr24','chr24_rando
m','chr2
    '> 5','chr25_random','chr26','chr26_random','chr27','chr27_random','chr28',
'chr28_r
    '> andom','chr29','chr29_random','chr30','chr30_random','chr31','chr31_rand
om','chr
    '> 32','chr32_random','chr33','chr33_random','chr34','chr34_random','chr35'
,'chr35_
    '> random','chr36','chr36_random','chr37','chr37_random','chr38','chr38_ran
dom','ch
    '> r6_hla_hap1','chr6_hla_hap2','chrFinished','chr2L','chr2L_random','chr2R
','chr2R
    '> _random','chr2h','chr2h_random','chr3L','chr3L_random','chr3R','chr3R_ra
ndom','c
    '> hr3h','chr3h_random','chrI','chrII','chrIII','chrIV','chrM','chrM_random
','chrNA
    '> ','chrNA_random','chrU','chrU_random','chrUn','chrUn_random','chrV','chr
V_random
    '> ','chrW','chrW_random','chrX','chrX_random','chrXh','chrXh_random','chrY
','chrY_
    '> random','chrYh','chrYh_random','chrZ','chrZ_random') NOT NULL default
    -> 'unknown',
    ->   `chromStart` int(10) unsigned NOT NULL default '0',
    ->   `chromEnd` int(10) unsigned NOT NULL default '0',
    ->   `name` varchar(255) NOT NULL default '',
    ->   `score` int(10) unsigned NOT NULL default '0',
    ->   `strand` enum('?','+','-') NOT NULL default '?',
    ->   `observed` varchar(255) NOT NULL default '',
    ->   `molType` enum('unknown','genomic','cDNA','mito','chloro') NOT NULL de
fault
    -> 'unknown',
    ->   `class`
    -> enum('unknown','snp','in-del','het','microsat','named','no-variation','m
ixed','m
    '> np') NOT NULL default 'unknown',
    ->   `valid`
    -> set('unknown','other-pop','by-frequency','by-cluster','by-2hit-2allele',
'by-hapm
    '> ap','genotype') NOT NULL default 'unknown',
    ->   `avHet` float NOT NULL default '0',
    ->   `avHetSE` float NOT NULL default '0',
    ->   `func`
    -> set('unknown','locus-region','coding','coding-synon','coding-nonsynon','
mrna-utr
    '> ','intron','splice-site','reference','exception') NOT NULL default 'unkn
own',
    ->   `locType` enum('unknown','range','exact','between') NOT NULL default
    -> 'unknown',
    ->   `source`
    -> enum('unknown','dbSnp','Affy10K','Affy10Kv2','Affy50K_HindIII','Affy50K_
XbaI')
    -> NOT NULL default 'unknown',
    ->   `exception`
    -> set('1','2','3','4','5','6','7','8','9','10','11','12','13','14','15','1
6','17',
    -> '18','19','20','21','22','23','24') NOT NULL default '',  KEY `chrom`
    -> (`chrom`,`bin`),
    ->   KEY `chromStart` (`chrom`,`chromStart`),
    ->   KEY `chromEnd` (`chrom`,`chromEnd`),
    ->   KEY `name` (`name`)
    -> ) TYPE=MyISAM;
Query OK, 0 rows affected, 1 warning (0.02 sec)

mysql> insert into snp(name) values(rand());
Query OK, 1 row affected (0.01 sec)

mysql> insert into snp(name) select rand() from snp;
Query OK, 1 row affected (0.01 sec)
Records: 1  Duplicates: 0  Warnings: 0

mysql> insert into snp(name) select rand() from snp;
Query OK, 2 rows affected (0.00 sec)
Records: 2  Duplicates: 0  Warnings: 0

...

mysql> insert into snp(name) select rand() from snp;
Query OK, 524288 rows affected (1 min 43.68 sec)
Records: 524288  Duplicates: 0  Warnings: 0

mysql> select count(distinct name) from snp;
+----------------------+
| count(distinct name) |
+----------------------+
|              1010548 |
+----------------------+
1 row in set (2 min 6.35 sec)

mysql> select count(*) from (select distinct name from snp group by name) as t1
;
+----------+
| count(*) |
+----------+
|  1010548 |
+----------+
1 row in set (2 min 10.63 sec)

mysql> explain select count(distinct name) from snp; name from
+----+-------------+-------+-------+---------------+------+---------+------+----
-----+-------------+
| id | select_type | table | type  | possible_keys | key  | key_len | ref  | row
s    | Extra       |
+----+-------------+-------+-------+---------------+------+---------+------+----
-----+-------------+
|  1 | SIMPLE      | snp   | index | NULL          | name | 257     | NULL | 104
8576 | Using index |
+----+-------------+-------+-------+---------------+------+---------+------+----
-----+-------------+
1 row in set (0.01 sec)

mysql> explain select count(*) from (select distinct name from snp group by nam
e) as t1;
+----+-------------+-------+-------+---------------+------+---------+------+----
-----+------------------------------+
| id | select_type | table | type  | possible_keys | key  | key_len | ref  | row
s    | Extra                        |
+----+-------------+-------+-------+---------------+------+---------+------+----
-----+------------------------------+
|  1 | PRIMARY     | NULL  | NULL  | NULL          | NULL | NULL    | NULL |
NULL | Select tables optimized away |
|  2 | DERIVED     | snp   | index | NULL          | name | 257     | NULL | 104
8576 | Using index                  |
+----+-------------+-------+-------+---------------+------+---------+------+----
-----+------------------------------+
2 rows in set (2 min 12.00 sec)

In both cases we see "index" as access method, so index is full-scanned.

So, please, send your my.cnf content and a smallest set of data for the table that demonstate the behaviour described.

Galt, if you can reproduce this with the below, please reopen the bug (because I cannot re-open it), so I do not need to report a duplicate

the following testscript, reproduces this for me, on freebsd with 4.1.20 / 4.1.21 and 5.0.22 ( build from ports, with linux-thread )

the select count( distinct a) takes more than twice the time. The situation gets worse, if there are more unique values (increase the number in the per from 2000 to 10000.

create table distq (a varchar(20));

perl -wle '$a="12221232113422"; for (1..2000){$b=""; for(1..100){$b.="($a),";$a++;} $b=~s/,$//;`mysql -u root xxx -e "insert into distq values $b"`;  print $_ if $_=~/00$/;}'

insert into distq select * from distq;
insert into distq select * from distq; 
insert into distq select * from distq; 
alter table distq add index (a);

select count(*) from distq;
reset query cache; select count(distinct a) from distq;
reset query cache; select count(*) from (select distinct a from distq) x;

drop table distq;

select count(*) from distq;
+----------+
| count(*) |
+----------+
|  1600000 |
+----------+
1 row in set (0.00 sec)

reset query cache; select count(distinct a) from distq;
+-------------------+
| count(distinct a) |
+-------------------+
|            200000 |
+-------------------+
1 row in set (5.26 sec)

reset query cache; select count(*) from (select distinct a from distq) x;
+----------+
| count(*) |
+----------+
|   200000 |
+----------+
1 row in set (1.90 sec)

my.cnf
[mysqld]
server-id=55
skip-name-resolve
skip-show-database
skip-external-locking
set-variable =  query_cache_wlock_invalidate=1

sorry one more comment,
my test script is probably not needed,
The sql Valeriy Kravchuk used, causes the issue on my box too.

I found that this makes a noticable difference:

 set max_heap_table_size = 3721809920; # faster count(distinct )
 set max_heap_table_size = 16777216; # slower

just in case, all remaining variables (will attach)

Martin,

Thank you for the hint. I'll re-check with increased max_heap_table_size. But looks like the reason is more understandable now...

I've seen simulair issues with a query that does this 

SELECT portal_code as portal,
        site_code as site,
        page_name as page,
        page_id as page_id,
        DATE_FORMAT(datetime, '%Y-%m-%d') as logdate ,
        count(*) as hits,
        count(distinct msisdn) as msisdn
        FROM
        viewlog
        WHERE
        datetime >? AND
        datetime < ?
        GROUP BY
        portal_code,
        site_code,
        page_name,
        page_id,
        DATE_FORMAT(datetime, '%Y-%m-%d')
        ORDER BY
        DATE_FORMAT(datetime, '%Y-%m-%d')};

the table viewlog is having > 1 mil. rows for the where clause.

I tracked it down to a mass write i/o operation to the mysqltmp partition, 
with iostat -x 3 I measured 110% disk write utilisation for the time of 
the query - basicly shutting down our database system for the runtime of the query where we do not have the luxery of a dedicated raid system for mysqltmp. 

I tried to increase the sort_buffer_site to 4GB (we have 8GB in the machine),
but that does not help (also we run a 32bit os so higher values are pointless
as mysqld cannot allocate more memory).

As workaround we removed the count(distinct msisdn) as msisdn from the query
and solved in in application sorting. That made it cpu bound (better to control with nice) and improved performance with 50%.

I found some more info (which I believ to be interesting), I forced all temp tables to disk by "set max_heap_table_size = 0;" (it does set it to 16k, seems the minimum).

on the example from the initial test:

select sum(length(name)) from snp ; # about 14.4 MB
select sum(length(name)) from (select distinct name from snp group by name) as t1; # about 14.3 MB

select count(distinct name) from snp;
# creates a temp table on disk, with an index of about 15 MB

select count(*) from (select distinct name from snp group by name) as t1;
# creates a temp table on disk with a dada file of over 50 MB ???
(can any one verify that?; if yes, this may be a seperate bug)

create table snp1 select distinct name from snp group by name;
# creates a datafile, of about 15 MB

This makes me believe this is not a bug, or a dublicate of bug #22487

on the count(distinct name) mysql creates a new index. I believe this is correct, because mysql has to read all values from the original table (even if it does an index read), and then remove duplicate values.
(there could be an optimization issue, for a very few cases, where a count(distinct x), could use an existing index...)

The 2nd example (with subquery), can use the original existing query to retrieve only unique indexes. this makes it much faster (depending on how much time a system spends on creating a new unique index).
However the 2nd query can loose time on writing a far to large data file (at least on my box), therefore depending on tyour disk speed, it may be as slow as the 1st, or it may be faster, or even slower....

If my assumptions are correct, this is a duplicate of bug #22487

Can anyone check, if they experience the same oversized temp table, if running the subquery? If so, I can report that in a separate bug. Thanks

testing on kkr6u00:

create table galtSnp (name varchar(255));
load data local infile 'hg16.snp.name.dump' into table galtSnp;

(The data is 91MB in 8.8 million rows)

(NOTE: it has one extra header row "name" 
because I forgot to add -BN to dump command)

mysql> select SQL_NO_CACHE count(distinct name) from galtSnp;
+----------------------+
| count(distinct name) |
+----------------------+
|              8485547 |
+----------------------+
1 row in set (1 min 31.01 sec)

mysql> select * from galtSnp limit 5;
+------------+
| name       |
+------------+
| name       |
| rs10       |
| rs1000     |
| rs1000000  |
| rs10000000 |
+------------+
5 rows in set (0.00 sec)

NOTE: first we will do it with NO INDEX.

mysql> select SQL_NO_CACHE count(distinct name) from galtSnp;
+----------------------+
| count(distinct name) |
+----------------------+
|              8485547 |
+----------------------+
1 row in set (38.14 sec)

mysql> select SQL_NO_CACHE count(0) from (select 0 from galtSnp group by name) a
s x;
ERROR 3 (HY000): Error writing file '/tmp/MYXjXmp8' (Errcode: 28)
mysql> show warnings;
+-------+------+--------------------------------------------------+
| Level | Code | Message                                          |
+-------+------+--------------------------------------------------+
| Error |    3 | Error writing file '/tmp/MYXjXmp8' (Errcode: 28) |
| Error | 1028 | Sort aborted                                     |
| Error | 1105 | Unknown error                                    |
+-------+------+--------------------------------------------------+
3 rows in set (0.03 sec)

freespace on /tmp before query:
[kkr6u00.kilokluster.ucsc.edu:/tmp> df . -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda3             4.9G  1.1G  3.6G  23% /

As it fails:
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda3             4.9G  4.6G     0 100% /

This is some bug.

NOTE: Now, we will do it with an INDEX on name.

create index name on galtSnp(name);

mysql> select SQL_NO_CACHE count(0) from (select 0 from galtSnp group by name) as x;
+----------+
| count(0) |
+----------+
|  8485547 |
+----------+
1 row in set (8.50 sec)

So, clearly it is heavily index dependent.

------------
Now I'll have with and without index versions:

create table galtSnp2 as select * from galtSnp;

Q: Does it make any difference to include actual name?

select SQL_NO_CACHE count(name) from (select name from galtSnp group by name) as x;
1 row in set (27.81 sec)
A1: yes, but still consumes about 2.1 GB temp space and runs slower
than the count(0) version.

select SQL_NO_CACHE count(name) from (select name from galtSnp2 group by name) as x;
A2: no, without index, it still bombs out of space, so using name instead
of 0 made no difference.

[kkr6u00.kilokluster.ucsc.edu:/tmp> df . -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda3             4.9G  4.6G     0 100% /

THIS IS SOME OTHER BUG!

-----------

Basic stats for comparison:

(INDEX)
mysql> select SQL_NO_CACHE count(distinct(name)) from galtSnp;
+-----------------------+
| count(distinct(name)) |
+-----------------------+
|               8485547 |
+-----------------------+
1 row in set (46.45 sec)

(NO INDEX)
mysql> select SQL_NO_CACHE count(distinct(name)) from galtSnp2;
+-----------------------+
| count(distinct(name)) |
+-----------------------+
|               8485547 |
+-----------------------+
1 row in set (2 min 19.63 sec)

Here's the command-line utility piping version:
[kkr6u00.kilokluster.ucsc.edu:~> date;hgsql hg16 -e 'select name from galtSnp2' | sort -u | wc -l;date
Mon Oct  2 11:51:01 PDT 2006
8485547
Mon Oct  2 11:51:17 PDT 2006

That's only 16 seconds!
If mysql can't beat this, they are doing something wrong.

This command-line piping version uses no index, 
yet it is 3x as fast as mysql5 with an index, 
and more than 8x faster than mysql5 without an index.

So, it is true that my fix would only make things better
in mysql 5.11 if the field for the select count(distinct field)
has an index.  Otherwise, due to some bug, mysql will 
consume unknown gigs and probably die.

I have reproduced the disk space usage, and added in Bug #22905

Martin, I was able to repeat the behaviour you described with 5.0.32-BK on Linux:

openxs@suse:~/dbs/5.0> bin/mysql -uroot test
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 21
Server version: 5.0.32-debug Source distribution

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

mysql> create table distq (a varchar(20));
Query OK, 0 rows affected (0.01 sec)

mysql> exit
Bye
openxs@suse:~/dbs/5.0> perl -wle '$a="12221232113422"; for (1..2000){$b="";
> for(1..100){$b.="($a),";$a++;} $b=~s/,$//;`bin/mysql -u root test -e "insert
into distq values $b"`;  print $_ if $_=~/00$/;}'
100
200
300
400
500
600
700
800
900
1000
1100
1200
1300
1400
1500
1600
1700
1800
1900
2000
openxs@suse:~/dbs/5.0> bin/mysql -uroot test     422";
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 2022
Server version: 5.0.32-debug Source distribution

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

mysql> insert into distq select * from distq;
Query OK, 200000 rows affected (0.45 sec)
Records: 200000  Duplicates: 0  Warnings: 0

mysql> insert into distq select * from distq;
Query OK, 400000 rows affected (0.87 sec)
Records: 400000  Duplicates: 0  Warnings: 0

mysql> insert into distq select * from distq;
Query OK, 800000 rows affected (2.09 sec)
Records: 800000  Duplicates: 0  Warnings: 0

mysql> alter table distq add index (a);
Query OK, 1600000 rows affected (22.35 sec)
Records: 1600000  Duplicates: 0  Warnings: 0

mysql> select count(*) from distq;
+----------+
| count(*) |
+----------+
|  1600000 |
+----------+
1 row in set (0.01 sec)

But note that count(*) just means "number of rows", and for MyISAM table it is known.

mysql> reset query cache; select count(distinct a) from distq;
Query OK, 0 rows affected (0.00 sec)

+-------------------+
| count(distinct a) |
+-------------------+
|            200000 |
+-------------------+
1 row in set (9.00 sec)

mysql> reset query cache; select count(*) from (select distinct a from distq) x
;
Query OK, 0 rows affected (0.01 sec)

+----------+
| count(*) |
+----------+
|   200000 |
+----------+
1 row in set (2.96 sec)

While counting distinct values of non-unique column(!), that allows NULL values, requires index scan. Look at the explain results:

mysql> explain select count(*) from distq\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: NULL
         type: NULL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: NULL
        Extra: Select tables optimized away
1 row in set (0.00 sec)

mysql> explain select count(distinct a) from distq\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: distq
         type: index
possible_keys: NULL
          key: a
      key_len: 23
          ref: NULL
         rows: 1600000
        Extra: Using index
1 row in set (0.00 sec)

mysql> explain select count(*) from (select distinct a from distq) x\G
*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: NULL
         type: NULL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: NULL
        Extra: Select tables optimized away
*************************** 2. row ***************************
           id: 2
  select_type: DERIVED
        table: distq
         type: index
possible_keys: NULL
          key: a
      key_len: 23
          ref: NULL
         rows: 1600000
        Extra: Using index
2 rows in set (2.95 sec)

The only "possible bug" I can see here is why:

mysql> select count(distinct a) from distq;
+-------------------+
| count(distinct a) |
+-------------------+
|            200000 |
+-------------------+
1 row in set (8.78 sec)

mysql> select count(distinct a) from distq;
+-------------------+
| count(distinct a) |
+-------------------+
|            200000 |
+-------------------+
1 row in set (8.71 sec)

is constantly(!) slower than

mysql> select count(*) from (select distinct a from distq) x;
+----------+
| count(*) |
+----------+
|   200000 |
+----------+
1 row in set (2.97 sec)

If this is inavidable and reasonable, optimizer should just rewrite queries to "faster" ones.

Yes the problem is the count(distinct)  VS the count-from-distinct-subquery

It may turn out hard to optimize, in cases the query does not stand alone.

My understanding is:
the subquery version, scans an existing index (index read) for unique values. this is very quick.

The count(distinct) reads the table (full table scan, even if it is from the index / a full index scan), and builds its own index for the count distinct.

In this stand-alone version of count distinct, this is obviously not very sensitive, but if we have a select a, count(distinct b) from x group by a; then mysql may already use the index on a, and cannot use an existing index for b.  so the index must be build.

2 thinks came to my mind: (both may be individual feature request, but I put them down here, and let you decide:

1)
count distinct could use existing indexes, if no other index is used, OR if the field(s) in question are on the current used index (combined key), and following the used fields immediately:
select a, count(distinct b) from x group by a; # with an index k1 (a,b)
could use k1 and scan the unique entries for b, on each value for a; or use an optimized count-distinct, which knows the table is read "order by a" for each count-distinct (so only the last value needs to be remembered, and  the counter increases on change)

This will help some but not all cases

2)
where a new index gets created. I started looking through the code, but did not get to finish this, so the below may be wrong.

It appears that a unique index will be build for each count(distinct). on building unique indexes, mysql does a read access before each write (so no roll-back will be needed on duplicate values). This is important, if the index file contains more than one index. I am not sure if it could be avoided, in this special case, where (on the tmp table for the count-distinct) only one index exists (I believe the index-write returns an error on dups anyway?)

The difference in performance between (among others)
SELECT COUNT(*) FROM (SELECT DISTINCT b FROM t1 GROUP BY a) dt1
and 
SELECT COUNT(DISTINCT b) FROM t1 GROUP BY a
for 
CREATE TABLE t1 (a INTEGER, b INTEGER, KEY (a,b));

comes (in 5.0 and later) from the fact that loose index scan can be used in the subquery case (see http://dev.mysql.com/doc/refman/5.1/en/loose-index-scan.html).

However currently loose index scan will not be used for the COUNT(DISTINCT) 
query above (although it is an equivalent of the subquery statement).
We are already aware of that limitation and are working towards lifting it. See http://forge.mysql.com/worklog/task.php?id=3220 for more details.

FYI, im running mysql 4 on Centos 4.5 on an 8CPU xeon server, and the suggested fix here - select count(*) from (select distinct product_id from table) as whatever - worked really well for me. I had slow queries that were bogging the system down, taking over several minutes, and now they are working in seconds. This was a great tip and yeilded a huge performance benefit in my situation. Thanks for the tip!

I think the problem is actually solved in 6.0.14:

77-52-7-73:5.1 openxs$ bin/mysql -uroot testReading table information for completion of table and column namesYou can turn off this feature to get a quicker startup with -A

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 2004
Server version: 6.0.14-alpha-debug Source distribution

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql>  select count(distinct a) from distq;
+-------------------+
| count(distinct a) |
+-------------------+
|            200000 |
+-------------------+
1 row in set (1.36 sec)

mysql>  select count(*) from (select distinct a from distq) x;
+----------+
| count(*) |
+----------+
|   200000 |
+----------+
1 row in set (1.36 sec)

mysql> explain select count(distinct a) from distq;
+----+-------------+-------+------+---------------+------+---------+------+--------+-------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows   | Extra |
+----+-------------+-------+------+---------------+------+---------+------+--------+-------+
|  1 | SIMPLE      | distq | ALL  | NULL          | NULL | NULL    | NULL | 200000 |       |
+----+-------------+-------+------+---------------+------+---------+------+--------+-------+
1 row in set (0.00 sec)

mysql> explain select count(*) from (select distinct a from distq) x;
+----+-------------+-------+------+---------------+------+---------+------+--------+------------------------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows   | Extra                        |
+----+-------------+-------+------+---------------+------+---------+------+--------+------------------------------+
|  1 | PRIMARY     | NULL  | NULL | NULL          | NULL | NULL    | NULL |   NULL | Select tables optimized away |
|  2 | DERIVED     | distq | ALL  | NULL          | NULL | NULL    | NULL | 200000 | Using temporary              |
+----+-------------+-------+------+---------------+------+---------+------+--------+------------------------------+
2 rows in set (1.36 sec)

mysql> alter table distq add index (a);
Query OK, 200000 rows affected (2.20 sec)
Records: 200000  Duplicates: 0  Warnings: 0

mysql>  select count(distinct a) from distq;
+-------------------+
| count(distinct a) |
+-------------------+
|            200000 |
+-------------------+
1 row in set (0.91 sec)

mysql>  select count(*) from (select distinct a from distq) x;
+----------+
| count(*) |
+----------+
|   200000 |
+----------+
1 row in set (1.07 sec)

mysql> explain select count(distinct a) from distq;
+----+-------------+-------+-------+---------------+------+---------+------+--------+-------------------------------------+
| id | select_type | table | type  | possible_keys | key  | key_len | ref  | rows   | Extra                               |
+----+-------------+-------+-------+---------------+------+---------+------+--------+-------------------------------------+
|  1 | SIMPLE      | distq | range | NULL          | a    | 23      | NULL | 200001 | Using index for group-by (scanning) |
+----+-------------+-------+-------+---------------+------+---------+------+--------+-------------------------------------+
1 row in set (0.00 sec)

mysql> explain select count(*) from (select distinct a from distq) x;
+----+-------------+-------+-------+---------------+------+---------+------+--------+------------------------------+
| id | select_type | table | type  | possible_keys | key  | key_len | ref  | rows   | Extra                        |
+----+-------------+-------+-------+---------------+------+---------+------+--------+------------------------------+
|  1 | PRIMARY     | NULL  | NULL  | NULL          | NULL | NULL    | NULL |   NULL | Select tables optimized away |
|  2 | DERIVED     | distq | index | NULL          | a    | 23      | NULL | 200000 | Using index                  |
+----+-------------+-------+-------+---------------+------+---------+------+--------+------------------------------+
2 rows in set (1.09 sec)

It is also solved in MySQL 5.5.0 (mysql-trunk tree to be released soon).

But even recent 5.42 is still affected:

77-52-7-73:5.1 openxs$ bin/mysql -uroot testReading table information for completion of table and column namesYou can turn off this feature to get a quicker startup with -A

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 2002
Server version: 5.1.42-debug Source distribution

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql>  select count(distinct a) from distq;
+-------------------+
| count(distinct a) |
+-------------------+
|            200000 |
+-------------------+
1 row in set (1.51 sec)

mysql> select count(*) from (select distinct a from distq) x;
+----------+
| count(*) |
+----------+
|   200000 |
+----------+
1 row in set (0.98 sec)

mysql> alter table distq add index (a);
Query OK, 200000 rows affected (1.84 sec)
Records: 200000  Duplicates: 0  Warnings: 0

mysql>  select count(distinct a) from distq;
+-------------------+
| count(distinct a) |
+-------------------+
|            200000 |
+-------------------+
1 row in set (1.37 sec)

mysql> select count(*) from (select distinct a from distq) x;
+----------+
| count(*) |
+----------+
|   200000 |
+----------+
1 row in set (0.77 sec)

mysql> explain select count(distinct a) from distq;+----+-------------+-------+-------+---------------+------+---------+------+--------+-------------+
| id | select_type | table | type  | possible_keys | key  | key_len | ref  | rows   | Extra       |
+----+-------------+-------+-------+---------------+------+---------+------+--------+-------------+
|  1 | SIMPLE      | distq | index | NULL          | a    | 23      | NULL | 200000 | Using index |
+----+-------------+-------+-------+---------------+------+---------+------+--------+-------------+
1 row in set (0.00 sec)

WL3220 is pushed into 5.5

Noted in 5.5.0 changelog.

SELECT COUNT(DISTINCT) was slow compared with SELECT DISTINCT. Now 
the server can use loose index scan for certain forms of aggregate
functions that use DISTINCT. See 
 http://dev.mysql.com/doc/refman/5.5/en/loose-index-scan.html