Bug #21849 | select count distinct has rotten performance on at least versions 4&5 of mysql. | ||
---|---|---|---|
Submitted: | 26 Aug 2006 10:07 | Modified: | 7 Apr 2010 14:54 |
Reporter: | Galt Barber | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Server: Optimizer | Severity: | S5 (Performance) |
Version: | 5.0.32-BK, 4 and 5, 5.1.42-bzr | OS: | Linux (linux) |
Assigned to: | Georgi Kodinov | CPU Architecture: | Any |
Tags: | count, distinct, SELECT |
[26 Aug 2006 10:07]
Galt Barber
[26 Aug 2006 13:46]
Valeriy Kravchuk
Thank you for a problem report. Please, send SHOW CREATE TABLE and SHOW TABLE STATUS results for sometable that demonstrates the problem described.
[26 Aug 2006 16:50]
Peter Laursen
I just tested with my chess database with + 4 mio. rows on 'localhost' (Windows) with a 1 Ghz PC and MySQL 5.0.24. /*four queries below that return the same value: '4150229'*/ /* 1 */ select count(*) from game as no_s; /* 2 */ select count(id) from game as no_s; /*these two are both instantaneous*/ /*however using the DISTINCT clause slows down things a lot - approx 30 seconds for this one: */ /* 3 */ select count(distinct id) from game as no_s; /*using a subquery is better here too for some reason */ /* 4 */ select count(*) from (select distinct id from game) as no_s; /*takes about 10 seconds*/ BTW: it is a little silly here that /* 3 */ takes that amount of time. As 'id' is the PK the DISTINCT clause could safely be ignored by the optimizer! That consideration would also apply to Unique Keys. The Schema in this case: CREATE TABLE `Game` ( `Id` int(11) NOT NULL, `CId` int(11) NOT NULL, `Idx` int(11) NOT NULL, `Attributes` smallint(6) NOT NULL default '0', `OCId` int(11) default NULL, `OIdx` int(11) default NULL, `PlyCount` int(11) NOT NULL default '0', `Result` tinyint(4) NOT NULL default '0', `WhiteId` int(11) NOT NULL, `BlackId` int(11) NOT NULL, `WhiteELO` int(11) default NULL, `BlackELO` int(11) default NULL, `EventId` int(11) NOT NULL, `SiteId` int(11) NOT NULL, `GameDate` date default NULL, `EventDate` date default NULL, `DateFlags` smallint(6) NOT NULL default '0', `OpeningId` int(11) NOT NULL, `ECO` char(3) default NULL, `AnnotatorId` int(11) NOT NULL, PRIMARY KEY (`Id`), KEY `Game_1` (`Idx`), KEY `Game_2` (`Attributes`), KEY `Game_3` (`Result`), KEY `Game_4` (`WhiteId`), KEY `Game_5` (`BlackId`), KEY `Game_6` (`WhiteELO`), KEY `Game_7` (`BlackELO`), KEY `Game_8` (`EventId`), KEY `Game_9` (`SiteId`), KEY `Game_10` (`GameDate`), KEY `Game_11` (`EventDate`), KEY `Game_12` (`OpeningId`), KEY `Game_13` (`ECO`), KEY `Game_14` (`AnnotatorId`), KEY `Game_15` (`CId`,`Id`), KEY `Game_16` (`CId`,`Idx`,`Id`) ) ENGINE=MyISAM DEFAULT CHARSET=utf8 .. it is a table from the 'jose' chess program available from MySQLForge. This program skips with a database of about 1.8 mio rows/games.
[26 Aug 2006 19:11]
Peter Laursen
It 'ships' - not 'skips' ....
[29 Aug 2006 22:54]
Galt Barber
CREATE TABLE `snp` ( `bin` smallint(5) unsigned NOT NULL default '0', `chrom` enum('unknown','chr','chr1','chr1_random','chr2','chr2_random','chr3','chr3_random','chr4','chr4_random','chr5','chr5_random','chr6','chr6_random','chr7','chr7_random','chr8','chr8_random','chr9','chr9_random','chr10','chr10_random','chr11','chr11_random','chr12','chr12_random','chr13','chr13_random','chr14','chr14_random','chr15','chr15_random','chr16','chr16_random','chr17','chr17_random','chr18','chr18_random','chr19','chr19_random','chr20','chr20_random','chr21','chr21_random','chr22','chr22_random','chr23','chr23_random','chr24','chr24_random','chr25','chr25_random','chr26','chr26_random','chr27','chr27_random','chr28','chr28_random','chr29','chr29_random','chr30','chr30_random','chr31','chr31_random','chr32','chr32_random','chr33','chr33_random','chr34','chr34_random','chr35','chr35_random','chr36','chr36_random','chr37','chr37_random','chr38','chr38_random','chr6_hla_hap1','chr6_hla_hap2','chrFinished','chr2L','chr2L_random','chr2R','chr2R_random','chr2h','chr2h_random','chr3L','chr3L_random','chr3R','chr3R_random','chr3h','chr3h_random','chrI','chrII','chrIII','chrIV','chrM','chrM_random','chrNA','chrNA_random','chrU','chrU_random','chrUn','chrUn_random','chrV','chrV_random','chrW','chrW_random','chrX','chrX_random','chrXh','chrXh_random','chrY','chrY_random','chrYh','chrYh_random','chrZ','chrZ_random') NOT NULL default 'unknown', `chromStart` int(10) unsigned NOT NULL default '0', `chromEnd` int(10) unsigned NOT NULL default '0', `name` varchar(255) NOT NULL default '', `score` int(10) unsigned NOT NULL default '0', `strand` enum('?','+','-') NOT NULL default '?', `observed` varchar(255) NOT NULL default '', `molType` enum('unknown','genomic','cDNA','mito','chloro') NOT NULL default 'unknown', `class` enum('unknown','snp','in-del','het','microsat','named','no-variation','mixed','mnp') NOT NULL default 'unknown', `valid` set('unknown','other-pop','by-frequency','by-cluster','by-2hit-2allele','by-hapmap','genotype') NOT NULL default 'unknown', `avHet` float NOT NULL default '0', `avHetSE` float NOT NULL default '0', `func` set('unknown','locus-region','coding','coding-synon','coding-nonsynon','mrna-utr','intron','splice-site','reference','exception') NOT NULL default 'unknown', `locType` enum('unknown','range','exact','between') NOT NULL default 'unknown', `source` enum('unknown','dbSnp','Affy10K','Affy10Kv2','Affy50K_HindIII','Affy50K_XbaI') NOT NULL default 'unknown', `exception` set('1','2','3','4','5','6','7','8','9','10','11','12','13','14','15','16','17','18','19','20','21','22','23','24') NOT NULL default '', KEY `chrom` (`chrom`,`bin`), KEY `chromStart` (`chrom`,`chromStart`), KEY `chromEnd` (`chrom`,`chromEnd`), KEY `name` (`name`) ) TYPE=MyISAM; ---------------- mysql> show table status like 'snp'\G *************************** 1. row *************************** Name: snp Type: MyISAM Row_format: Dynamic Rows: 8859645 Avg_row_length: 42 Data_length: 403807920 Max_data_length: 4294967295 Index_length: 396961792 Data_free: 23422220 Auto_increment: NULL Create_time: 2005-02-24 23:59:23 Update_time: 2005-03-07 11:18:58 Check_time: 2005-02-25 09:40:41 Create_options: Comment: 1 row in set (0.01 sec) ------------------------------ p.s. you can probably even use this even simpler substition for select count distinct somefield: select count(0) from (select 0 from sometable group by somefield) as x; select count(0) from (select 0 from snp group by name) as x; mysql> select count(0) from (select 0 from snp group by name) as x; +----------+ | count(0) | +----------+ | 8485546 | +----------+ 1 row in set (8.83 sec) mysql> select count(*) from snp; +----------+ | count(*) | +----------+ | 8859645 | +----------+ 1 row in set (0.00 sec) ------------------------------------ I found that this bug was reported clear over a year ago in 2005. It's long overdue for a fix now. -Galt
[29 Aug 2006 23:12]
Galt Barber
I added the info you wanted as a comment.
[25 Sep 2006 12:56]
Valeriy Kravchuk
Sorry, but I was not able to repeat the behaviour described with your table and approx. 1 million of (random) rows, on 5.0.26-BK running on Linux: openxs@suse:~/dbs/5.0> bin/mysql -uroot test Reading table information for completion of table and column names You can turn off this feature to get a quicker startup with -A Welcome to the MySQL monitor. Commands end with ; or \g. Your MySQL connection id is 6 to server version: 5.0.26-debug Type 'help;' or '\h' for help. Type '\c' to clear the buffer. mysql> CREATE TABLE `snp` ( -> `bin` smallint(5) unsigned NOT NULL default '0', -> `chrom` -> enum('unknown','chr','chr1','chr1_random','chr2','chr2_random','chr3','c hr3_rand '> om','chr4','chr4_random','chr5','chr5_random','chr6','chr6_random','chr7 ','chr7_ '> random','chr8','chr8_random','chr9','chr9_random','chr10','chr10_random' ,'chr11' -> ,'chr11_random','chr12','chr12_random','chr13','chr13_random','chr14','c hr14_ran '> dom','chr15','chr15_random','chr16','chr16_random','chr17','chr17_random ','chr18 '> ','chr18_random','chr19','chr19_random','chr20','chr20_random','chr21',' chr21_ra '> ndom','chr22','chr22_random','chr23','chr23_random','chr24','chr24_rando m','chr2 '> 5','chr25_random','chr26','chr26_random','chr27','chr27_random','chr28', 'chr28_r '> andom','chr29','chr29_random','chr30','chr30_random','chr31','chr31_rand om','chr '> 32','chr32_random','chr33','chr33_random','chr34','chr34_random','chr35' ,'chr35_ '> random','chr36','chr36_random','chr37','chr37_random','chr38','chr38_ran dom','ch '> r6_hla_hap1','chr6_hla_hap2','chrFinished','chr2L','chr2L_random','chr2R ','chr2R '> _random','chr2h','chr2h_random','chr3L','chr3L_random','chr3R','chr3R_ra ndom','c '> hr3h','chr3h_random','chrI','chrII','chrIII','chrIV','chrM','chrM_random ','chrNA '> ','chrNA_random','chrU','chrU_random','chrUn','chrUn_random','chrV','chr V_random '> ','chrW','chrW_random','chrX','chrX_random','chrXh','chrXh_random','chrY ','chrY_ '> random','chrYh','chrYh_random','chrZ','chrZ_random') NOT NULL default -> 'unknown', -> `chromStart` int(10) unsigned NOT NULL default '0', -> `chromEnd` int(10) unsigned NOT NULL default '0', -> `name` varchar(255) NOT NULL default '', -> `score` int(10) unsigned NOT NULL default '0', -> `strand` enum('?','+','-') NOT NULL default '?', -> `observed` varchar(255) NOT NULL default '', -> `molType` enum('unknown','genomic','cDNA','mito','chloro') NOT NULL de fault -> 'unknown', -> `class` -> enum('unknown','snp','in-del','het','microsat','named','no-variation','m ixed','m '> np') NOT NULL default 'unknown', -> `valid` -> set('unknown','other-pop','by-frequency','by-cluster','by-2hit-2allele', 'by-hapm '> ap','genotype') NOT NULL default 'unknown', -> `avHet` float NOT NULL default '0', -> `avHetSE` float NOT NULL default '0', -> `func` -> set('unknown','locus-region','coding','coding-synon','coding-nonsynon',' mrna-utr '> ','intron','splice-site','reference','exception') NOT NULL default 'unkn own', -> `locType` enum('unknown','range','exact','between') NOT NULL default -> 'unknown', -> `source` -> enum('unknown','dbSnp','Affy10K','Affy10Kv2','Affy50K_HindIII','Affy50K_ XbaI') -> NOT NULL default 'unknown', -> `exception` -> set('1','2','3','4','5','6','7','8','9','10','11','12','13','14','15','1 6','17', -> '18','19','20','21','22','23','24') NOT NULL default '', KEY `chrom` -> (`chrom`,`bin`), -> KEY `chromStart` (`chrom`,`chromStart`), -> KEY `chromEnd` (`chrom`,`chromEnd`), -> KEY `name` (`name`) -> ) TYPE=MyISAM; Query OK, 0 rows affected, 1 warning (0.02 sec) mysql> insert into snp(name) values(rand()); Query OK, 1 row affected (0.01 sec) mysql> insert into snp(name) select rand() from snp; Query OK, 1 row affected (0.01 sec) Records: 1 Duplicates: 0 Warnings: 0 mysql> insert into snp(name) select rand() from snp; Query OK, 2 rows affected (0.00 sec) Records: 2 Duplicates: 0 Warnings: 0 ... mysql> insert into snp(name) select rand() from snp; Query OK, 524288 rows affected (1 min 43.68 sec) Records: 524288 Duplicates: 0 Warnings: 0 mysql> select count(distinct name) from snp; +----------------------+ | count(distinct name) | +----------------------+ | 1010548 | +----------------------+ 1 row in set (2 min 6.35 sec) mysql> select count(*) from (select distinct name from snp group by name) as t1 ; +----------+ | count(*) | +----------+ | 1010548 | +----------+ 1 row in set (2 min 10.63 sec) mysql> explain select count(distinct name) from snp; name from +----+-------------+-------+-------+---------------+------+---------+------+---- -----+-------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | row s | Extra | +----+-------------+-------+-------+---------------+------+---------+------+---- -----+-------------+ | 1 | SIMPLE | snp | index | NULL | name | 257 | NULL | 104 8576 | Using index | +----+-------------+-------+-------+---------------+------+---------+------+---- -----+-------------+ 1 row in set (0.01 sec) mysql> explain select count(*) from (select distinct name from snp group by nam e) as t1; +----+-------------+-------+-------+---------------+------+---------+------+---- -----+------------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | row s | Extra | +----+-------------+-------+-------+---------------+------+---------+------+---- -----+------------------------------+ | 1 | PRIMARY | NULL | NULL | NULL | NULL | NULL | NULL | NULL | Select tables optimized away | | 2 | DERIVED | snp | index | NULL | name | 257 | NULL | 104 8576 | Using index | +----+-------------+-------+-------+---------------+------+---------+------+---- -----+------------------------------+ 2 rows in set (2 min 12.00 sec) In both cases we see "index" as access method, so index is full-scanned. So, please, send your my.cnf content and a smallest set of data for the table that demonstate the behaviour described.
[25 Sep 2006 14:36]
Martin Friebe
Galt, if you can reproduce this with the below, please reopen the bug (because I cannot re-open it), so I do not need to report a duplicate the following testscript, reproduces this for me, on freebsd with 4.1.20 / 4.1.21 and 5.0.22 ( build from ports, with linux-thread ) the select count( distinct a) takes more than twice the time. The situation gets worse, if there are more unique values (increase the number in the per from 2000 to 10000. create table distq (a varchar(20)); perl -wle '$a="12221232113422"; for (1..2000){$b=""; for(1..100){$b.="($a),";$a++;} $b=~s/,$//;`mysql -u root xxx -e "insert into distq values $b"`; print $_ if $_=~/00$/;}' insert into distq select * from distq; insert into distq select * from distq; insert into distq select * from distq; alter table distq add index (a); select count(*) from distq; reset query cache; select count(distinct a) from distq; reset query cache; select count(*) from (select distinct a from distq) x; drop table distq; select count(*) from distq; +----------+ | count(*) | +----------+ | 1600000 | +----------+ 1 row in set (0.00 sec) reset query cache; select count(distinct a) from distq; +-------------------+ | count(distinct a) | +-------------------+ | 200000 | +-------------------+ 1 row in set (5.26 sec) reset query cache; select count(*) from (select distinct a from distq) x; +----------+ | count(*) | +----------+ | 200000 | +----------+ 1 row in set (1.90 sec) my.cnf [mysqld] server-id=55 skip-name-resolve skip-show-database skip-external-locking set-variable = query_cache_wlock_invalidate=1
[25 Sep 2006 14:44]
Martin Friebe
sorry one more comment, my test script is probably not needed, The sql Valeriy Kravchuk used, causes the issue on my box too.
[25 Sep 2006 14:59]
Martin Friebe
I found that this makes a noticable difference: set max_heap_table_size = 3721809920; # faster count(distinct ) set max_heap_table_size = 16777216; # slower just in case, all remaining variables (will attach)
[26 Sep 2006 6:50]
Valeriy Kravchuk
Martin, Thank you for the hint. I'll re-check with increased max_heap_table_size. But looks like the reason is more understandable now...
[26 Sep 2006 10:49]
Alex Rutgers
I've seen simulair issues with a query that does this SELECT portal_code as portal, site_code as site, page_name as page, page_id as page_id, DATE_FORMAT(datetime, '%Y-%m-%d') as logdate , count(*) as hits, count(distinct msisdn) as msisdn FROM viewlog WHERE datetime >? AND datetime < ? GROUP BY portal_code, site_code, page_name, page_id, DATE_FORMAT(datetime, '%Y-%m-%d') ORDER BY DATE_FORMAT(datetime, '%Y-%m-%d')}; the table viewlog is having > 1 mil. rows for the where clause. I tracked it down to a mass write i/o operation to the mysqltmp partition, with iostat -x 3 I measured 110% disk write utilisation for the time of the query - basicly shutting down our database system for the runtime of the query where we do not have the luxery of a dedicated raid system for mysqltmp. I tried to increase the sort_buffer_site to 4GB (we have 8GB in the machine), but that does not help (also we run a 32bit os so higher values are pointless as mysqld cannot allocate more memory). As workaround we removed the count(distinct msisdn) as msisdn from the query and solved in in application sorting. That made it cpu bound (better to control with nice) and improved performance with 50%.
[27 Sep 2006 11:49]
Martin Friebe
I found some more info (which I believ to be interesting), I forced all temp tables to disk by "set max_heap_table_size = 0;" (it does set it to 16k, seems the minimum). on the example from the initial test: select sum(length(name)) from snp ; # about 14.4 MB select sum(length(name)) from (select distinct name from snp group by name) as t1; # about 14.3 MB select count(distinct name) from snp; # creates a temp table on disk, with an index of about 15 MB select count(*) from (select distinct name from snp group by name) as t1; # creates a temp table on disk with a dada file of over 50 MB ??? (can any one verify that?; if yes, this may be a seperate bug) create table snp1 select distinct name from snp group by name; # creates a datafile, of about 15 MB This makes me believe this is not a bug, or a dublicate of bug #22487 on the count(distinct name) mysql creates a new index. I believe this is correct, because mysql has to read all values from the original table (even if it does an index read), and then remove duplicate values. (there could be an optimization issue, for a very few cases, where a count(distinct x), could use an existing index...) The 2nd example (with subquery), can use the original existing query to retrieve only unique indexes. this makes it much faster (depending on how much time a system spends on creating a new unique index). However the 2nd query can loose time on writing a far to large data file (at least on my box), therefore depending on tyour disk speed, it may be as slow as the 1st, or it may be faster, or even slower.... If my assumptions are correct, this is a duplicate of bug #22487 Can anyone check, if they experience the same oversized temp table, if running the subquery? If so, I can report that in a separate bug. Thanks
[2 Oct 2006 19:01]
Galt Barber
testing on kkr6u00: create table galtSnp (name varchar(255)); load data local infile 'hg16.snp.name.dump' into table galtSnp; (The data is 91MB in 8.8 million rows) (NOTE: it has one extra header row "name" because I forgot to add -BN to dump command) mysql> select SQL_NO_CACHE count(distinct name) from galtSnp; +----------------------+ | count(distinct name) | +----------------------+ | 8485547 | +----------------------+ 1 row in set (1 min 31.01 sec) mysql> select * from galtSnp limit 5; +------------+ | name | +------------+ | name | | rs10 | | rs1000 | | rs1000000 | | rs10000000 | +------------+ 5 rows in set (0.00 sec) NOTE: first we will do it with NO INDEX. mysql> select SQL_NO_CACHE count(distinct name) from galtSnp; +----------------------+ | count(distinct name) | +----------------------+ | 8485547 | +----------------------+ 1 row in set (38.14 sec) mysql> select SQL_NO_CACHE count(0) from (select 0 from galtSnp group by name) a s x; ERROR 3 (HY000): Error writing file '/tmp/MYXjXmp8' (Errcode: 28) mysql> show warnings; +-------+------+--------------------------------------------------+ | Level | Code | Message | +-------+------+--------------------------------------------------+ | Error | 3 | Error writing file '/tmp/MYXjXmp8' (Errcode: 28) | | Error | 1028 | Sort aborted | | Error | 1105 | Unknown error | +-------+------+--------------------------------------------------+ 3 rows in set (0.03 sec) freespace on /tmp before query: [kkr6u00.kilokluster.ucsc.edu:/tmp> df . -h Filesystem Size Used Avail Use% Mounted on /dev/sda3 4.9G 1.1G 3.6G 23% / As it fails: Filesystem Size Used Avail Use% Mounted on /dev/sda3 4.9G 4.6G 0 100% / This is some bug. NOTE: Now, we will do it with an INDEX on name. create index name on galtSnp(name); mysql> select SQL_NO_CACHE count(0) from (select 0 from galtSnp group by name) as x; +----------+ | count(0) | +----------+ | 8485547 | +----------+ 1 row in set (8.50 sec) So, clearly it is heavily index dependent. ------------ Now I'll have with and without index versions: create table galtSnp2 as select * from galtSnp; Q: Does it make any difference to include actual name? select SQL_NO_CACHE count(name) from (select name from galtSnp group by name) as x; 1 row in set (27.81 sec) A1: yes, but still consumes about 2.1 GB temp space and runs slower than the count(0) version. select SQL_NO_CACHE count(name) from (select name from galtSnp2 group by name) as x; A2: no, without index, it still bombs out of space, so using name instead of 0 made no difference. [kkr6u00.kilokluster.ucsc.edu:/tmp> df . -h Filesystem Size Used Avail Use% Mounted on /dev/sda3 4.9G 4.6G 0 100% / THIS IS SOME OTHER BUG! ----------- Basic stats for comparison: (INDEX) mysql> select SQL_NO_CACHE count(distinct(name)) from galtSnp; +-----------------------+ | count(distinct(name)) | +-----------------------+ | 8485547 | +-----------------------+ 1 row in set (46.45 sec) (NO INDEX) mysql> select SQL_NO_CACHE count(distinct(name)) from galtSnp2; +-----------------------+ | count(distinct(name)) | +-----------------------+ | 8485547 | +-----------------------+ 1 row in set (2 min 19.63 sec) Here's the command-line utility piping version: [kkr6u00.kilokluster.ucsc.edu:~> date;hgsql hg16 -e 'select name from galtSnp2' | sort -u | wc -l;date Mon Oct 2 11:51:01 PDT 2006 8485547 Mon Oct 2 11:51:17 PDT 2006 That's only 16 seconds! If mysql can't beat this, they are doing something wrong. This command-line piping version uses no index, yet it is 3x as fast as mysql5 with an index, and more than 8x faster than mysql5 without an index. So, it is true that my fix would only make things better in mysql 5.11 if the field for the select count(distinct field) has an index. Otherwise, due to some bug, mysql will consume unknown gigs and probably die.
[2 Oct 2006 20:23]
Martin Friebe
I have reproduced the disk space usage, and added in Bug #22905
[28 Nov 2006 15:17]
Valeriy Kravchuk
Martin, I was able to repeat the behaviour you described with 5.0.32-BK on Linux: openxs@suse:~/dbs/5.0> bin/mysql -uroot test Reading table information for completion of table and column names You can turn off this feature to get a quicker startup with -A Welcome to the MySQL monitor. Commands end with ; or \g. Your MySQL connection id is 21 Server version: 5.0.32-debug Source distribution Type 'help;' or '\h' for help. Type '\c' to clear the buffer. mysql> create table distq (a varchar(20)); Query OK, 0 rows affected (0.01 sec) mysql> exit Bye openxs@suse:~/dbs/5.0> perl -wle '$a="12221232113422"; for (1..2000){$b=""; > for(1..100){$b.="($a),";$a++;} $b=~s/,$//;`bin/mysql -u root test -e "insert into distq values $b"`; print $_ if $_=~/00$/;}' 100 200 300 400 500 600 700 800 900 1000 1100 1200 1300 1400 1500 1600 1700 1800 1900 2000 openxs@suse:~/dbs/5.0> bin/mysql -uroot test 422"; Reading table information for completion of table and column names You can turn off this feature to get a quicker startup with -A Welcome to the MySQL monitor. Commands end with ; or \g. Your MySQL connection id is 2022 Server version: 5.0.32-debug Source distribution Type 'help;' or '\h' for help. Type '\c' to clear the buffer. mysql> insert into distq select * from distq; Query OK, 200000 rows affected (0.45 sec) Records: 200000 Duplicates: 0 Warnings: 0 mysql> insert into distq select * from distq; Query OK, 400000 rows affected (0.87 sec) Records: 400000 Duplicates: 0 Warnings: 0 mysql> insert into distq select * from distq; Query OK, 800000 rows affected (2.09 sec) Records: 800000 Duplicates: 0 Warnings: 0 mysql> alter table distq add index (a); Query OK, 1600000 rows affected (22.35 sec) Records: 1600000 Duplicates: 0 Warnings: 0 mysql> select count(*) from distq; +----------+ | count(*) | +----------+ | 1600000 | +----------+ 1 row in set (0.01 sec) But note that count(*) just means "number of rows", and for MyISAM table it is known. mysql> reset query cache; select count(distinct a) from distq; Query OK, 0 rows affected (0.00 sec) +-------------------+ | count(distinct a) | +-------------------+ | 200000 | +-------------------+ 1 row in set (9.00 sec) mysql> reset query cache; select count(*) from (select distinct a from distq) x ; Query OK, 0 rows affected (0.01 sec) +----------+ | count(*) | +----------+ | 200000 | +----------+ 1 row in set (2.96 sec) While counting distinct values of non-unique column(!), that allows NULL values, requires index scan. Look at the explain results: mysql> explain select count(*) from distq\G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: NULL type: NULL possible_keys: NULL key: NULL key_len: NULL ref: NULL rows: NULL Extra: Select tables optimized away 1 row in set (0.00 sec) mysql> explain select count(distinct a) from distq\G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: distq type: index possible_keys: NULL key: a key_len: 23 ref: NULL rows: 1600000 Extra: Using index 1 row in set (0.00 sec) mysql> explain select count(*) from (select distinct a from distq) x\G *************************** 1. row *************************** id: 1 select_type: PRIMARY table: NULL type: NULL possible_keys: NULL key: NULL key_len: NULL ref: NULL rows: NULL Extra: Select tables optimized away *************************** 2. row *************************** id: 2 select_type: DERIVED table: distq type: index possible_keys: NULL key: a key_len: 23 ref: NULL rows: 1600000 Extra: Using index 2 rows in set (2.95 sec) The only "possible bug" I can see here is why: mysql> select count(distinct a) from distq; +-------------------+ | count(distinct a) | +-------------------+ | 200000 | +-------------------+ 1 row in set (8.78 sec) mysql> select count(distinct a) from distq; +-------------------+ | count(distinct a) | +-------------------+ | 200000 | +-------------------+ 1 row in set (8.71 sec) is constantly(!) slower than mysql> select count(*) from (select distinct a from distq) x; +----------+ | count(*) | +----------+ | 200000 | +----------+ 1 row in set (2.97 sec) If this is inavidable and reasonable, optimizer should just rewrite queries to "faster" ones.
[28 Nov 2006 15:42]
Martin Friebe
Yes the problem is the count(distinct) VS the count-from-distinct-subquery It may turn out hard to optimize, in cases the query does not stand alone. My understanding is: the subquery version, scans an existing index (index read) for unique values. this is very quick. The count(distinct) reads the table (full table scan, even if it is from the index / a full index scan), and builds its own index for the count distinct. In this stand-alone version of count distinct, this is obviously not very sensitive, but if we have a select a, count(distinct b) from x group by a; then mysql may already use the index on a, and cannot use an existing index for b. so the index must be build. 2 thinks came to my mind: (both may be individual feature request, but I put them down here, and let you decide: 1) count distinct could use existing indexes, if no other index is used, OR if the field(s) in question are on the current used index (combined key), and following the used fields immediately: select a, count(distinct b) from x group by a; # with an index k1 (a,b) could use k1 and scan the unique entries for b, on each value for a; or use an optimized count-distinct, which knows the table is read "order by a" for each count-distinct (so only the last value needs to be remembered, and the counter increases on change) This will help some but not all cases 2) where a new index gets created. I started looking through the code, but did not get to finish this, so the below may be wrong. It appears that a unique index will be build for each count(distinct). on building unique indexes, mysql does a read access before each write (so no roll-back will be needed on duplicate values). This is important, if the index file contains more than one index. I am not sure if it could be avoided, in this special case, where (on the tmp table for the count-distinct) only one index exists (I believe the index-write returns an error on dups anyway?)
[21 Mar 2007 11:12]
Georgi Kodinov
The difference in performance between (among others) SELECT COUNT(*) FROM (SELECT DISTINCT b FROM t1 GROUP BY a) dt1 and SELECT COUNT(DISTINCT b) FROM t1 GROUP BY a for CREATE TABLE t1 (a INTEGER, b INTEGER, KEY (a,b)); comes (in 5.0 and later) from the fact that loose index scan can be used in the subquery case (see http://dev.mysql.com/doc/refman/5.1/en/loose-index-scan.html). However currently loose index scan will not be used for the COUNT(DISTINCT) query above (although it is an equivalent of the subquery statement). We are already aware of that limitation and are working towards lifting it. See http://forge.mysql.com/worklog/task.php?id=3220 for more details.
[19 Sep 2007 16:52]
Bill Goodyear
FYI, im running mysql 4 on Centos 4.5 on an 8CPU xeon server, and the suggested fix here - select count(*) from (select distinct product_id from table) as whatever - worked really well for me. I had slow queries that were bogging the system down, taking over several minutes, and now they are working in seconds. This was a great tip and yeilded a huge performance benefit in my situation. Thanks for the tip!
[30 Nov 2009 19:38]
Valeriy Kravchuk
I think the problem is actually solved in 6.0.14: 77-52-7-73:5.1 openxs$ bin/mysql -uroot testReading table information for completion of table and column namesYou can turn off this feature to get a quicker startup with -A Welcome to the MySQL monitor. Commands end with ; or \g. Your MySQL connection id is 2004 Server version: 6.0.14-alpha-debug Source distribution Type 'help;' or '\h' for help. Type '\c' to clear the current input statement. mysql> select count(distinct a) from distq; +-------------------+ | count(distinct a) | +-------------------+ | 200000 | +-------------------+ 1 row in set (1.36 sec) mysql> select count(*) from (select distinct a from distq) x; +----------+ | count(*) | +----------+ | 200000 | +----------+ 1 row in set (1.36 sec) mysql> explain select count(distinct a) from distq; +----+-------------+-------+------+---------------+------+---------+------+--------+-------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+-------+------+---------------+------+---------+------+--------+-------+ | 1 | SIMPLE | distq | ALL | NULL | NULL | NULL | NULL | 200000 | | +----+-------------+-------+------+---------------+------+---------+------+--------+-------+ 1 row in set (0.00 sec) mysql> explain select count(*) from (select distinct a from distq) x; +----+-------------+-------+------+---------------+------+---------+------+--------+------------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+-------+------+---------------+------+---------+------+--------+------------------------------+ | 1 | PRIMARY | NULL | NULL | NULL | NULL | NULL | NULL | NULL | Select tables optimized away | | 2 | DERIVED | distq | ALL | NULL | NULL | NULL | NULL | 200000 | Using temporary | +----+-------------+-------+------+---------------+------+---------+------+--------+------------------------------+ 2 rows in set (1.36 sec) mysql> alter table distq add index (a); Query OK, 200000 rows affected (2.20 sec) Records: 200000 Duplicates: 0 Warnings: 0 mysql> select count(distinct a) from distq; +-------------------+ | count(distinct a) | +-------------------+ | 200000 | +-------------------+ 1 row in set (0.91 sec) mysql> select count(*) from (select distinct a from distq) x; +----------+ | count(*) | +----------+ | 200000 | +----------+ 1 row in set (1.07 sec) mysql> explain select count(distinct a) from distq; +----+-------------+-------+-------+---------------+------+---------+------+--------+-------------------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+-------+-------+---------------+------+---------+------+--------+-------------------------------------+ | 1 | SIMPLE | distq | range | NULL | a | 23 | NULL | 200001 | Using index for group-by (scanning) | +----+-------------+-------+-------+---------------+------+---------+------+--------+-------------------------------------+ 1 row in set (0.00 sec) mysql> explain select count(*) from (select distinct a from distq) x; +----+-------------+-------+-------+---------------+------+---------+------+--------+------------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+-------+-------+---------------+------+---------+------+--------+------------------------------+ | 1 | PRIMARY | NULL | NULL | NULL | NULL | NULL | NULL | NULL | Select tables optimized away | | 2 | DERIVED | distq | index | NULL | a | 23 | NULL | 200000 | Using index | +----+-------------+-------+-------+---------------+------+---------+------+--------+------------------------------+ 2 rows in set (1.09 sec)
[30 Nov 2009 19:40]
Valeriy Kravchuk
It is also solved in MySQL 5.5.0 (mysql-trunk tree to be released soon).
[30 Nov 2009 19:43]
Valeriy Kravchuk
But even recent 5.42 is still affected: 77-52-7-73:5.1 openxs$ bin/mysql -uroot testReading table information for completion of table and column namesYou can turn off this feature to get a quicker startup with -A Welcome to the MySQL monitor. Commands end with ; or \g. Your MySQL connection id is 2002 Server version: 5.1.42-debug Source distribution Type 'help;' or '\h' for help. Type '\c' to clear the current input statement. mysql> select count(distinct a) from distq; +-------------------+ | count(distinct a) | +-------------------+ | 200000 | +-------------------+ 1 row in set (1.51 sec) mysql> select count(*) from (select distinct a from distq) x; +----------+ | count(*) | +----------+ | 200000 | +----------+ 1 row in set (0.98 sec) mysql> alter table distq add index (a); Query OK, 200000 rows affected (1.84 sec) Records: 200000 Duplicates: 0 Warnings: 0 mysql> select count(distinct a) from distq; +-------------------+ | count(distinct a) | +-------------------+ | 200000 | +-------------------+ 1 row in set (1.37 sec) mysql> select count(*) from (select distinct a from distq) x; +----------+ | count(*) | +----------+ | 200000 | +----------+ 1 row in set (0.77 sec) mysql> explain select count(distinct a) from distq;+----+-------------+-------+-------+---------------+------+---------+------+--------+-------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+-------+-------+---------------+------+---------+------+--------+-------------+ | 1 | SIMPLE | distq | index | NULL | a | 23 | NULL | 200000 | Using index | +----+-------------+-------+-------+---------------+------+---------+------+--------+-------------+ 1 row in set (0.00 sec)
[7 Apr 2010 10:06]
Georgi Kodinov
WL3220 is pushed into 5.5
[7 Apr 2010 14:54]
Paul DuBois
Noted in 5.5.0 changelog. SELECT COUNT(DISTINCT) was slow compared with SELECT DISTINCT. Now the server can use loose index scan for certain forms of aggregate functions that use DISTINCT. See http://dev.mysql.com/doc/refman/5.5/en/loose-index-scan.html