Bug #38213 MySQL does not seem to execute select count (distinct) efficiently
Submitted: 17 Jul 2008 20:46 Modified: 7 Apr 2010 14:54
Reporter: Zardosht Kasheff (OCA) Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Optimizer Severity:S3 (Non-critical)
Version:5.1.23-rc OS:Any
Assigned to: Georgi Kodinov CPU Architecture:Any

[17 Jul 2008 20:46] Zardosht Kasheff
Description:
take a table with the following definition:

| z1    | CREATE TABLE `z1` (
  `a` int(11) NOT NULL DEFAULT '0',
  `b` int(11) NOT NULL DEFAULT '0',
  `c` int(11) DEFAULT NULL,
  `d` int(11) DEFAULT NULL,
  PRIMARY KEY (`a`,`b`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 |

and take the following query:

select count(distinct a) from z1;

This query seems to execute inefficiently. 

Suppose one has a table with 1 million rows where the primary keys are (1,1), (1,2), ...(1,1000),(2,1), (2,2)...(2,1000), ...(1000,1)...(1000,1000). When one does the following query: "select distinct a from z1;" this query runs very fast. The reason this runs very fast is that MySQL makes successive index_read calls to the handler with the flag HA_READ_AFTER_KEY. However, when executing "select count(distinct a) from z1;", MySQL makes successive index_next calls to the handler. The difference here is that the index_read calls can skip over duplicate values of 'a', whereas the index_next call is processing them. This seems inefficient. The example above shows a dramatic difference in performance between the two query executions.

The resulting behavior is that the query "select distinct a ..." runs noticeably faster than "select count (distinct a)...".

How to repeat:
explained above

Suggested fix:
"select count (distinct a) ..." should make successive index_read calls with the flag HA_READ_AFTER_KEY.
[18 Jul 2008 3:59] Valeriy Kravchuk
Thank you for a problem report. It sounds like a reasonable feature request to apply a loose index scan in case of count(distinct ...). See http://dev.mysql.com/doc/refman/5.1/en/loose-index-scan.html. Please, check.
[18 Jul 2008 14:30] Zardosht Kasheff
I guess the main point is that select count (distinct) should perform just as efficiently as select distinct. Right now, select distinct is much more efficient. I could not quite tell from the link of select distinct is done with a loose index scan, but if so, then yes, that is what I am proposing.
[18 Jul 2008 20:05] Valeriy Kravchuk
I'd also like to request additional comment in EXPLAIN/EXPLAIN EXTENDED when/if loose index scan is used.
[7 Apr 2010 10:58] Georgi Kodinov
WL3220 that implements this is pushed to 5.5
[7 Apr 2010 14:54] Paul DuBois
Noted in 5.5.0 changelog.

SELECT COUNT(DISTINCT) was slow compared with SELECT DISTINCT. Now 
the server can use loose index scan for certain forms of aggregate
functions that use DISTINCT. See 
 http://dev.mysql.com/doc/refman/5.5/en/loose-index-scan.html