MySQL Bugs: #38213: MySQL does not seem to execute select count (distinct) efficiently

Bug #38213	MySQL does not seem to execute select count (distinct) efficiently
Submitted:	17 Jul 2008 20:46	Modified:	7 Apr 2010 14:54
Reporter:	Zardosht Kasheff (OCA)	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Optimizer	Severity:	S3 (Non-critical)
Version:	5.1.23-rc	OS:	Any
Assigned to:	Georgi Kodinov	CPU Architecture:	Any

Description:
take a table with the following definition:

| z1    | CREATE TABLE `z1` (
  `a` int(11) NOT NULL DEFAULT '0',
  `b` int(11) NOT NULL DEFAULT '0',
  `c` int(11) DEFAULT NULL,
  `d` int(11) DEFAULT NULL,
  PRIMARY KEY (`a`,`b`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 |

and take the following query:

select count(distinct a) from z1;

This query seems to execute inefficiently. 

Suppose one has a table with 1 million rows where the primary keys are (1,1), (1,2), ...(1,1000),(2,1), (2,2)...(2,1000), ...(1000,1)...(1000,1000). When one does the following query: "select distinct a from z1;" this query runs very fast. The reason this runs very fast is that MySQL makes successive index_read calls to the handler with the flag HA_READ_AFTER_KEY. However, when executing "select count(distinct a) from z1;", MySQL makes successive index_next calls to the handler. The difference here is that the index_read calls can skip over duplicate values of 'a', whereas the index_next call is processing them. This seems inefficient. The example above shows a dramatic difference in performance between the two query executions.

The resulting behavior is that the query "select distinct a ..." runs noticeably faster than "select count (distinct a)...".

How to repeat:
explained above

Suggested fix:
"select count (distinct a) ..." should make successive index_read calls with the flag HA_READ_AFTER_KEY.

Thank you for a problem report. It sounds like a reasonable feature request to apply a loose index scan in case of count(distinct ...). See http://dev.mysql.com/doc/refman/5.1/en/loose-index-scan.html. Please, check.

I guess the main point is that select count (distinct) should perform just as efficiently as select distinct. Right now, select distinct is much more efficient. I could not quite tell from the link of select distinct is done with a loose index scan, but if so, then yes, that is what I am proposing.

I'd also like to request additional comment in EXPLAIN/EXPLAIN EXTENDED when/if loose index scan is used.

WL3220 that implements this is pushed to 5.5

Noted in 5.5.0 changelog.

SELECT COUNT(DISTINCT) was slow compared with SELECT DISTINCT. Now 
the server can use loose index scan for certain forms of aggregate
functions that use DISTINCT. See 
 http://dev.mysql.com/doc/refman/5.5/en/loose-index-scan.html