MySQL Bugs: #30283: Strange behavior with COUNT(DISTINCT())

Bug #30283	Strange behavior with COUNT(DISTINCT())
Submitted:	7 Aug 2007 19:28	Modified:	8 Aug 2007 7:13
Reporter:	Ben Vaughn	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: Optimizer	Severity:	S4 (Feature request)
Version:	5.1.20, 5.0.46	OS:	Any (Mac OS X Server 10.3.9)
Assigned to:		CPU Architecture:	Any
Tags:	select distinct slow speed table scan

Description:
It seems as if SELECT COUNT(DISTINCT()) is broken and not following any optimized route to get a quicker answer.  While SELECT DISTINCT m_date is using the index on the m_date field, SELECT COUNT(DISTINCT(m_date)) is causing full table scan.  Example below.

How to repeat:
CREATE TABLE `m_msg` (
  `m_md5` char(32) NOT NULL DEFAULT '00000000000000000000000000000000',
  `m_from` varchar(128) NOT NULL DEFAULT '',
  `m_to` varchar(128) NOT NULL DEFAULT '',
  `m_cc` text NOT NULL,
  `m_subject` text NOT NULL,
  `m_attach` tinyint(1) NOT NULL DEFAULT '0',
  `m_image` tinyint(1) NOT NULL DEFAULT '0',
  `m_date` date NOT NULL DEFAULT '0000-00-00',
  `m_time` time NOT NULL DEFAULT '00:00:00',
  `m_attach_list` blob NOT NULL,
  `m_src_dept` char(5) NOT NULL DEFAULT 'UNKWN',
  `m_dst_dept` char(5) NOT NULL DEFAULT 'UNKWN',
  `m_score` int(10) NOT NULL DEFAULT '0',
  `m_weight` mediumint(9) NOT NULL DEFAULT '0',
  PRIMARY KEY (`m_md5`,`m_date`),
  KEY `i_date` (`m_date`),
  KEY `i_from` (`m_from`(25)),
  KEY `i_to` (`m_to`(25)),
  KEY `i_attach` (`m_attach`),
  KEY `i_image` (`m_image`)
)

(insert 28 million rows, 200k rows per date in the m_date column)

14:22 mysql-5.1.20-beta-log (monitorng) [4]> explain select distinct(m_date) from m_msg;
+----+-------------+-------+-------+---------------+--------+---------+------+------+--------------------------+
| id | select_type | table | type  | possible_keys | key    | key_len | ref  | rows | Extra                    |
+----+-------------+-------+-------+---------------+--------+---------+------+------+--------------------------+
|  1 | SIMPLE      | m_msg | range | NULL          | i_date | 3       | NULL |   10 | Using index for group-by |
+----+-------------+-------+-------+---------------+--------+---------+------+------+--------------------------+
1 row in set (0.01 sec)

14:22 mysql-5.1.20-beta-log (monitorng) [5]> explain select count(distinct(m_date)) from m_msg;
+----+-------------+-------+-------+---------------+--------+---------+------+----------+-------------+
| id | select_type | table | type  | possible_keys | key    | key_len | ref  | rows     | Extra       |
+----+-------------+-------+-------+---------------+--------+---------+------+----------+-------------+
|  1 | SIMPLE      | m_msg | index | NULL          | i_date | 3       | NULL | 27680916 | Using index |
+----+-------------+-------+-------+---------------+--------+---------+------+----------+-------------+
1 row in set (0.00 sec)

14:23 mysql-5.1.20-beta-log (monitorng) [6]>

14:23 mysql-5.1.20-beta-log (monitorng) [6]> select distinct(m_date) from m_msg;
+------------+
| m_date     |
+------------+
| 2007-02-11 |
...
+------------+
178 rows in set (0.24 sec)

14:23 mysql-5.1.20-beta-log (monitorng) [7]>

14:23 mysql-5.1.20-beta-log (monitorng) [7]> select count(distinct(m_date)) from m_msg;
+-------------------------+
| count(distinct(m_date)) |
+-------------------------+
|                     178 |
+-------------------------+
1 row in set (3 min 44.74 sec)

14:27 mysql-5.1.20-beta-log (monitorng) [8]>

Suggested fix:
I'm not sure, unfortunately -- and it may not even be a bug...but I can't find anyone who thinks it is normal behavior.

Thank you for a problem report. Access pathes you reported verified just as described with a simple test case:

create table tdis(id int auto_increment primary key, d date);
insert into tdis(d) values (now());
insert into tdis(d) select date_sub(now(), interval 1000*rand() day) from tdis;
...
insert into tdis(d) select date_sub(now(), interval 1000*rand() day) from tdis;
alter table tdis add key i_date(d);

You have to insert many thousands of rows that way, with <= 1000 distinct d values.

"index" access path (full index scan) is used both for InnoDB and MyISAM tables:

mysql> explain select distinct(d) from tdis;
+----+-------------+-------+-------+---------------+--------+---------+------+--
----+--------------------------+
| id | select_type | table | type  | possible_keys | key    | key_len | ref  | r
ows | Extra                    |
+----+-------------+-------+-------+---------------+--------+---------+------+--
----+--------------------------+
|  1 | SIMPLE      | tdis  | range | NULL          | i_date | 4       | NULL | 1
025 | Using index for group-by |
+----+-------------+-------+-------+---------------+--------+---------+------+--
----+--------------------------+
1 row in set (0.00 sec)

mysql> explain select count(distinct d) from tdis;
+----+-------------+-------+-------+---------------+--------+---------+------+--
----+-------------+
| id | select_type | table | type  | possible_keys | key    | key_len | ref  | r
ows | Extra       |
+----+-------------+-------+-------+---------------+--------+---------+------+--
----+-------------+
|  1 | SIMPLE      | tdis  | index | NULL          | i_date | 4       | NULL | 2
048 | Using index |
+----+-------------+-------+-------+---------------+--------+---------+------+--
----+-------------+
1 row in set (0.00 sec)

This is at least a reasonable feature request, if special optimization for COUNT(DISTINCT col) is not there by design.

From any point of view this is a feature request. This optimization won't be implemented in 5.0/5.1 in any case.