Bug #11812 count distinct performance
Submitted: 8 Jul 2005 12:44
Modified: 27 Apr 2006 14:23
Reporter: heidi savin
Status: No Feedback
Impact on me: None
Category: MySQL Server
Severity: S2 (Serious)
Version: 5.0
OS: Linux (Linux Fedora Core 4 smp)
Assigned to: Assigned Account
CPU Architecture: Any

[8 Jul 2005 12:44] heidi savin
Description:
I used a table that has 4 columns, all of which are integer(10). There are about 13 million records in the table. I'm trying to determine the number of rows with distinct values for col1. The table is indexed by col1.

If I do select count(distinct col1) from table, the query takes between 50 minutes and an hour to run.

I also tried a workaround, which is select sum(one) from (select (1) as one from table group by col1) as f2. This query produced the same result as the first one, but took only 30 seconds to run!!!

This is all under the MyISAM engine.
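
For reference, here are the two queries side by side as runnable statements (the real table name was not posted, so mytable below is a placeholder):

-- slow: roughly 50 minutes to an hour on ~13 million rows
select count(distinct col1) from mytable;

-- workaround: roughly 30 seconds on the same data
select sum(one) from (select 1 as one from mytable group by col1) as f2;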

I posted a discussion topic about this and there is some excellent analysis on it by Jay Pipes. The topic is at http://forums.mysql.com/read.php?24,32994,32994#msg-32994

How to repeat:
Create a table with 4 columns, all integer(10), and over a million rows. Index the table by col1.

Run select count(distinct col1) from table and select sum(one) from (select (1) as one from table group by col1) as f2, and compare the run times.
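
A sketch of one way to build such a table (names and sizes are illustrative, not from the report): seed one row, then keep doubling with INSERT ... SELECT; rand() keeps most col1 values distinct, similar to the reported data.

create table big (col1 int(10), col2 int(10), col3 int(10), col4 int(10), key (col1)) engine=MyISAM;
insert into big values (floor(rand()*100000000), 0, 0, 0);
-- each run of this statement doubles the row count; ~24 runs exceed 13 million rows
insert into big select floor(rand()*100000000), 0, 0, 0 from big;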

Suggested fix:
Replace the internal function that implements count(distinct) with the equivalent of select sum(one) from (select (1) as one from table group by col1) as f2.
[8 Jul 2005 19:22] Geert Vanderkelen
Hi,

I'm trying to reproduce this with 8 million rows and 8 distinct values in col1. I get approximately the same results for both queries.
How many distinct values do you have?

Geert
[25 Jul 2005 10:43] heidi savin
I have about 9 million distinct values out of 13 million records total.
[25 Jul 2005 13:16] Valeriy Kravchuk
So, Heidi, you have 9 million distinct values among 13 million rows. OK.

(In the forum discussion you pointed to, col1 is a primary key, so there is no need for count(distinct col1) at all - you may just count(*) and get that perfect "Select tables optimized away" plan. But the plans discussed in that topic look really strange.)

I tried to reproduce your case (with non-unique col1) on 4.1.12-nt with a small number of rows and got the following results:

CREATE TABLE `t11812` (
  `c1` int(10) default NULL,
  `c2` int(10) default NULL,
  `c3` int(10) default NULL,
  `c4` int(10) default NULL,
  KEY `c1_ix` (`c1`)
) ENGINE=MyISAM;

insert into t11812 (c1) values(1);
insert into t11812 (c1) values(2);
insert into t11812 (c1) values(3);
insert into t11812 (c1) values(4);
insert into t11812 (c1) values(5);
insert into t11812 (c1) values(6);
insert into t11812 (c1) values(7);
insert into t11812 (c1) values(8);
insert into t11812 (c1) values(9);

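-- the three INSERT ... SELECTs below add three more c1=9 rows and one more
-- c1=1 row, giving 13 rows total with 9 distinct c1 values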
insert into t11812 select * from t11812 where c1 > 8;
insert into t11812 select * from t11812 where c1 > 8;
insert into t11812 select * from t11812 where c1 = 1;

mysql> select count(*) from t11812;
+----------+
| count(*) |
+----------+
|       13 |
+----------+
1 row in set (0.00 sec)

So, I have the same ratio of distinct values to total rows (9 distinct values among 13 rows).

analyze table t11812;

mysql> explain select count(distinct c1) from t11812;
+----+-------------+--------+-------+---------------+-------+---------+------+------+-------------+
| id | select_type | table  | type  | possible_keys | key   | key_len | ref  | rows | Extra       |
+----+-------------+--------+-------+---------------+-------+---------+------+------+-------------+
|  1 | SIMPLE      | t11812 | index | NULL          | c1_ix |       5 | NULL |   13 | Using index |
+----+-------------+--------+-------+---------------+-------+---------+------+------+-------------+
1 row in set (0.01 sec)

mysql> explain select sum(one) from (select 1 as one from t11812 group by c1) as f2;
+----+-------------+------------+-------+---------------+-------+---------+------+------+-------------+
| id | select_type | table      | type  | possible_keys | key   | key_len | ref  | rows | Extra       |
+----+-------------+------------+-------+---------------+-------+---------+------+------+-------------+
|  1 | PRIMARY     | <derived2> | ALL   | NULL          | NULL  |    NULL | NULL |    9 |             |
|  2 | DERIVED     | t11812     | index | NULL          | c1_ix |       5 | NULL |   13 | Using index |
+----+-------------+------------+-------+---------------+-------+---------+------+------+-------------+
2 rows in set (0.00 sec)

Both queries use the index on c1, and the second one just has one small additional step. So I do not understand how the optimizer could behave differently with 13 million rows, or how the second query could be so much faster than the first one.

Would you please give the results of EXPLAIN for your real queries? Copy and paste them, like I did. It will be hard for us to verify and fix this possible bug without a reproducible test case.
[25 Jul 2005 14:34] heidi savin
There may have been a misunderstanding in the discussion: my col1 was not a primary key, so I did have to use count(distinct).

We've done some cleanup on our tables and the number of rows has decreased significantly. I ran the queries again, and select count(distinct) did not take nearly as long as it once did; however, it is still about half the speed of my workaround.

I think the odd results have a lot to do with the number of records in the table, which is why you are not getting the same results I was. You might want to rerun the experiment with 13 million rows and see if you get the same results. I don't think the ratio matters as much as the size of the table.

I'm leaving for vacation today, but when I get back next week I can try to make a table with 13 million rows and tell you what results I get. I invite you to try the same thing.

The queries I did today: (I cannot copy and paste the results due to some internal issues)

explain select sum(one) from (select 1 as one from mytable group by col1) as fp;
1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 623000
2 | DERIVED | mytable | index | NULL | index_col1 | 5 | NULL | 5100000 | Using index |

explain select count(distinct col1) from mytable;
1 | SIMPLE | mytable | index | NULL | index_col1 | 5 | NULL | 5100000 | Using index |

select sum(one) from (select 1 as one from mytable group by col1) as fp;
sum(one)
623000 (6.60 sec)

select count(distinct col1) from mytable;
count(distinct col1)
623000 (13.68 sec)
[19 Aug 2005 11:06] Konstantin Osipov
4.1 should not be considered in this discussion, as in 5.0 the implementation of count(distinct) was substantially improved.
What results do you get in 5.0?
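
To compare on 5.0, something like this would do (a sketch; mytable and col1 are placeholders for the real names):

select version();  -- confirm you are on a 5.0.x server before timing
select count(distinct col1) from mytable;
select sum(one) from (select 1 as one from mytable group by col1) as f2;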
[22 Aug 2005 16:47] heidi savin
I haven't had a chance to test it on the 5.0 version. On your website, it says that 4.1 is the preferred version and 5.0 is for testing and previewing new features. When is the 5.0 version going to be out of beta and the preferred version? How did you improve the count distinct method?
[23 Dec 2005 15:44] MySQL Verification Team
Could you please test with the current 5.0.17 release version?

Thanks in advance.
[24 Jan 2006 0:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".
[24 Mar 2006 19:23] [ name withheld ]
I have a table with 40 columns and 721108 rows; there are 58 distinct values in col1, and col1 is indexed.

select count(distinct(col1)) from table; (1 min 1.10 sec)

select sum(one) from (select (1) as one from table group by col1) as foo; (11.74 sec)

So I ended up using "select distinct(col1) from table", which takes 0.21 sec, and then getting the row count from there.

I am using 5.0.18
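
Incidentally, the row count can also be taken server-side over the distinct list, which is equivalent to the sum(one) workaround (using the same placeholder table name as above):

select count(*) from (select distinct col1 from table) as d;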
[24 Mar 2006 19:49] Konstantin Osipov
Could you please submit a reproducible test case for the problem?
An SQL stored procedure or a script would be enough.
Thank you.
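
A skeleton for such a script might look like this (purely illustrative: the procedure name and row count are assumptions, reusing the t11812 table defined earlier):

delimiter //
create procedure fill_t11812(in n int)
begin
  declare i int default 0;
  -- insert n rows with mostly-distinct c1 values
  while i < n do
    insert into t11812 (c1) values (floor(rand()*1000000));
    set i = i + 1;
  end while;
end //
delimiter ;

call fill_t11812(13000000);

Row-by-row inserts are slow at this scale; the INSERT ... SELECT doubling shown earlier in this report is much faster for bulk loading.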
[27 Mar 2006 14:23] MySQL Verification Team
Could you please provide the test case Konstantin asked for?
Thanks in advance.
[27 Apr 2006 23:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".