Bug #1346 Finding Duplicates with GROUP BY
Submitted: 19 Sep 2003 5:03 Modified: 19 Sep 2003 7:45
Reporter: [ name withheld ] Email Updates:
Status: Not a Bug Impact on me:
None 
Category:MySQL Server: Command-line Clients Severity:S3 (Non-critical)
Version:4.0.12 OS:Windows (Windows 2000)
Assigned to: CPU Architecture:Any

[19 Sep 2003 5:03] [ name withheld ]
Description:
Ok.  I looked through google and saw a couple examples of how to find the duplicates in a table that looked like this...

SELECT (id) as n from emp_table GROUP BY id
HAVING n > 1;

Ok.  So I tried it.  I have a table of 13,128,178 rows (not a small example).  I ran a DISTINCT on the key field and the result was 12,787,768 so that tells me there are 340,410 duplicate rows.

Ok.  I run the above command.  I get 272,626 as an answer.  What?  That doesn't match.  For grins I run the query again with.

HAVING n > 2;

This time I get 67,756.
Again HAVING n > 3 I get 30.

Add them all up I get 340,413.  Now they match!  What seems to be happening is that the HAVING n > 1 is really doing n == 2 not GREATER than 1 and so on.

I would say this looks like a bug or I am crazy either of which could be true.

How to repeat:
Load a table with a few duplicates and triplicates and run the group by query.

Suggested fix:
What seems to be happening is that the HAVING n > 1 is really doing n == 2 not GREATER than 1 and so on.
[19 Sep 2003 7:45] Indrek Siitan
The correct way to look for rows that have duplicates is:
SELECT id, count(*) as n from emp_table GROUP BY id HAVING n > 1;
[19 Sep 2003 9:04] [ name withheld ]
SELECT id, count(*) as n from emp_table GROUP BY id HAVING n > 1;

I tried that way also.  Same problem.