Bug #8821 join to subquery without distinct causes bad COUNT
Submitted: 26 Feb 2005 12:00 Modified: 15 Jun 2005 9:12
Reporter: Are you mortal Then prepare to die. Email Updates:
Status: Not a Bug Impact on me:
None 
Category:MySQL Server: Optimizer Severity:S2 (Serious)
Version:Ver 14.7 Distrib 4.1.9, for dec-osf5.1b OS:alphaev67
Assigned to: Oleksandr Byelkin CPU Architecture:Any

[26 Feb 2005 12:00] Are you mortal Then prepare to die.
Description:
Using an INNER JOIN on a subquery gives innacurate COUNT. 

Changing the INNER JOIN to a WHERE IN fixes the problem (but is much slower).

Using a DISTINCT with the INNER JOIN is just as quick, and fixes the COUNT.

How to repeat:
Here is a little example,

DROP   TABLE IF EXISTS bleah;
CREATE TABLE           bleah (
  CHAIN_ID INT(11) NOT NULL,
  PDB      CHAR(4) NOT NULL,
  #
  PRIMARY KEY (CHAIN_ID),
  INDEX       (PDB)
);

NSERT INTO bleah VALUES (1,'WABA');
INSERT INTO bleah VALUES (2,'WABA');
INSERT INTO bleah VALUES (3,'WABA');
INSERT INTO bleah VALUES (4,'HABA');
INSERT INTO bleah VALUES (12,'HABA');
INSERT INTO bleah VALUES (13,'HABA');
INSERT INTO bleah VALUES (14,'BNGO');

INNER JOIN version gives incorrect COUNT...

SELECT COUNT(*), COUNT(DISTINCT CHAIN_ID) FROM bleah
INNER JOIN
(
  SELECT PDB FROM bleah
)
AS virtualTable
USING (PDB)
;

+----------+--------------------------+
| COUNT(*) | COUNT(DISTINCT CHAIN_ID) |
+----------+--------------------------+
|       19 |                        7 |
+----------+--------------------------+
1 row in set (0.04 sec)

WHERE IN version gives correct COUNT...
But is significantly slower

SELECT COUNT(*), COUNT(DISTINCT CHAIN_ID) FROM bleah
WHERE PDB IN 
(
  SELECT PDB FROM bleah
)
;

+----------+--------------------------+
| COUNT(*) | COUNT(DISTINCT CHAIN_ID) |
+----------+--------------------------+
|        7 |                        7 |
+----------+--------------------------+
1 row in set (0.09 sec)

INNER JOIN version with DISTINCT gives correct COUNT...
And is just as fast

SELECT COUNT(*), COUNT(DISTINCT CHAIN_ID) FROM bleah
INNER JOIN
(
  SELECT DISTINCT PDB FROM bleah
)
AS virtualTable
USING (PDB)
;

+----------+--------------------------+
| COUNT(*) | COUNT(DISTINCT CHAIN_ID) |
+----------+--------------------------+
|        7 |                        7 |
+----------+--------------------------+
1 row in set (0.02 sec)

Suggested fix:

Not sure why the INNER JOIN is faster than the WHERE IN (on a big table this has a big difference). Not sure if the DISTINCT is doing something usefull.

Perhaps this is a bug in my thinking, but it would be good if this example were put in the documentaion for subqueries, to make it clear what is going on. People who use subqueries may not notice the counts going astray. 

I still can't work out why not making it distinct and using the inner join has this effect. Could it be because the table is linking against itself?
[26 Feb 2005 12:19] MySQL Verification Team
Verified with 4.1.11-debug-log
[15 Jun 2005 9:10] Oleksandr Byelkin
Thank you for bugreport!
It is hot a bug, because 'SELECT PDB FROM bleah;' and 'SELECT DISTINCT PDB FROM bleah;' gives different results:
+ SELECT PDB FROM bleah;
+ PDB
+ BNGO
+ HABA
+ HABA
+ HABA
+ WABA
+ WABA
+ WABA
+ SELECT DISTINCT PDB FROM bleah;
+ PDB
+ BNGO
+ HABA
+ WABA

IN version of that query is equal to query with DISTINCT.
Just replace count in select list with * and you will see differ:
+ SELECT * FROM bleah
+ INNER JOIN
+ (
+ SELECT PDB FROM bleah
+ )
+ AS virtualTable
+ USING (PDB);
+ CHAIN_ID      PDB     PDB
+ 1     WABA    WABA
+ 1     WABA    WABA
+ 1     WABA    WABA
+ 2     WABA    WABA
+ 2     WABA    WABA
+ 2     WABA    WABA
+ 3     WABA    WABA
+ 3     WABA    WABA
+ 3     WABA    WABA
+ 4     HABA    HABA
+ 4     HABA    HABA
+ 4     HABA    HABA
+ 12    HABA    HABA
+ 12    HABA    HABA
+ 12    HABA    HABA
+ 13    HABA    HABA
+ 13    HABA    HABA
+ 13    HABA    HABA
+ 14    BNGO    BNGO

+ SELECT * FROM bleah
+ INNER JOIN
+ (
+ SELECT DISTINCT PDB FROM bleah
+ )
+ AS virtualTable
+ USING (PDB);
+ CHAIN_ID      PDB     PDB
+ 1     WABA    WABA
+ 2     WABA    WABA
+ 3     WABA    WABA
+ 4     HABA    HABA
+ 12    HABA    HABA
+ 13    HABA    HABA
+ 14    BNGO    BNGO

i.e. query without DISTINCT in subquery is not equivalent to query with IN subquery and should provide differ results.