Bug #50539 GROUP BY + COUNT(DISTINCT ) on an int column shows results off by one
Submitted: 22 Jan 2010 10:36 Modified: 7 Mar 2010 1:12
Reporter: Benjamin Schuster-Böckler Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Optimizer Severity:S2 (Serious)
Version:5.5.0, 5.5.2-m2, 6.0.14-bzr OS:Any (64bit)
Assigned to: Evgeny Potemkin CPU Architecture:Any
Tags: regression
Triage: Triaged: D2 (Serious)

[22 Jan 2010 10:36] Benjamin Schuster-Böckler
Description:
The following query

SELECT chromosome, COUNT(DISTINCT position)
FROM melanoma_snps
GROUP BY chromosome

shows the "chromosome" column off by 1 (it should range from 1 to 23):

2	277
3	471
4	404
5	438
6	209
7	335
8	339
9	357
10	205
11	179
12	144
13	274
14	232
15	113
16	90
17	114
18	111
19	139
20	87
21	112
22	81
23	74
23	199

To verify:

SELECT COUNT(DISTINCT position)
FROM melanoma_snps
WHERE chromosome = 1

returns:

277

I tried FLUSHing all tables as well as re-creating table in new clean database. Still persists.

How to repeat:
Create a new database and the following table:

CREATE TABLE `melanoma_snps` (
  `chromosome` int(11) NOT NULL DEFAULT '0',
  `position` int(10) unsigned NOT NULL DEFAULT '0',
  `reference` char(1) DEFAULT NULL,
  `mutant` char(1) NOT NULL DEFAULT '',
  `zygosity` enum('het','hom') DEFAULT NULL,
  `effect` enum('Intergenic','Intronic','Missense','Noncoding_RNA','Nonsense','Silent','Splice','UTR') DEFAULT NULL,
  `validation` enum('"REAL, STATUS UNKNOWN"','COSMIC','N/A','SOMATIC','UNCONFIRMED') DEFAULT NULL,
  PRIMARY KEY (`chromosome`,`position`,`mutant`),
  KEY `k1` (`mutant`),
  KEY `k2` (`reference`),
  KEY `k3` (`zygosity`),
  KEY `k4` (`position`),
  KEY `k5` (`chromosome`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1

Download this file to fill the table (33K): http://www.dagams.de/melanoma_snps.red.gz

Perform the query as above.
[22 Jan 2010 11:18] Valeriy Kravchuk
Verified with recent trunk tree from bzr:

openxs@suse:/home2/openxs/dbs/trunk> bin/mysql -uroot test
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 2
Server version: 5.5.2-m2-debug Source distribution

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> load data infile '~/melanoma_snps.red' into table  `melanoma_snps` ;
Query OK, 4985 rows affected (0.31 sec)
Records: 4985  Deleted: 0  Skipped: 0  Warnings: 0

mysql> SELECT COUNT(DISTINCT position)
    -> FROM melanoma_snps
    -> WHERE chromosome = 1;
+--------------------------+
| COUNT(DISTINCT position) |
+--------------------------+
|                      278 |
+--------------------------+
1 row in set (0.05 sec)

mysql> SELECT chromosome, COUNT(DISTINCT position)
    -> FROM melanoma_snps
    -> GROUP BY chromosome
    -> ;
+------------+--------------------------+
| chromosome | COUNT(DISTINCT position) |
+------------+--------------------------+
|          2 |                      278 |
|          3 |                      471 |
|          4 |                      404 |
|          5 |                      438 |
|          6 |                      209 |
|          7 |                      335 |
|          8 |                      339 |
|          9 |                      357 |
|         10 |                      205 |
|         11 |                      179 |
|         12 |                      144 |
|         13 |                      274 |
|         14 |                      232 |
|         15 |                      113 |
|         16 |                       90 |
|         17 |                      114 |
|         18 |                      111 |
|         19 |                      139 |
|         20 |                       87 |
|         21 |                      112 |
|         22 |                       81 |
|         23 |                       74 |
|         23 |                      199 |
+------------+--------------------------+
23 rows in set (0.11 sec)

This is a regression bug, as 5.0.90 shows:

openxs@suse:/home2/openxs/dbs/5.0> bin/mysql -uroot test
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 1
Server version: 5.0.90-debug Source distribution

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> CREATE TABLE `melanoma_snps` (   `chromosome` int(11) NOT NULL DEFAULT '0',   `position` int(10) unsigned NOT NULL DEFAULT '0',   `reference` char(1) DEFAULT NULL,   `mutant` char(1) NOT NULL DEFAULT '',   `zygosity` enum('het','hom') DEFAULT NULL,   `effect` enum('Intergenic','Intronic','Missense','Noncoding_RNA','Nonsense','Silent','Splice','UTR') DEFAULT NULL,   `validation` enum('"REAL, STATUS UNKNOWN"','COSMIC','N/A','SOMATIC','UNCONFIRMED') DEFAULT NULL,   PRIMARY KEY (`chromosome`,`position`,`mutant`),   KEY `k1` (`mutant`),   KEY `k2` (`reference`),   KEY `k3` (`zygosity`),   KEY `k4` (`position`),   KEY `k5` (`chromosome`) ) ENGINE=MyISAM DEFAULT CHARSET=latin1;
Query OK, 0 rows affected (0.08 sec)

mysql> load data infile '~/melanoma_snps.red' into table  `melanoma_snps` ;
Query OK, 4985 rows affected (0.22 sec)
Records: 4985  Deleted: 0  Skipped: 0  Warnings: 0

mysql> SELECT chromosome, COUNT(DISTINCT position) FROM melanoma_snps GROUP BY chromosome;
+------------+--------------------------+
| chromosome | COUNT(DISTINCT position) |
+------------+--------------------------+
|          1 |                      278 |
|          2 |                      471 |
|          3 |                      404 |
|          4 |                      438 |
|          5 |                      209 |
|          6 |                      335 |
|          7 |                      339 |
|          8 |                      357 |
|          9 |                      205 |
|         10 |                      179 |
|         11 |                      144 |
|         12 |                      274 |
|         13 |                      232 |
|         14 |                      113 |
|         15 |                       90 |
|         16 |                      114 |
|         17 |                      111 |
|         18 |                      139 |
|         19 |                       87 |
|         20 |                      112 |
|         21 |                       81 |
|         22 |                       74 |
|         23 |                      199 |
+------------+--------------------------+
23 rows in set (0.04 sec)
[22 Jan 2010 11:19] Valeriy Kravchuk
5.1.44 is also NOT affected.
[27 Jan 2010 12:16] Manyi Lu
Valeriy, could you please check whether this bug can be reproduced in 6.0-codebase-bugfixing or the trunk?

Thanks,
Manyi
[27 Jan 2010 13:30] Valeriy Kravchuk
Bug is verified on trunk tree. Do you mean next-mr? 

Cloning mysql-6.0-codebase-bugfixing now...
[27 Jan 2010 14:27] Valeriy Kravchuk
mysql-6.0-codebase is also affected:

77-52-24-143:6.0-codebase openxs$ bin/mysql -uroot test
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 2
Server version: 6.0.14-alpha-debug Source distribution

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> load data infile '~/Downloads/melanoma_snps.red' into table  `melanoma_snps` ;
Query OK, 4985 rows affected (0.18 sec)
Records: 4985  Deleted: 0  Skipped: 0  Warnings: 0

mysql> SELECT COUNT(DISTINCT position)
    -> FROM melanoma_snps
    -> WHERE chromosome = 1;
+--------------------------+
| COUNT(DISTINCT position) |
+--------------------------+
|                      278 |
+--------------------------+
1 row in set (0.00 sec)

mysql> SELECT chromosome, COUNT(DISTINCT position)
    -> FROM melanoma_snps
    -> GROUP BY chromosome;
+------------+--------------------------+
| chromosome | COUNT(DISTINCT position) |
+------------+--------------------------+
|          2 |                      278 |
|          3 |                      471 |
|          4 |                      404 |
|          5 |                      438 |
|          6 |                      209 |
|          7 |                      335 |
|          8 |                      339 |
|          9 |                      357 |
|         10 |                      205 |
|         11 |                      179 |
|         12 |                      144 |
|         13 |                      274 |
|         14 |                      232 |
|         15 |                      113 |
|         16 |                       90 |
|         17 |                      114 |
|         18 |                      111 |
|         19 |                      139 |
|         20 |                       87 |
|         21 |                      112 |
|         22 |                       81 |
|         23 |                       74 |
|         23 |                      199 |
+------------+--------------------------+
23 rows in set (0.09 sec)
[8 Feb 2010 16:04] Patrick Crews
From initial tests against this bug, not all of the int column values are being displayed at +1 to true value:
mysql> insert into melanoma_snps values (1,2,'a','a','het','','');
Query OK, 1 row affected, 2 warnings (0.06 sec)

mysql> insert into melanoma_snps values (2,2,'a','a','het','','');
Query OK, 1 row affected, 2 warnings (0.00 sec)

mysql> insert into melanoma_snps values (3,2,'a','a','het','','');
Query OK, 1 row affected, 2 warnings (0.00 sec)

mysql> insert into melanoma_snps values (2,2,'a','a','het','','');
ERROR 1062 (23000): Duplicate entry '2-2-a' for key 'PRIMARY'
mysql> insert into melanoma_snps values (2,3,'a','a','het','','');
Query OK, 1 row affected, 2 warnings (0.00 sec)

mysql> insert into melanoma_snps values (1,3,'a','a','het','','');
Query OK, 1 row affected, 2 warnings (0.00 sec)

mysql> insert into melanoma_snps values (1,1,'a','a','het','','');
Query OK, 1 row affected, 2 warnings (0.00 sec)

mysql> insert into melanoma_snps values (1,5,'a','a','het','','');
Query OK, 1 row affected, 2 warnings (0.00 sec)

mysql> SELECT chromosome, COUNT(DISTINCT position)
    -> FROM melanoma_snps
    -> GROUP BY chromosome
    -> ;
+------------+--------------------------+
| chromosome | COUNT(DISTINCT position) |
+------------+--------------------------+
|          2 |                        4 |
|          3 |                        2 |
|          3 |                        1 |
+------------+--------------------------+
3 rows in set (0.04 sec)

Note that the row for chromosome 3 (last row) shows the correct chromosome value.  Adding additional rows after this did not cause the int value to increment.
[8 Feb 2010 16:09] Patrick Crews
NOTE:  The last value of chromosome is not incremented.  We can see that the value 23 is repeated 2x in the original bug report.  So, it seems that everything but MAX(int_column) is incremented while MAX(int_column remains as it should be)

mysql> insert into melanoma_snps values (1,1,'a','a','het','','');Query OK, 1 row affected, 2 warnings (0.00 sec)

mysql> insert into melanoma_snps values (1,2,'a','a','het','','');
Query OK, 1 row affected, 2 warnings (0.00 sec)

mysql> insert into melanoma_snps values (1,3,'a','a','het','','');
Query OK, 1 row affected, 2 warnings (0.00 sec)

mysql> insert into melanoma_snps values (1,4,'a','a','het','','');
Query OK, 1 row affected, 2 warnings (0.00 sec)

mysql> insert into melanoma_snps values (2,1,'a','a','het','','');
Query OK, 1 row affected, 2 warnings (0.00 sec)

mysql> SELECT chromosome, COUNT(DISTINCT position) FROM melanoma_snps GROUP BY chromosome;
+------------+--------------------------+
| chromosome | COUNT(DISTINCT position) |
+------------+--------------------------+
|          2 |                        4 |
|          2 |                        1 |
+------------+--------------------------+
2 rows in set (0.00 sec)

mysql> insert into melanoma_snps values (3,1,'a','a','het','','');Query OK, 1 row affected, 2 warnings (0.00 sec)

mysql> SELECT chromosome, COUNT(DISTINCT position) FROM melanoma_snps GROUP BY chromosome;
+------------+--------------------------+
| chromosome | COUNT(DISTINCT position) |
+------------+--------------------------+
|          2 |                        4 |
|          3 |                        1 |
|          3 |                        1 |
+------------+--------------------------+
3 rows in set (0.00 sec)

mysql> insert into melanoma_snps values (4,1,'a','a','het','','');Query OK, 1 row affected, 2 warnings (0.00 sec)

mysql> SELECT chromosome, COUNT(DISTINCT position) FROM melanoma_snps GROUP BY chromosome;
+------------+--------------------------+
| chromosome | COUNT(DISTINCT position) |
+------------+--------------------------+
|          2 |                        4 |
|          3 |                        1 |
|          4 |                        1 |
|          4 |                        1 |
+------------+--------------------------+
4 rows in set (0.00 sec)
[8 Feb 2010 18:44] Patrick Crews
This bug also needs a multi-part key (see the PRIMARY KEY statement) on the table, and the SELECT statement needs to have an aggregate + DISTINCT (without DISTINCT appears fine) on the 2nd part of the key with a GROUP BY on the first part of the key.
[10 Feb 2010 13:43] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/99821

2991 Evgeny Potemkin	2010-02-10
      Bug#50539: Wrong result when loose index scan is used for an aggregate
                 function with distinct.
      Loose index scan is used to find MIN/MAX values using appropriate index and
      thus allow to avoid grouping. For each found row it updates non-aggregated
      fields with values from row with found MIN/MAX value.
      WL#3220 allows to use loose index scan to speed up queries with aggregate
      functions with distinct. In such cases it doesn't calculate the result of
      aggregate functions but simply jumps to a next distinct value. This scenario
      wasn't taken into account by the QUICK_GROUP_MIN_MAX_SELECT::get_next and it
      was updating non aggregated fields to values from current row which in this
      case might belong to the next group.
      
      The QUICK_GROUP_MIN_MAX_SELECT::get_next now updates non-aggregated fields
      only if it's not used to speedup aggregate function with distinct.
     @ mysql-test/r/group_min_max.result
        Added a test case for the bug#50539.
     @ mysql-test/t/group_min_max.test
        Added a test case for the bug#50539.
     @ sql/opt_range.cc
        Bug#50539: Wrong result when loose index scan is used for an aggregate
        function with distinct.
        The QUICK_GROUP_MIN_MAX_SELECT::get_next now updates non-aggregated fields
        only if it's not used to speedup aggregate function with distinct.
[11 Feb 2010 16:34] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/99976

2991 Evgeny Potemkin	2010-02-11
      Bug#50539: Wrong result when loose index scan is used for an aggregate
                 function with distinct.
      Loose index scan is used to find MIN/MAX values using appropriate index and
      thus allow to avoid grouping. For each found row it updates non-aggregated
      fields with values from row with found MIN/MAX value.
      Without loose index scan non-aggregated fields are copied by end_send_group
      function. With loose index scan there is no need in end_send_group and
      end_send is used instead. Non-aggregated fields still need to be copied and
      this was wrongly implemented in QUICK_GROUP_MIN_MAX_SELECT::get_next.
      WL#3220 added a case when loose index scan can be used with end_send_group to
      optimize calculation of aggregate functions with distinct. In this case
      the row found by QUICK_GROUP_MIN_MAX_SELECT::get_next might belong to a next
      group and copying it will produce wrong result.
      
      Update of non-aggregated fields is moved to the end_send function from
      QUICK_GROUP_MIN_MAX_SELECT::get_next.
     @ mysql-test/r/group_min_max.result
        Added a test case for the bug#50539.
     @ mysql-test/t/group_min_max.test
        Added a test case for the bug#50539.
     @ sql/opt_range.cc
        Bug#50539: Wrong result when loose index scan is used for an aggregate
        function with distinct.
        Update of non-aggregated fields is moved to the end_send function from
        QUICK_GROUP_MIN_MAX_SELECT::get_next.
     @ sql/sql_select.cc
        Bug#50539: Wrong result when loose index scan is used for an aggregate
                   function with distinct.
        Update of non-aggregated fields is moved to the end_send function from
        QUICK_GROUP_MIN_MAX_SELECT::get_next.
[11 Feb 2010 17:21] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/99986

2996 Evgeny Potemkin	2010-02-11 [merge]
      Auto-merged fix for the bug#50539.
[12 Feb 2010 8:52] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/100081

2943 Evgeny Potemkin	2010-02-12
      Bug#50539: Wrong result when loose index scan is used for an aggregate
                 function with distinct.
      Loose index scan is used to find MIN/MAX values using appropriate index and
      thus allow to avoid grouping. For each found row it updates non-aggregated
      fields with values from row with found MIN/MAX value.
      Without loose index scan non-aggregated fields are copied by end_send_group
      function. With loose index scan there is no need in end_send_group and
      end_send is used instead. Non-aggregated fields still need to be copied and
      this was wrongly implemented in QUICK_GROUP_MIN_MAX_SELECT::get_next.
      WL#3220 added a case when loose index scan can be used with end_send_group to
      optimize calculation of aggregate functions with distinct. In this case
      the row found by QUICK_GROUP_MIN_MAX_SELECT::get_next might belong to a next
      group and copying it will produce wrong result.
      
      Update of non-aggregated fields is moved to the end_send function from
      QUICK_GROUP_MIN_MAX_SELECT::get_next.
     @ mysql-test/r/group_min_max.result
        Added a test case for the bug#50539.
     @ mysql-test/t/group_min_max.test
        Added a test case for the bug#50539.
     @ sql/opt_range.cc
        Bug#50539: Wrong result when loose index scan is used for an aggregate
        function with distinct.
        Update of non-aggregated fields is moved to the end_send function from
        QUICK_GROUP_MIN_MAX_SELECT::get_next.
     @ sql/sql_select.cc
        Bug#50539: Wrong result when loose index scan is used for an aggregate
        function with distinct.
        Update of non-aggregated fields is moved to the end_send function from
        QUICK_GROUP_MIN_MAX_SELECT::get_next.
[12 Feb 2010 9:42] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/100086

2999 Evgeny Potemkin	2010-02-12 [merge]
      Auto-merged fix for the bug#50539.
[12 Feb 2010 9:50] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/100090

3883 Evgeny Potemkin	2010-02-12 [merge]
      Auto-merged fix for bug#50539.
[12 Feb 2010 17:40] Bugs System
Pushed into 5.5.2-m2 (revid:joerg@mysql.com-20100212164100-jnurxdw5z88m472s) (version source revid:joerg@mysql.com-20100212164100-jnurxdw5z88m472s) (merge vers: 5.5.2-m2) (pib:16)
[13 Feb 2010 8:36] Bugs System
Pushed into 6.0.14-alpha (revid:alik@sun.com-20100213083436-9pesg4h55w1mekxc) (version source revid:alik@sun.com-20100212100039-eh6atbvijcm71eyb) (merge vers: 6.0.14-alpha) (pib:16)
[13 Feb 2010 8:39] Bugs System
Pushed into mysql-next-mr (revid:alik@sun.com-20100213083327-cee4ao3jpg33eggv) (version source revid:alik@sun.com-20100212095912-k3fklaqrxzzls9cd) (pib:16)
[25 Feb 2010 0:36] Paul Dubois
Noted in 5.5.2, 6.0.14 changelogs.

Use of loose index scan optimization for an aggregate function with
DISTINCT (for example, COUNT(DISTINCT)) could produce incorrect
results. 

Setting report to Need Merge pending push of Celosia into release tree.
[6 Mar 2010 11:06] Bugs System
Pushed into 5.5.3-m3 (revid:alik@sun.com-20100306103849-hha31z2enhh7jwt3) (version source revid:vvaintroub@mysql.com-20100213160132-nx1vlocxuta76txh) (merge vers: 5.5.99-m3) (pib:16)
[7 Mar 2010 1:12] Paul Dubois
Already fixed in earlier 5.5.x.