MySQL Bugs: #45640: optimizer bug produces wrong results

Bug #45640	optimizer bug produces wrong results
Submitted:	21 Jun 2009 14:35	Modified:	20 Jun 2010 22:28
Reporter:	Matthew Lord	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Optimizer	Severity:	S1 (Critical)
Version:	5.0.84-bzr, 5.1.35, 5.1.36-bzr, 5.4.4-bzr	OS:	Any
Assigned to:	Gleb Shchepa	CPU Architecture:	Any

Description:
An optimizer bug appears to be affecting the results of a GROUP BY based on
the count() call used.

How to repeat:
drop table if exists DIM_test;

create table DIM_test (id int unsigned auto_increment primary key, name varchar(128), unique (name)) engine=myisam;

insert into DIM_test VALUES
(NULL, 'abcdefgh'),
(NULL, 'abcdefg'),
(NULL, 'abcdefgk'),
(NULL, 'abcdefgl'),
(NULL,'abcdefgm'),
(NULL, 'abcdefgn');
drop table if exists test;

create table test (name varchar(128), val1 int unsigned, val2 int unsigned, val3 int unsigned, index(name(16))) engine=myisam;

insert into test VALUES
(NULL, 48, 12, 44454322),
('abcdefgh',48, 12, 24333422),
('abcdefgh',48, 12, 24333422),
('abcdefgk', 32, 12, 4422422),
('abcdefgm', 48, 12, 999248),
(NULL, 48, 12, 828290),
(NULL, 48, 12, 828290),
(NULL, 48, 12, 442242),
('abcdefgh', 48, 12, 88892),
(NULL, 48, 12, 30),
(NULL, 48, 12, 442244),
('abcdefgk', 48, 12, 88892);

select * from DIM_test;

select * from test;

select if(isnull(name), 0, name) n, val1, val2, count(*)
from test
group by n, val1, val2;

select if(isnull(name), 0, name) n, val1, val2, count(distinct val3)
from test
group by n, val1, val2;

select if(isnull(name), 0, (select id from DIM_test where DIM_test.name=test.name)) n, val1, val2, count(*)
from test
group by n, val1, val2;

select if(isnull(name), 0, (select id from DIM_test where DIM_test.name=test.name)) n, val1, val2, count(distinct val2)
from test
group by n, val1, val2;

Thank you for the bug report. Verified just as described (there should be distinct val3 in the last query):

valeriy-kravchuks-macbook-pro:5.1 openxs$ bin/mysql -uroot test
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 5
Server version: 5.1.36-debug-log Source distribution

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> drop table if exists DIM_test;
Query OK, 0 rows affected (0.00 sec)

mysql> create table DIM_test (id int unsigned auto_increment primary key, name varchar(128),
    -> unique (name)) engine=myisam;
Query OK, 0 rows affected (0.07 sec)

mysql> insert into DIM_test VALUES
    -> (NULL, 'abcdefgh'),
    -> (NULL, 'abcdefg'),
    -> (NULL, 'abcdefgk'),
    -> (NULL, 'abcdefgl'),
    -> (NULL,'abcdefgm'),
    -> (NULL, 'abcdefgn');
Query OK, 6 rows affected (0.01 sec)
Records: 6  Duplicates: 0  Warnings: 0

mysql> drop table if exists test;
Query OK, 0 rows affected (0.00 sec)

mysql> create table test (name varchar(128), val1 int unsigned, val2 int unsigned, val3 int
    -> unsigned, index(name(16))) engine=myisam;
Query OK, 0 rows affected (0.14 sec)

mysql> insert into test VALUES
    -> (NULL, 48, 12, 44454322),
    -> ('abcdefgh',48, 12, 24333422),
    -> ('abcdefgh',48, 12, 24333422),
    -> ('abcdefgk', 32, 12, 4422422),
    -> ('abcdefgm', 48, 12, 999248),
    -> (NULL, 48, 12, 828290),
    -> (NULL, 48, 12, 828290),
    -> (NULL, 48, 12, 442242),
    -> ('abcdefgh', 48, 12, 88892),
    -> (NULL, 48, 12, 30),
    -> (NULL, 48, 12, 442244),
    -> ('abcdefgk', 48, 12, 88892);
Query OK, 12 rows affected (0.00 sec)
Records: 12  Duplicates: 0  Warnings: 0

mysql> select * from DIM_test;
+----+----------+
| id | name     |
+----+----------+
|  1 | abcdefgh | 
|  2 | abcdefg  | 
|  3 | abcdefgk | 
|  4 | abcdefgl | 
|  5 | abcdefgm | 
|  6 | abcdefgn | 
+----+----------+
6 rows in set (0.00 sec)

mysql> select * from test;
+----------+------+------+----------+
| name     | val1 | val2 | val3     |
+----------+------+------+----------+
| NULL     |   48 |   12 | 44454322 | 
| abcdefgh |   48 |   12 | 24333422 | 
| abcdefgh |   48 |   12 | 24333422 | 
| abcdefgk |   32 |   12 |  4422422 | 
| abcdefgm |   48 |   12 |   999248 | 
| NULL     |   48 |   12 |   828290 | 
| NULL     |   48 |   12 |   828290 | 
| NULL     |   48 |   12 |   442242 | 
| abcdefgh |   48 |   12 |    88892 | 
| NULL     |   48 |   12 |       30 | 
| NULL     |   48 |   12 |   442244 | 
| abcdefgk |   48 |   12 |    88892 | 
+----------+------+------+----------+
12 rows in set (0.00 sec)

mysql> select if(isnull(name), 0, name) n, val1, val2, count(*)
    -> from test
    -> group by n, val1, val2;
+----------+------+------+----------+
| n        | val1 | val2 | count(*) |
+----------+------+------+----------+
| 0        |   48 |   12 |        6 | 
| abcdefgh |   48 |   12 |        3 | 
| abcdefgk |   32 |   12 |        1 | 
| abcdefgk |   48 |   12 |        1 | 
| abcdefgm |   48 |   12 |        1 | 
+----------+------+------+----------+
5 rows in set (0.01 sec)

mysql> select if(isnull(name), 0, name) n, val1, val2, count(distinct val3)
    -> from test
    -> group by n, val1, val2;
+----------+------+------+----------------------+
| n        | val1 | val2 | count(distinct val3) |
+----------+------+------+----------------------+
| 0        |   48 |   12 |                    5 | 
| abcdefgh |   48 |   12 |                    2 | 
| abcdefgk |   32 |   12 |                    1 | 
| abcdefgk |   48 |   12 |                    1 | 
| abcdefgm |   48 |   12 |                    1 | 
+----------+------+------+----------------------+
5 rows in set (0.00 sec)

mysql> select if(isnull(name), 0, (select id from DIM_test where DIM_test.name=test.name)) n,
    -> val1, val2, count(*)
    -> from test
    -> group by n, val1, val2;
+------+------+------+----------+
| n    | val1 | val2 | count(*) |
+------+------+------+----------+
|    0 |   48 |   12 |        6 | 
|    1 |   48 |   12 |        3 | 
|    3 |   32 |   12 |        1 | 
|    3 |   48 |   12 |        1 | 
|    5 |   48 |   12 |        1 | 
+------+------+------+----------+
5 rows in set (0.01 sec)

mysql> select if(isnull(name), 0, (select id from DIM_test where DIM_test.name=test.name)) n, val1, val2, count(distinct val3) from test group by n, val1, val2; 
+------+------+------+----------------------+
| n    | val1 | val2 | count(distinct val3) |
+------+------+------+----------------------+
|    3 |   32 |   12 |                    1 | 
|    3 |   48 |   12 |                    3 | 
|    0 |   48 |   12 |                    5 | 
+------+------+------+----------------------+
3 rows in set (0.01 sec)

The result above is incorrect. Actually, distinct matters:

mysql> select if(isnull(name), 0, (select id from DIM_test where DIM_test.name=test.name)) n, val1, val2, count(val3) from test group by n, val1, val2;
+------+------+------+-------------+
| n    | val1 | val2 | count(val3) |
+------+------+------+-------------+
|    0 |   48 |   12 |           6 | 
|    1 |   48 |   12 |           3 | 
|    3 |   32 |   12 |           1 | 
|    3 |   48 |   12 |           1 | 
|    5 |   48 |   12 |           1 | 
+------+------+------+-------------+
5 rows in set (0.01 sec)

Also EXPLAIN shows the reason:

mysql> explain select if(isnull(name), 0, (select id from DIM_test where DIM_test.name=test.name)) n, val1, val2, count(distinct val3) from test group by n, val1, val2\G
*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: test
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 12
        Extra: Using filesort
*************************** 2. row ***************************
           id: 2
  select_type: DEPENDENT SUBQUERY
        table: DIM_test
         type: ref
possible_keys: name
          key: name
      key_len: 131
          ref: func
         rows: 2
        Extra: Using where
2 rows in set (0.00 sec)

For the query with correct results temporary table is used:

mysql> explain select if(isnull(name), 0, (select id from DIM_test where DIM_test.name=test.name)) n, val1, val2, count(val3) from test group by n, val1, val2\G
*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: test
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 12
        Extra: Using temporary; Using filesort
*************************** 2. row ***************************
           id: 2
  select_type: DEPENDENT SUBQUERY
        table: DIM_test
         type: ref
possible_keys: name
          key: name
      key_len: 131
          ref: func
         rows: 2
        Extra: Using where
2 rows in set (0.00 sec)

Also see duplicate bug #38217.

Simplified test case:

create table t1 (a int, b int);
insert into t1 values (4, 40), (1, 10), (2, 20), (2, 20), (3, 30);
select (select t1.a) aa, count(distinct b) from t1 group by aa;
aa      count(distinct b)
4       1
1       1
2       1
2       2

As you can see, output data consists of 1) incorrect (duplicated groups,
lost group) and 2) unordered data.

Output without DISTINCT:

select (select t1.a) aa, count(b) from t1 group by aa;
aa      count(b)
1       1
2       2
3       1
4       1

An intermediate patch: it solves the problem for simple GROUP BY expressions ("BY aa, ...", where "aa" is a subselect alias)

Attachment: 45640.patch (text/x-patch), 4.15 KiB.

The sufficiency of the fix (see previous commentary with a patch)
depends on the fix for the bug #48703: the find_order_in_list function
doesn't do deep analysis of GROUP BY expressions but analyses/substitutesd
the topmost items only, so inner items are unchecked and unchanged.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/90437

3201 Gleb Shchepa	2009-11-14
      Bug #45640: optimizer bug produces wrong results
      
      Grouping by a subquery in a query with a distinct aggregate
      function lead to a wrong result (wrong and unordered
      grouping values).
      
      There are two related problems:
      
      1) The query like this:
      
         SELECT (SELECT t1.a) aa, COUNT(DISTINCT b) c
         FROM t1 GROUP BY aa
      
      returned wrong result, because the outer reference "t1.a"
      in the subquery was substituted with the Item_ref item.
      
      The Item_ref item obtains data from the result_field object
      that refreshes once after the end of each group. This data
      is not applicable to filesort since filesort() doesn't care
      about groups (and doesn't update result_field objects with
      copy_fields() and so on). Also that data is not applicable
      to group separation algorithm: end_send_group() checks every
      record with test_if_group_changed() that evaluates Item_ref
      items, but it refreshes those Item_ref-s only after the end
      of group, that is a vicious circle and the grouped column
      values in the output are shifted.
      
      Fix: if 
             a) we grouping by a subquery and 
             b) that subquery has outer references to FROM list
                of the grouping query,
           then we substitute these outer references with
           Item_direct_ref like references under aggregate
           functions: Item_direct_ref obtains data directly
           from the current record.
      
      
      2) The query with a non-trivial grouping expression like:
      
         SELECT (SELECT t1.a) aa, COUNT(DISTINCT b) c
         FROM t1 GROUP BY aa+0
      
      also returned wrong result, since JOIN::exec() substitutes
      references to top-level aliases in SELECT list with Item_copy
      caching items. Item_copy items have same refreshing policy
      as Item_ref items, so the whole groping expression with
      Item_copy inside returns wrong result in filesort() and
      end_send_group().
      
      Fix: include aliased items into GROUP BY item tree instead
           of Item_ref references to them.
     @ mysql-test/r/group_by.result
        Test case for bug #45640
     @ mysql-test/t/group_by.test
        Test case for bug #45640
     @ sql/item.cc
        Bug #45640: optimizer bug produces wrong results
        
        - Item_field::fix_fields() has been modified to resolve
          aliases in GROUP BY item trees into aliased items instead
          of Item_ref items.
        
        - Item_outer_ref::walk() has been overloaded to implement
          Item_outer_ref tree search algorithm.
     @ sql/item.h
        Bug #45640: optimizer bug produces wrong results
        
        Item::find_outer_ref_processor() helper has been introduced
        and Item_outer_ref::walk() has been overloaded to implement
        Item_outer_ref tree search algorithm.
     @ sql/mysql_priv.h
        Bug #45640: optimizer bug produces wrong results
        
        fix_inner_refs() has been modified to accept group_list
        parameter.
     @ sql/sql_lex.cc
        Bug #45640: optimizer bug produces wrong results
        
        Initialization of st_select_lex::group_fix_field has
        been added.
     @ sql/sql_lex.h
        Bug #45640: optimizer bug produces wrong results
        
        The st_select_lex::group_fix_field field has been introduced
        to control alias resolution in Itef_fied::fix_fields.
     @ sql/sql_select.cc
        Bug #45640: optimizer bug produces wrong results
        
        - The fix_inner_refs function has been modified to treat
          subquery outer references like outer fields under aggregate
          functions, if they are included in GROUP BY item tree.
        
        - The find_order_in_list function has been modified to
          fix Item_field alias fields included in the GROUP BY item
          trees in a special manner.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/95667

3295 Gleb Shchepa	2009-12-24
      Bug #45640: optimizer bug produces wrong results
      
      Grouping by a subquery in a query with a distinct aggregate
      function lead to a wrong result (wrong and unordered
      grouping values).
      
      There are two related problems:
      
      1) The query like this:
      
         SELECT (SELECT t1.a) aa, COUNT(DISTINCT b) c
         FROM t1 GROUP BY aa
      
      returned wrong result, because the outer reference "t1.a"
      in the subquery was substituted with the Item_ref item.
      
      The Item_ref item obtains data from the result_field object
      that refreshes once after the end of each group. This data
      is not applicable to filesort since filesort() doesn't care
      about groups (and doesn't update result_field objects with
      copy_fields() and so on). Also that data is not applicable
      to group separation algorithm: end_send_group() checks every
      record with test_if_group_changed() that evaluates Item_ref
      items, but it refreshes those Item_ref-s only after the end
      of group, that is a vicious circle and the grouped column
      values in the output are shifted.
      
      Fix: if
             a) we grouping by a subquery and
             b) that subquery has outer references to FROM list
                of the grouping query,
           then we substitute these outer references with
           Item_direct_ref like references under aggregate
           functions: Item_direct_ref obtains data directly
           from the current record.
      
      2) The query with a non-trivial grouping expression like:
      
         SELECT (SELECT t1.a) aa, COUNT(DISTINCT b) c
         FROM t1 GROUP BY aa+0
      
      also returned wrong result, since JOIN::exec() substitutes
      references to top-level aliases in SELECT list with Item_copy
      caching items. Item_copy items have same refreshing policy
      as Item_ref items, so the whole groping expression with
      Item_copy inside returns wrong result in filesort() and
      end_send_group().
      
      Fix: include aliased items into GROUP BY item tree instead
           of Item_ref references to them.
     @ mysql-test/r/group_by.result
        Test case for bug #45640
     @ mysql-test/t/group_by.test
        Test case for bug #45640
     @ sql/item.cc
        Bug #45640: optimizer bug produces wrong results
        
        Item_field::fix_fields() has been modified to resolve
        aliases in GROUP BY item trees into aliased items instead
        of Item_ref items.
     @ sql/item.h
        Bug #45640: optimizer bug produces wrong results
        
        - Item::find_item_processor() has been introduced.
        - Item_ref::walk() has been modified to apply processors
          to itself too (not only to referenced item).
     @ sql/mysql_priv.h
        Bug #45640: optimizer bug produces wrong results
        
        fix_inner_refs() has been modified to accept group_list
        parameter.
     @ sql/sql_lex.cc
        Bug #45640: optimizer bug produces wrong results
        
        Initialization of st_select_lex::group_fix_field has
        been added.
     @ sql/sql_lex.h
        Bug #45640: optimizer bug produces wrong results
        
        The st_select_lex::group_fix_field field has been introduced
        to control alias resolution in Itef_fied::fix_fields.
     @ sql/sql_select.cc
        Bug #45640: optimizer bug produces wrong results
        
        - The fix_inner_refs function has been modified to treat
          subquery outer references like outer fields under aggregate
          functions, if they are included in GROUP BY item tree.
        
        - The find_order_in_list function has been modified to
          fix Item_field alias fields included in the GROUP BY item
          trees in a special manner.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/99546

3370 Gleb Shchepa	2010-02-06
      Bug #45640: optimizer bug produces wrong results
      
      Grouping by a subquery in a query with a distinct aggregate
      function lead to a wrong result (wrong and unordered
      grouping values).
      
      There are two related problems:
      
      1) The query like this:
      
         SELECT (SELECT t1.a) aa, COUNT(DISTINCT b) c
         FROM t1 GROUP BY aa
      
      returned wrong result, because the outer reference "t1.a"
      in the subquery was substituted with the Item_ref item.
      
      The Item_ref item obtains data from the result_field object
      that refreshes once after the end of each group. This data
      is not applicable to filesort since filesort() doesn't care
      about groups (and doesn't update result_field objects with
      copy_fields() and so on). Also that data is not applicable
      to group separation algorithm: end_send_group() checks every
      record with test_if_group_changed() that evaluates Item_ref
      items, but it refreshes those Item_ref-s only after the end
      of group, that is a vicious circle and the grouped column
      values in the output are shifted.
      
      Fix: if
             a) we grouping by a subquery and
             b) that subquery has outer references to FROM list
                of the grouping query,
           then we substitute these outer references with
           Item_direct_ref like references under aggregate
           functions: Item_direct_ref obtains data directly
           from the current record.
      
      2) The query with a non-trivial grouping expression like:
      
         SELECT (SELECT t1.a) aa, COUNT(DISTINCT b) c
         FROM t1 GROUP BY aa+0
      
      also returned wrong result, since JOIN::exec() substitutes
      references to top-level aliases in SELECT list with Item_copy
      caching items. Item_copy items have same refreshing policy
      as Item_ref items, so the whole groping expression with
      Item_copy inside returns wrong result in filesort() and
      end_send_group().
      
      Fix: include aliased items into GROUP BY item tree instead
           of Item_ref references to them.
     @ mysql-test/r/group_by.result
        Test case for bug #45640
     @ mysql-test/t/group_by.test
        Test case for bug #45640
     @ sql/item.cc
        Bug #45640: optimizer bug produces wrong results
        
        Item_field::fix_fields() has been modified to resolve
        aliases in GROUP BY item trees into aliased items instead
        of Item_ref items.
     @ sql/item.h
        Bug #45640: optimizer bug produces wrong results
        
        - Item::find_item_processor() has been introduced.
        - Item_ref::walk() has been modified to apply processors
          to itself too (not only to referenced item).
     @ sql/mysql_priv.h
        Bug #45640: optimizer bug produces wrong results
        
        fix_inner_refs() has been modified to accept group_list
        parameter.
     @ sql/sql_lex.cc
        Bug #45640: optimizer bug produces wrong results
        
        Initialization of st_select_lex::group_fix_field has
        been added.
     @ sql/sql_lex.h
        Bug #45640: optimizer bug produces wrong results
        
        The st_select_lex::group_fix_field field has been introduced
        to control alias resolution in Itef_fied::fix_fields.
     @ sql/sql_select.cc
        Bug #45640: optimizer bug produces wrong results
        
        - The fix_inner_refs function has been modified to treat
          subquery outer references like outer fields under aggregate
          functions, if they are included in GROUP BY item tree.
        
        - The find_order_in_list function has been modified to
          fix Item_field alias fields included in the GROUP BY item
          trees in a special manner.

Pushed into 5.1.45 (revid:joro@sun.com-20100301083827-xnimmrjg6bh33o1o) (version source revid:joerg@mysql.com-20100212173307-ph563zr4wmoklgwd) (merge vers: 5.1.45) (pib:16)

Pushed into 6.0.14-alpha (revid:alik@sun.com-20100302142746-u1gxdf5yk2bjrq3e) (version source revid:alik@sun.com-20100225090938-2j5ybqoau570mytu) (merge vers: 6.0.14-alpha) (pib:16)

Pushed into 5.5.3-m2 (revid:alik@sun.com-20100302072233-t3uqgjzdukt1pyhe) (version source revid:alexey.kopytov@sun.com-20100209075938-mmcnyf6w631ozc45) (merge vers: 5.5.2-m2) (pib:16)

Pushed into mysql-next-mr (revid:alik@sun.com-20100302072432-k8xvfkgcggkwgi94) (version source revid:alik@sun.com-20100224135227-rcqs9pe9b2in80pf) (pib:16)

Noted in 5.1.45, 5.5.3, 6.0.14 changelogs.

Grouping by a subquery in a query with a DISTINCT aggregate function
led to incorrect and unordered grouping values.

Pushed into 5.1.47-ndb-7.0.16 (revid:martin.skold@mysql.com-20100617114014-bva0dy24yyd67697) (version source revid:vasil.dimov@oracle.com-20100331130613-8ja7n0vh36a80457) (merge vers: 5.1.46) (pib:16)

Pushed into 5.1.47-ndb-6.2.19 (revid:martin.skold@mysql.com-20100617115448-idrbic6gbki37h1c) (version source revid:martin.skold@mysql.com-20100609140708-52rvuyq4q500sxkq) (merge vers: 5.1.45-ndb-6.2.19) (pib:16)

Pushed into 5.1.47-ndb-6.3.35 (revid:martin.skold@mysql.com-20100617114611-61aqbb52j752y116) (version source revid:vasil.dimov@oracle.com-20100331130613-8ja7n0vh36a80457) (merge vers: 5.1.46) (pib:16)