Bug #54481 GROUP BY loses effect with JOIN + ORDER BY + LIMIT and join caching
Submitted: 14 Jun 2010 11:10 Modified: 15 Oct 2012 13:43
Reporter: John Embretsen Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Optimizer Severity:S2 (Serious)
Version:5.1-bzr, 5.6.99-m4, mysql-next-mr-opt-backporting OS:Any
Assigned to: Guilhem Bichot CPU Architecture:Any
Tags: join_cache_level, optimizer_switch

[14 Jun 2010 11:10] John Embretsen
Description:
When a server with optimizer_join_cache_level=5 (or 6, 7 or 8) executes a query using GROUP BY with LIMIT such as:

SELECT  t1.col_int_key AS field1, 
        t1.col_datetime_key 
FROM t1 
  JOIN (t3 JOIN t2 ON t2.col_int_key = t3.pk ) 
  ON (t2.pk >= t1 . `col_int_key` ) 
GROUP BY field1 
HAVING field1 <= 2 
ORDER BY field1, t1.col_datetime_key 
LIMIT 2;

wrong results are returned. It seems like the GROUP BY clause loses its effect.

mysql> SELECT  t1.col_int_key AS field1, t1.col_datetime_key FROM t1 JOIN (t3 JOIN t2 ON t2.col_int_key = t3.pk ) ON t2.pk >= t1.col_int_key GROUP BY field1 HAVING field1 <= 2 ORDER BY field1, t1.col_datetime_key LIMIT 2;
+--------+---------------------+
| field1 | col_datetime_key    |
+--------+---------------------+
|      0 | 2000-09-26 07:45:57 |
|      0 | 2000-09-26 07:45:57 |
+--------+---------------------+
2 rows in set (0.01 sec)

Results without LIMIT:

mysql> SELECT  t1.col_int_key AS field1, t1.col_datetime_key FROM t1 JOIN (t3 JOIN t2 ON t2.col_int_key = t3.pk ) ON t2.pk >= t1.col_int_key GROUP BY field1 HAVING field1 <= 2 ORDER BY field1, t1.col_datetime_key;
+--------+---------------------+
| field1 | col_datetime_key    |
+--------+---------------------+
|      0 | 2000-09-26 07:45:57 |
|      2 | 2003-02-11 21:19:41 |
+--------+---------------------+
2 rows in set (0.00 sec)

The latter result is the correct result for both queries, as is the case with optimizer_join_cache_level=0,1,2,3,4 and with MySQL 5.1.

(The optimizer_join_cache_level settings of 5-8 enable new features in mysql-next-mr-opt-backporting.)

How to repeat:
DROP TABLE IF EXISTS t1;
DROP TABLE IF EXISTS t2;
DROP TABLE IF EXISTS t3;

CREATE TABLE t1 (
  `pk` int NOT NULL AUTO_INCREMENT,
  `col_int_key` int DEFAULT NULL,
  `col_datetime_key` datetime DEFAULT NULL,
  PRIMARY KEY (`pk`),
  KEY `col_int_key` (`col_int_key`)
);

INSERT INTO t1 VALUES (15,2,'2003-02-11 21:19:41');
INSERT INTO t1 VALUES (16,3,'2009-10-18 02:27:49');
INSERT INTO t1 VALUES (17,0,'2000-09-26 07:45:57');

CREATE TABLE t2 (
  `pk` int NOT NULL AUTO_INCREMENT,
  `col_int_key` int DEFAULT NULL,
  `col_datetime_key` datetime DEFAULT NULL,
  PRIMARY KEY (`pk`),
  KEY `col_int_key` (`col_int_key`)
);

INSERT INTO t2 VALUES (14,1,'2000-04-21 00:00:00');
INSERT INTO t2 VALUES (98,1,'2009-08-08 22:38:53');

CREATE TABLE t3 (
  `pk` int NOT NULL AUTO_INCREMENT,
  `col_int_key` int DEFAULT NULL,
  `col_datetime_key` datetime DEFAULT NULL,
  PRIMARY KEY (`pk`),
  KEY `col_int_key` (`col_int_key`)
);

INSERT INTO t3 VALUES (1,7,'2005-02-05 00:00:00');

SET SESSION optimizer_join_cache_level=8;

SELECT  t1.col_int_key AS field1, 
        t1.col_datetime_key 
FROM t1 
  JOIN (t3 JOIN t2 ON t2.col_int_key = t3.pk ) 
  ON t2.pk >= t1.col_int_key 
GROUP BY field1 
HAVING field1 <= 2 
ORDER BY field1, t1.col_datetime_key 
LIMIT 2;

SET SESSION optimizer_join_cache_level=1;

SELECT  t1.col_int_key AS field1, 
        t1.col_datetime_key 
FROM t1 
  JOIN (t3 JOIN t2 ON t2.col_int_key = t3.pk ) 
  ON t2.pk >= t1.col_int_key 
GROUP BY field1 
HAVING field1 <= 2 
ORDER BY field1, t1.col_datetime_key 
LIMIT 2;
[14 Jun 2010 11:13] John Embretsen
Issue has QA impact by causing diffs when comparing RQG results against 5.1 or another server without new optimizer features.
First observed with the RQG's optimizer_no_subquery.yy grammar.
[28 Jun 2010 9:27] Guilhem Bichot
Apparently GROUP BY is implicitely done by using an index scan on the proper table, but because there is disk-sweep MRR, rows get shuffled (as the order is then dictated by the rowids of the next table), so that rows of one group are not contiguous which breaks GROUP BY. There seems to be guards against this ("don't use implicit order if join buffering", or "don't do join buffering if implicit order") in code but apparently they don't catch this case.
[26 Jul 2010 16:24] Guilhem Bichot
more minimal testcase with explanation:

CREATE TABLE t1 (
  `col_int_key` int,
  `col_datetime` datetime,
  KEY `col_int_key` (`col_int_key`)
);

INSERT INTO t1 VALUES (2,'2003-02-11 21:19:41');
INSERT INTO t1 VALUES (3,'2009-10-18 02:27:49');
INSERT INTO t1 VALUES (0,'2000-09-26 07:45:57');

CREATE TABLE t2 (
  `col_int` int,
  `col_int_key` int,
  KEY `col_int_key` (`col_int_key`)
);

INSERT INTO t2 VALUES (14,1);
INSERT INTO t2 VALUES (98,1);

# The WHERE is true for all rows of t2
let $query=SELECT t1.col_int_key, t1.col_datetime 
FROM t1,t2
WHERE t2.col_int_key = 1 AND t2.col_int >= 3
GROUP BY t1.col_int_key
ORDER BY t1.col_int_key, t1.col_datetime
LIMIT 2;

eval explain $query;
eval $query;

# issue starts at level 5.
# Scenario: index scan on t1 will deliver rows in order for easy
# groupping and ordering: each row of t1 will either accumulate with
# the current group or end the current group and define a new group;
# as t2 does BKA, t1's rows are buffered, then (BKA) two rows from t2
# are fetched with index lookup for value 1, then each of those two
# rows is joined with the 3 cached rows of t1, so what is sent to the
# next stage looks like 0,2,3,0,2,3, which defines 6 groups, then
# those 6 are ordered (0,0,2,2,3,3) and then the two first are taken:
# 0,0: wrong. In other words, t2's BKA breaks the ordering obtained
# from t1. Either t2 shouldn't do BKA to preserve ordering, or no
# order should be assumed (an explicit sort should be done).

SET SESSION optimizer_join_cache_level=8;

eval explain $query;
eval $query;

drop table t1,t2;
[26 Jul 2010 16:49] Guilhem Bichot
MySQL recognizes that index scan on t1 will do automatic groupping for GROUP BY, and then MySQL will do an explicit sorting to satisfy ORDER BY.
The existing logic (to be fixed) is in make_join_readinfo():
  /* 
    If a join buffer is used to join a table the ordering by an index
    for the first non-constant table cannot be employed anymore.
  */
[27 Jul 2010 8:04] Øystein Grøvlen
Looks a bit similar to a bug I worked on earlier. Signing up as reviewer.
[30 Jul 2010 14:54] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/114767

3219 Guilhem Bichot	2010-07-30
      Fix for BUG#54481 "GROUP BY loses effect with JOIN + ORDER BY + LIMIT
      and join_cache_level=5-8". Join buffering yields tuples in non-sorted
      order, needs an explicit sort. See comment of sql_select.cc for details.
     @ mysql-test/include/join_cache.inc
        test for bug
     @ mysql-test/r/join_cache_jcl1.result
        before the fix, the final SELECT would return 0 and 0 in the first column
        of the two rows, instead of 0 and 2.
        Note how this final SELECT, which has FORCE INDEX, does a table scan on t1,
        whereas we could have expected an index scan due to FORCE INDEX.
        This is explained like this:
        - index scan is done for GROUP BY only if JOIN::simple_group is
        true; indeed around line sql_select.cc:2286,
        test_if_skip_sort_order(), which may pick index scan
        (JT_NEXT), is not called if simple_group is false
        - with the code patch, make_join_readinfo() now sets
        simple_group to "false" for this query due to join buffering,
        in order to force an explicit sort, thus index scan isn't picked.
     @ mysql-test/r/join_cache_jcl2.result
        before the fix, the final SELECT would return 0 and 0 in the first column
        of the two rows, instead of 0 and 2.
     @ mysql-test/r/join_cache_jcl3.result
        before the fix, the final SELECT would return 0 and 0 in the first column
        of the two rows, instead of 0 and 2.
     @ mysql-test/r/join_cache_jcl4.result
        before the fix, the final SELECT would return 0 and 0 in the first column
        of the two rows, instead of 0 and 2.
     @ mysql-test/r/join_cache_jcl5.result
        before the fix, the two final SELECTs would return 0 and 0 in the first column
        of the two rows, instead of 0 and 2.
     @ mysql-test/r/join_cache_jcl6.result
        before the fix, the two final SELECTs would return 0 and 0 in the first column
        of the two rows, instead of 0 and 2.
     @ mysql-test/r/join_cache_jcl7.result
        before the fix, the two final SELECTs would return 0 and 0 in the first column
        of the two rows, instead of 0 and 2.
     @ mysql-test/r/join_cache_jcl8.result
        before the fix, the two final SELECTs would return 0 and 0 in the first column
        of the two rows, instead of 0 and 2.
     @ sql/sql_select.cc
        Here is the scenario of the bug's testcase.
        First, assume join buffering is disabled.
        We read table t1 with index scan on the "t1.col_int_key" index, this
        yields tuples ordered by t1.col_int_key, which is the GROUP BY column:
        such tuple is joined with all tuples from t2, using nested-loop join
        (non-block) and the result is sent to end_write_group(): this function
        receives tuples in this order:
        tuple1_from_t1 | tuple1_from_t2
        tuple1_from_t1 | tuple2_from_t2
        tuple2_from_t1 | tuple1_from_t2
        tuple2_from_t1 | tuple2_from_t2
        tuple3_from_t1 | tuple1_from_t2
        tuple3_from_t1 | tuple2_from_t2
        GROUP BY is done with end_write_group(): each new tuple, if its GROUP BY
        column value is different from the previous tuple's, defines a new
        group. So we get three groups:
        tuple1_from_t1 | some_tuple_from_t2
        tuple2_from_t1 | some_tuple_from_t2
        tuple3_from_t1 | some_tuple_from_t2
        and after applying "ORDER BY" (a sort on t1.col_int_key and
        t1.col_datetime) and "LIMIT 2" and keeping only desired columns, we get
        tuple1_from_t1
        tuple2_from_t1
        Now, assume table t2 does join buffering (BKA in the bug's case,
        but bug exists also with block-nested-loop, see join_cache_jcl1.result).
        t2 caches tuples from t1, then all tuples from t2 are read, then each
        tuple from t2 is joined with tuples from t1 found in the cache
        (JOIN_CACHE_BKA::join_matching_records()), so what end_write_group()
        receives is (note the order: we join one tuple from t2 with all tuples
        from t1):
        tuple1_from_t1 | tuple1_from_t2
        tuple2_from_t1 | tuple1_from_t2
        tuple3_from_t1 | tuple1_from_t2
        tuple1_from_t1 | tuple2_from_t2
        tuple2_from_t1 | tuple2_from_t2
        tuple3_from_t1 | tuple2_from_t2
        and end_write_group() produces those groups:
        tuple1_from_t1 | tuple1_from_t2
        tuple2_from_t1 | tuple1_from_t2
        tuple3_from_t1 | tuple1_from_t2
        tuple1_from_t1 | tuple2_from_t2
        tuple2_from_t1 | tuple2_from_t2
        tuple3_from_t1 | tuple2_from_t2
        After "ORDER BY":
        tuple1_from_t1 | tuple1_from_t2
        tuple1_from_t1 | tuple2_from_t2
        tuple2_from_t1 | tuple1_from_t2
        tuple2_from_t1 | tuple2_from_t2
        tuple3_from_t1 | tuple1_from_t2
        tuple3_from_t1 | tuple2_from_t2
        and after "LIMIT 2":
        tuple1_from_t1 | tuple1_from_t2
        tuple1_from_t1 | tuple2_from_t2
        which is wrong.
        The problem is that join buffering changes the order of tuples in a way
        which is not suitable for end_write_group() (a GROUP BY column value
        different from previous does not necessarily signal an unseen
        value!). When such order change happens, end_write_group() shouldn't be
        used: we should first send joined tuples to a temporary table using
        end_write(), then do a sorting pass to sort on the GROUP BY column, and
        then do groupping (and then do ORDER BY and LIMIT).
        There is already code to handle this situation, in make_join_readinfo()
        (added by the fix for BUG 42955, which is a bug similar to ours):
          /* 
            If a join buffer is used to join a table the ordering by an index
            for the first non-constant table cannot be employed anymore.
          */
          for (i=join->const_tables ; i < join->tables ; i++)
          {
            JOIN_TAB *tab=join->join_tab+i;
            if (tab->use_join_cache)
            {
              JOIN_TAB *sort_by_tab= join->get_sort_by_join_tab();
              if (sort_by_tab && !join->need_tmp)
              {
                join->need_tmp= 1;
                join->simple_order= join->simple_group= 0;
                if (sort_by_tab->type == JT_NEXT)
                {
                  sort_by_tab->type= JT_ALL;
                  sort_by_tab->read_first_record= join_init_read_record;
                }
              }
              break;
            }
          }
        When this code works, it instructs, with
        need_tmp=1,simple_order=simple_group=0, to do an explicit sorting of
        joined tuples, not relying on the order returned by the index (as this
        order is lost by join buffering of the next tables). Note how this means
        that join buffering wins over index ordering, whereas in 5.1 it's the
        opposite choice (see make_join_readinfo() in 5.1).
        In our scenario the code doesn't work, because join->need_tmp is already
        true. The code believes that as need_tmp is true, sorting is already
        programmed to happen, so index order will not be relied upon anyway. In
        our case, need_tmp is true because we have GROUP BY and ORDER BY on
        different columns (see how need_tmp is set around comment "Check if we
        need to create a temporary table" in JOIN::optimize()). We need to
        buffer results of GROUP BY in a temporary table so that we can later
        sort them for ORDER BY (this is why we use end_write_group(), writing to
        a temporary table, instead of end_send_group() which would send groups
        directly to the client). We are in a case where we use a temporary table
        (for ORDER BY) but also plan to rely on index order (for GROUP BY).
        The fix is to remove the dependency of this code on join->need_tmp. This
        way, explicit sorting is done for GROUP BY.
        A similar change is done to make_join_orderinfo(), even though this
        function is currently unused.
        After this change, all tests pass except that for this query in
        group_min_max.test:
         EXPLAIN SELECT COUNT(DISTINCT t1_1.a) FROM t1 t1_1, t1 t1_2 GROUP BY t1_1.a;
        table scan for t1_1 is now picked instead of index scan.
        This is because the "if(sort_by_tab)" block is now entered, and so
        JT_NEXT (index scan) is changed to JT_ALL (table scan).
        While this may be ok, a second code change is proposed: don't switch
        back to JT_ALL. After all, JT_NEXT may be a better choice, for example
        if we are using index condition pushdown. Even for
         SELECT COUNT(*) FROM t1 t1_1, t1 t1_2 GROUP BY t1_1.a;
        and without any code change, MySQL picks index scan, so it may have a
        good reason. The proposal here is to respect this decision. There is no
        obvious reason why the need to do explicit sorting would mandate a table
        scan instead of an index scan. Not relying on index order shouldn't mean
        refusing to do index scan. As a result of this second code change,
        group_min_max.test passes, but this query in join_cache_jcl*.test:
         explain select t1.a, count(t2.p) as count
         from t1 left join t2 on t1.a=t2.a and t2.p % 2 = 1 group by t1.a;
        switches from "table scan" to "index scan". It actually had "index scan"
        before the fix for BUG 42955 changed it to "table scan"; that result
        file change had comment "Adjusted results for a test case", so it is
        assumed that reverting to the old result is not a problem; the SELECT
        (without EXPLAIN) at least returns unchanged results.
        Unrelated change: the unlikely(s->keyuse) was lost by a merge
        (epotemkin@mysql.com-20100726113407-97mj9zm97haxsf6t), we restore it
        as it has speed advantages (was added as part of BUG 50595).
[2 Aug 2010 18:17] Guilhem Bichot
Needs a new triage based on newly found versions impacted, see below.

As suggested by Olav, I verified that this bug isn't only in the new optimizer features (next-mr-opt-backporting) but exists also in:
- mysql-5.1-bugteam davi.arnaut@oracle.com-20100730203310-wff7axt7ykr7ybn4
- mysql-trunk-bugfixing alik@sun.com-20100731074505-q8qeutt9on7f51wp
- mysql-next-mr-bugfixing alik@sun.com-20100731074550-wc16wkez5zyy0erh
Those versions don't have BKA but the bug can happen with block-nested loop; when I run the test below in the above versions, the second SELECT returns wrong rows:
0	2000-09-26 07:45:57
0	2000-09-26 07:45:57
instead of the correct:
0	2000-09-26 07:45:57
2	2003-02-11 21:19:41

When I run the same test in next-mr-opt-backporting, the second SELECT is identically wrong, and with optimizer_join_cache_level>=5 the first SELECT is also wrong.

Test file:

CREATE TABLE t1 (
  `col_int_key` int,
  `col_datetime` datetime,
  KEY `col_int_key` (`col_int_key`)
);

INSERT INTO t1 VALUES (2,'2003-02-11 21:19:41');
INSERT INTO t1 VALUES (3,'2009-10-18 02:27:49');
INSERT INTO t1 VALUES (0,'2000-09-26 07:45:57');

CREATE TABLE t2 (
  `col_int` int,
  `col_int_key` int,
  KEY `col_int_key` (`col_int_key`)
);

INSERT INTO t2 VALUES (14,1);
INSERT INTO t2 VALUES (98,1);

let $query=SELECT t1.col_int_key, t1.col_datetime 
FROM t1,t2
WHERE t2.col_int_key = 1 AND t2.col_int >= 3
GROUP BY t1.col_int_key
ORDER BY t1.col_int_key, t1.col_datetime
LIMIT 2;

eval explain $query;
eval $query;

# by disabling one index and forcing another, we hit 
# block-nested-loop join and see the same bug
let $query=SELECT t1.col_int_key, t1.col_datetime 
FROM t1 force index (col_int_key), t2 ignore index (col_int_key)
WHERE t2.col_int_key = 1 AND t2.col_int >= 3
GROUP BY t1.col_int_key
ORDER BY t1.col_int_key, t1.col_datetime
LIMIT 2;

eval explain $query;
eval $query;

drop table t1,t2;
[11 Aug 2010 13:28] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/115494

3224 Guilhem Bichot	2010-08-11
      Fix for BUG#54481 "GROUP BY loses effect with JOIN + ORDER BY + LIMIT
      and join_cache_level=5-8". Join buffering yields tuples in non-sorted
      order, needs an explicit sort. See comment of sql_select.cc for details.
     @ mysql-test/include/join_cache.inc
        test for bug
     @ mysql-test/r/join_cache_jcl1.result
        before the fix, the final SELECT would return 0 and 0 in the first column
        of the two rows, instead of 0 and 2.
        Note how this final SELECT, which has FORCE INDEX, does a table scan on t1,
        whereas we could have expected an index scan due to FORCE INDEX.
        This is explained like this:
        - index scan is done for GROUP BY only if JOIN::simple_group is
        true; indeed around line sql_select.cc:2286,
        test_if_skip_sort_order(), which may pick index scan
        (JT_NEXT), is not called if simple_group is false
        - with the code patch, make_join_readinfo() now sets
        simple_group to "false" for this query due to join buffering,
        in order to force an explicit sort, thus index scan isn't picked.
     @ mysql-test/r/join_cache_jcl2.result
        before the fix, the final SELECT would return 0 and 0 in the first column
        of the two rows, instead of 0 and 2.
     @ mysql-test/r/join_cache_jcl3.result
        before the fix, the final SELECT would return 0 and 0 in the first column
        of the two rows, instead of 0 and 2.
     @ mysql-test/r/join_cache_jcl4.result
        before the fix, the final SELECT would return 0 and 0 in the first column
        of the two rows, instead of 0 and 2.
     @ mysql-test/r/join_cache_jcl5.result
        before the fix, the two final SELECTs would return 0 and 0 in the first column
        of the two rows, instead of 0 and 2.
     @ mysql-test/r/join_cache_jcl6.result
        before the fix, the two final SELECTs would return 0 and 0 in the first column
        of the two rows, instead of 0 and 2.
     @ mysql-test/r/join_cache_jcl7.result
        before the fix, the two final SELECTs would return 0 and 0 in the first column
        of the two rows, instead of 0 and 2.
     @ mysql-test/r/join_cache_jcl8.result
        before the fix, the two final SELECTs would return 0 and 0 in the first column
        of the two rows, instead of 0 and 2.
     @ sql/sql_select.cc
        Here is the scenario of the bug's testcase.
        First, assume join buffering is disabled.
        We read table t1 with index scan on the "t1.col_int_key" index, this
        yields tuples ordered by t1.col_int_key, which is the GROUP BY column:
        such tuple is joined with all tuples from t2, using nested-loop join
        (non-block) and the result is sent to end_write_group(): this function
        receives tuples in this order:
        tuple1_from_t1 | tuple1_from_t2
        tuple1_from_t1 | tuple2_from_t2
        tuple2_from_t1 | tuple1_from_t2
        tuple2_from_t1 | tuple2_from_t2
        tuple3_from_t1 | tuple1_from_t2
        tuple3_from_t1 | tuple2_from_t2
        GROUP BY is done with end_write_group(): each new tuple, if its GROUP BY
        column value is different from the previous tuple's, defines a new
        group. So we get three groups:
        tuple1_from_t1 | some_tuple_from_t2
        tuple2_from_t1 | some_tuple_from_t2
        tuple3_from_t1 | some_tuple_from_t2
        and after applying "ORDER BY" (a sort on t1.col_int_key and
        t1.col_datetime) and "LIMIT 2" and keeping only desired columns, we get
        tuple1_from_t1
        tuple2_from_t1
        Now, assume table t2 does join buffering (BKA in the bug's case,
        but bug exists also with block-nested-loop, see join_cache_jcl1.result).
        t2 caches tuples from t1, then all tuples from t2 are read, then each
        tuple from t2 is joined with tuples from t1 found in the cache
        (JOIN_CACHE_BKA::join_matching_records()), so what end_write_group()
        receives is (note the order: we join one tuple from t2 with all tuples
        from t1):
        tuple1_from_t1 | tuple1_from_t2
        tuple2_from_t1 | tuple1_from_t2
        tuple3_from_t1 | tuple1_from_t2
        tuple1_from_t1 | tuple2_from_t2
        tuple2_from_t1 | tuple2_from_t2
        tuple3_from_t1 | tuple2_from_t2
        and end_write_group() produces those groups:
        tuple1_from_t1 | tuple1_from_t2
        tuple2_from_t1 | tuple1_from_t2
        tuple3_from_t1 | tuple1_from_t2
        tuple1_from_t1 | tuple2_from_t2
        tuple2_from_t1 | tuple2_from_t2
        tuple3_from_t1 | tuple2_from_t2
        After "ORDER BY":
        tuple1_from_t1 | tuple1_from_t2
        tuple1_from_t1 | tuple2_from_t2
        tuple2_from_t1 | tuple1_from_t2
        tuple2_from_t1 | tuple2_from_t2
        tuple3_from_t1 | tuple1_from_t2
        tuple3_from_t1 | tuple2_from_t2
        and after "LIMIT 2":
        tuple1_from_t1 | tuple1_from_t2
        tuple1_from_t1 | tuple2_from_t2
        which is wrong.
        The problem is that join buffering changes the order of tuples in a way
        which is not suitable for end_write_group() (a GROUP BY column value
        different from previous does not necessarily signal an unseen
        value!). When such order change happens, end_write_group() shouldn't be
        used: we should first send joined tuples to a temporary table using
        end_write(), then do a sorting pass to sort on the GROUP BY column, and
        then do grouping (and then do ORDER BY and LIMIT).
        There is already code to handle this situation, in make_join_readinfo()
        (added by the fix for BUG 42955, which is a bug similar to ours):
          /* 
            If a join buffer is used to join a table the ordering by an index
            for the first non-constant table cannot be employed anymore.
          */
          for (i=join->const_tables ; i < join->tables ; i++)
          {
            JOIN_TAB *tab=join->join_tab+i;
            if (tab->use_join_cache)
            {
              JOIN_TAB *sort_by_tab= join->get_sort_by_join_tab();
              if (sort_by_tab && !join->need_tmp)
              {
                join->need_tmp= 1;
                join->simple_order= join->simple_group= 0;
                if (sort_by_tab->type == JT_NEXT)
                {
                  sort_by_tab->type= JT_ALL;
                  sort_by_tab->read_first_record= join_init_read_record;
                }
              }
              break;
            }
          }
        When this code works, it instructs, with
        need_tmp=1,simple_order=simple_group=0, to do an explicit sorting of
        joined tuples, not relying on the order returned by the index (as this
        order is lost by join buffering of the next tables). Note how this means
        that join buffering wins over index ordering, whereas in 5.1 it's the
        opposite choice (see make_join_readinfo() in 5.1).
        In our scenario the code doesn't work, because join->need_tmp is already
        true. The code believes that as need_tmp is true, sorting is already
        programmed to happen, so index order will not be relied upon anyway. In
        our case, need_tmp is true because we have GROUP BY and ORDER BY on
        different columns (see how need_tmp is set around comment "Check if we
        need to create a temporary table" in JOIN::optimize()). We need to
        buffer results of GROUP BY in a temporary table so that we can later
        sort them for ORDER BY (this is why we use end_write_group(), writing to
        a temporary table, instead of end_send_group() which would send groups
        directly to the client). We are in a case where we use a temporary table
        (for ORDER BY) but also plan to rely on index order (for GROUP BY).
        The fix is to remove the dependency of this code on join->need_tmp. This
        way, explicit sorting is done for GROUP BY.
        A similar change is done to make_join_orderinfo(), even though this
        function is currently unused.
        After this change, all tests pass except that for this query in
        group_min_max.test:
         EXPLAIN SELECT COUNT(DISTINCT t1_1.a) FROM t1 t1_1, t1 t1_2 GROUP BY t1_1.a;
        table scan for t1_1 is now picked instead of index scan.
        This is because the "if(sort_by_tab)" block is now entered, and so
        JT_NEXT (index scan) is changed to JT_ALL (table scan).
        While this may be ok, a second code change is proposed: don't switch
        back to JT_ALL. After all, JT_NEXT may be a better choice, for example
        if we are using index condition pushdown. Even for
         SELECT COUNT(*) FROM t1 t1_1, t1 t1_2 GROUP BY t1_1.a;
        and without any code change, MySQL picks index scan, so it may have a
        good reason. The proposal here is to respect this decision. There is no
        obvious reason why the need to do explicit sorting would mandate a table
        scan instead of an index scan. Not relying on index order shouldn't mean
        refusing to do index scan. As a result of this second code change,
        group_min_max.test passes, but this query in join_cache_jcl*.test:
         explain select t1.a, count(t2.p) as count
         from t1 left join t2 on t1.a=t2.a and t2.p % 2 = 1 group by t1.a;
        switches from "table scan" to "index scan". It actually had "index scan"
        before the fix for BUG 42955 changed it to "table scan"; that result
        file change had comment "Adjusted results for a test case", so it is
        assumed that reverting to the old result is not a problem; the SELECT
        (without EXPLAIN) at least returns unchanged results.
     @ sql/sql_select.h
        need_tmp==true does not mean that we won't rely on the ordered
        output of an index (see sql_select.cc for more details).
[12 Aug 2010 9:09] Guilhem Bichot
queued to next-mr-opt-backporting
[25 Aug 2010 20:24] Guilhem Bichot
- The bug has been fixed in next-mr-opt-backporting, and will reach by merge (we merge every week or so) next-mr-opt-team and then next-mr-bugfixing.
- About E/R for backporting to 5.1/5.5/trunk: those have different code from next-mr. They prefer using the index's order over using join buffering (if there is index order don't use join buffering), whereas next-mr has the opposite preference (if there is join buffering don't use index order but a sort).
So in next-mr, the fix has the effect of adding a sort in some cases. Whereas in 5.1/5.5/trunk the same fix would disable join buffering in some cases, and as the information "we use index order" is approximative in the server (has false positives), join buffering would be disabled more often than necessary (would slow down innocent queries not exposed to the present bug).
So the same fix would have different effects and effects in 5.1/5.5/trunk would be stronger with a slowdown.
So it cannot just be a backport, more work (at least a week) would be needed to study the 5.1 code and produce a more fine-tuned patch: E4/R3.
[25 Aug 2010 20:27] Guilhem Bichot
Note that the bug for 5.1 was found by me, and the query is
SELECT t1.col_int_key, t1.col_datetime 
FROM t1 force index (col_int_key), t2 ignore index (col_int_key)
WHERE t2.col_int_key = 1 AND t2.col_int >= 3
GROUP BY t1.col_int_key
ORDER BY t1.col_int_key, t1.col_datetime
LIMIT 2;
which is not so simple (FORCE INDEX, IGNORE INDEX, GROUP BY, ORDER BY, LIMIT).
So maybe this isn't such a widespread problem and fixing this in next-mr is enough?
[2 Oct 2010 18:15] Bugs System
Pushed into mysql-next-mr (revid:alexander.nozdrin@oracle.com-20101002181053-6iotvl26uurcoryp) (version source revid:alexander.nozdrin@oracle.com-20101002180917-h0n62akupm3z20nt) (pib:21)
[13 Nov 2010 16:14] Bugs System
Pushed into mysql-trunk 5.6.99-m5 (revid:alexander.nozdrin@oracle.com-20101113155825-czmva9kg4n31anmu) (version source revid:vasil.dimov@oracle.com-20100629074804-359l9m9gniauxr94) (merge vers: 5.6.99-m4) (pib:21)
[22 Nov 2010 1:28] Paul DuBois
Bug is not in any released 5.6.x version. No 5.6.x changelog entry needed.
[22 Nov 2010 1:30] Paul DuBois
I set this to Need Merge based on the possibility that there will be a 5.1 (5.1/5.5?) backport. If that will not be done, please just set this bug to Closed. Thanks.
[13 Jan 2011 19:24] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://dummy
[15 Oct 2012 13:43] Erlend Dahl
Fixed in 5.6.1