MySQL Bugs: #18465: Optimizer/execution plan bug with dependent subqueries in ON clauses of JOINs

Bug #18465	Optimizer/execution plan bug with dependent subqueries in ON clauses of JOINs
Submitted:	23 Mar 2006 17:43	Modified:	25 Sep 2009 3:58
Reporter:	Beat Vontobel (Silver Quality Contributor) (OCA)	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: Optimizer	Severity:	S2 (Serious)
Version:	5.0.20-BK, 5.0.19, 5.0.85	OS:	Any (any)
Assigned to:		CPU Architecture:	Any
Tags:	qc

Description:
There seems to be a serious deficiency in the optimizer for dependent subqueries in the ON clause of a JOIN: This makes it impossible to execute some very common SQL queries to JOIN a row and its successor (with respect to a certain column) for larger tables.

A "momentarily constant" (regarding one execution of the dependent subquery) value from the outer query seems not to be identified as such. Even if EXPLAIN tells so, no index seems to be used in the subquery and a full table scan seems to be done for _every_ execution of the dependent subquery, even if that is absolutely not necessary.

How to repeat:
Execute the following test script and look at the execution times of the three SELECTs and at the output of the three EXPLAINs:

-- ----------------------------------------------------------------

--
-- CREATE THE TABLE AND FILL IT WITH 1'000'000 ROWS OF TEST DATA
--

CREATE TABLE IF NOT EXISTS test (i INT PRIMARY KEY);

DROP PROCEDURE IF EXISTS fill_test;

delimiter //

CREATE PROCEDURE fill_test(e INT)
BEGIN
  DECLARE c INT DEFAULT 1;
  WHILE c <= e DO
    INSERT INTO test VALUES (c);
    SET c := c + 1;
  END WHILE;
END //

delimiter ;

CALL fill_test(1000000);

--
-- DO SOME QUERIES ON THE TABLE
-- (THE THIRD ONE IS THE BAD ONE)
--

EXPLAIN
SELECT MIN(i)
FROM   test
WHERE  i > 50;

EXPLAIN
SELECT     o1.i AS o1,
           o2.i AS o2
FROM       test AS o1
INNER JOIN test AS o2
ON         ( SELECT MIN(i)
             FROM   test
             WHERE  i > o1.i ) = o2.i
WHERE      o1.i = 50;

EXPLAIN
SELECT     o1.i AS o1, o2.i AS o2
FROM       test AS o1
INNER JOIN test AS o2
ON         ( SELECT MIN(i)
             FROM   test
             WHERE  i > o1.i ) = o2.i
WHERE      o1.i BETWEEN 50 AND 51;

--
-- EXPLAIN THE THREE QUERIES
--

EXPLAIN
SELECT MIN(i)
FROM   test
WHERE  i > 50;

EXPLAIN
SELECT     o1.i AS o1,
           o2.i AS o2
FROM       test AS o1
INNER JOIN test AS o2
ON         ( SELECT MIN(i)
             FROM   test
             WHERE  i > o1.i ) = o2.i
WHERE      o1.i = 50;

EXPLAIN
SELECT     o1.i AS o1, o2.i AS o2
FROM       test AS o1
INNER JOIN test AS o2
ON         ( SELECT MIN(i)
             FROM   test
             WHERE  i > o1.i ) = o2.i
WHERE      o1.i BETWEEN 50 AND 51;

--
-- CLEAN UP
--

DROP PROCEDURE fill_test;
DROP TABLE test;

-- ----------------------------------------------------------------

That's the output I get for the three SELECTs:

+--------+
| MIN(i) |
+--------+
|     51 |
+--------+
1 row in set (0.00 sec)

+----+----+
| o1 | o2 |
+----+----+
| 50 | 51 |
+----+----+
1 row in set (0.00 sec)

+----+----+
| o1 | o2 |
+----+----+
| 50 | 51 |
| 51 | 52 |
+----+----+
2 rows in set (3.15 sec)

The first two queries execute in almost no time, the third one needs over three seconds.

Now look at the EXPLAIN output for the three queries:

+----+-------------+-------+------+---------------+------+---------+------+------+------------------------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows | Extra                        |
+----+-------------+-------+------+---------------+------+---------+------+------+------------------------------+
|  1 | SIMPLE      | NULL  | NULL | NULL          | NULL | NULL    | NULL | NULL | Select tables optimized away |
+----+-------------+-------+------+---------------+------+---------+------+------+------------------------------+

+----+--------------------+-------+-------+---------------+---------+---------+-------+------+------------------------------+
| id | select_type        | table | type  | possible_keys | key     | key_len | ref   | rows | Extra                        |
+----+--------------------+-------+-------+---------------+---------+---------+-------+------+------------------------------+
|  1 | PRIMARY            | o1    | const | PRIMARY       | PRIMARY | 4       | const |    1 | Using index                  |
|  1 | PRIMARY            | o2    | const | PRIMARY       | PRIMARY | 4       | const |    1 | Using index                  |
|  2 | DEPENDENT SUBQUERY | NULL  | NULL  | NULL          | NULL    | NULL    | NULL  | NULL | Select tables optimized away |
+----+--------------------+-------+-------+---------------+---------+---------+-------+------+------------------------------+

+----+--------------------+-------+--------+---------------+---------+---------+------+---------+--------------------------+
| id | select_type        | table | type   | possible_keys | key     | key_len | ref  | rows    | Extra                    |
+----+--------------------+-------+--------+---------------+---------+---------+------+---------+--------------------------+
|  1 | PRIMARY            | o1    | range  | PRIMARY       | PRIMARY | 4       | NULL |       2 | Using where; Using index |
|  1 | PRIMARY            | o2    | eq_ref | PRIMARY       | PRIMARY | 4       | func |       1 | Using where; Using index |
|  2 | DEPENDENT SUBQUERY | test  | index  | PRIMARY       | PRIMARY | 4       | NULL | 1000000 | Using where; Using index |
+----+--------------------+-------+--------+---------------+---------+---------+------+---------+--------------------------+
3 rows in set (0.00 sec)

In the first two queries the dependent subquery can even be optimized away completely (in the second query the constant value for o1.i from the WHERE clause in the outer query will be propagated into the subquery) which is good and actually the expected behaviour. In the third query we ask for a range of rows in the outer query. Of course the dependent subquery now needs to be executed "as a function" once for every possible row from o1 (only two rows in this example) to find the according row from o2. The strange behaviour now is, that for every one of the rows from the outer query, the subquery has to scan _all_ the rows from the table (1'000'000). The EXPLAIN shows that the PRIMARY KEY is used in the subquery - but then no full table scan should be necessary. Only one row from the index should be retrieved, as the value for o1.i from the outer query actually is a constant for every single execution of the subquery.

Suggested fix:
The execution plan should be:

* Take one row from o1
** Execute dependent subquery with the constant value o1.i _using_ the KEY
** Find the according row from o2

Thank you for a bug report. Verified just as described with 5.0.20-BK (ChangeSet@1.2108, 2006-03-23 22:29:53+01:00) on Linux.

Thank you for your bug report. We are aware of the problem described here. Solving the problem will require new features to be introduced into the query optimizer. They will be available in a future release.

I reduced a customer query to another case of this mis-optimization. This is a straight forward solution of the groupwise maximum problem using a subquery. But this time there is a workaround:

--preparations
drop table if exists t1;
create table t1 (c1 int unsigned primary key, c2 int, c3 int);
insert into t1 values (1,1,2), (2,1,3), (3,2,2), (4,2,3), (5,2,4);

-- this is the bad query
explain select A.* from t1 A where concat(A.c2, "-", A.c3) in (select concat(B.c2, "-", max(B.c3)) from t1 B group by B.c2) \G

-- and this is the workaround, doing per-component comparison
explain select A.* from t1 A join (select B.c2 _c2, max(B.c3) _c3 from t1 B group by B.c2) C on A.c2=_c2 and A.c3=_c3 \G

Remark to my previous comment: 

my example does not match the original bug title (the subquery is not in the ON part but uses IN). But IMHO it's the same path in the optimizer - where extra conditions are added to the inner query that make it depend on the outer table.