MySQL Bugs: #17641: wrong WHERE / JOIN Clause Optimization

Bug #17641	wrong WHERE / JOIN Clause Optimization
Submitted:	22 Feb 2006 10:34	Modified:	24 Jan 2014 12:17
Reporter:	Andre Timmer	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Optimizer	Severity:	S2 (Serious)
Version:	5.0.18	OS:	Any (All)
Assigned to:	Assigned Account	CPU Architecture:	Any

Description:
Where join's are executed first and then the where clause.
At least in query below.

How to repeat:
==> Query
select
      bb.bkwi_id
,     bb.bkwi_volgnr
,     bb.bkwi_persoonsnr_marktselect
,     aa.achternaam                 
,     aa.titels_achter
,     bb.code_functie
,     cc.omschrijving
from
      contactpersoon aa
      join      functie     bb  on (bb.bkwi_persoonsnr_marktselect =
aa.bkwi_persoonsnr_marktselect)
      left join ref_functie cc  on (cc.code = bb.code_functie)
where exists (
             select ''
             from   vestiging xx
             where  xx.bkwi_id     = bb.bkwi_id
             and    xx.bkwi_volgnr = bb.bkwi_volgnr
             and    xx.ind_economisch_actief = 1
             );

==> Plan
+----+--------------------+-------+--------+---------------+------------+---------+------------------------------------+------+--------------------------+
| id | select_type        | table | type   | possible_keys | key        | key_len | ref                                | rows | Extra                    |
+----+--------------------+-------+--------+---------------+------------+---------+------------------------------------+------+--------------------------+
|  1 | PRIMARY            | aa    | ALL    | PRIMARY       | NULL       | NULL    | NULL                               |    6 |                          |
|  1 | PRIMARY            | bb    | ref    | functie_i1    | functie_i1 | 4       | bbr.aa.bkwi_persoonsnr_marktselect |    1 | Using where; Using index |
|  1 | PRIMARY            | cc    | eq_ref | PRIMARY       | PRIMARY    | 2       | bbr.bb.code_functie                |    1 |                          |
|  2 | DEPENDENT SUBQUERY | xx    | eq_ref | PRIMARY       | PRIMARY    | 6       | bbr.bb.bkwi_id,bbr.bb.bkwi_volgnr  |    1 | Using where              |
+----+--------------------+-------+--------+---------------+------------+---------+------------------------------------+------+--------------------------+

The problem / inefficiency is that  first all joins are executed and then the where clause.

Correct order is:
- join aa with bb
- do dependent subquery
- join remaining rows with cc

Suggested fix:
The where clause information should be used during or in between the joins.

A filter should be used by the optimizer as soon as possible.

Thank you for a problem report. Why do you think that your suggested plan will be better? Can you, please, provide a set of table definitions and data to prove your idea?

===========================
= data
===========================

create table t1 (id integer);
create table t2 (id integer, t1_id integer);
create table t3 (id integer, t1_id integer, zipcode integer);
create table t4 (id integer, t1_id integer);
create table t5 (id integer, t1_id integer, name varchar(10));

alter table t1 add primary key (id);
alter table t2 add primary key (id);
alter table t3 add primary key (id);
alter table t4 add primary key (id);
alter table t5 add primary key (id);

alter table t2 add index (t1_id);
alter table t3 add index (t1_id);
alter table t4 add index (t1_id);
alter table t5 add index (t1_id);

insert into t1 (id) values (1);
insert into t1 (id) values (2);
insert into t1 (id) values (3);
insert into t1 (id) values (4);
insert into t1 (id) values (5);

insert into t2 (id, t1_id) values (1, 1);
insert into t2 (id, t1_id) values (2, 2);
insert into t2 (id, t1_id) values (3, 3);
insert into t2 (id, t1_id) values (4, 4);
insert into t2 (id, t1_id) values (5, 5);

insert into t3 (id, t1_id, zipcode) values (1, 1, 1);
insert into t3 (id, t1_id, zipcode) values (2, 2, 1234);
insert into t3 (id, t1_id, zipcode) values (3, 3, 2);
insert into t3 (id, t1_id, zipcode) values (4, 4, 3);
insert into t3 (id, t1_id, zipcode) values (5, 5, 4);

insert into t4 (id, t1_id) values (1, 1);
insert into t4 (id, t1_id) values (2, 2);
insert into t4 (id, t1_id) values (3, 3);
insert into t4 (id, t1_id) values (4, 4);
insert into t4 (id, t1_id) values (5, 5);

insert into t5 (id, t1_id, name) values (1, 1, 'Name company I');
insert into t5 (id, t1_id, name) values (2, 2, 'Name company II');
insert into t5 (id, t1_id, name) values (3, 3, 'Name company III');
insert into t5 (id, t1_id, name) values (4, 4, 'Name company IV');
insert into t5 (id, t1_id, name) values (5, 5, 'Name company V');

commit;

analyze table t1;
analyze table t2;
analyze table t3;
analyze table t4;
analyze table t5;

===========================
= case
===========================

Example why i think it may be better:

==> pseudo select
select t1.id, name
from   t1, t2, t3, t4, t5
where  t1.id = t2.t1_id
and    t1.id = t3.t1_id 
and    t1.id = t4.t1_id 
and    t1.id = t5.t1_id 
and    t3.zipcode = 1234

Suppose t1 is the main table and only one t3 row meets zipcode = 1234.

==> plan choosen by MySQL
+----+-------------+-------+-------+---------------+---------+---------+--------------+------+--------------------------+
| id | select_type | table | type  | possible_keys | key     | key_len | ref          | rows | Extra                    |
+----+-------------+-------+-------+---------------+---------+---------+--------------+------+--------------------------+
|  1 | SIMPLE      | t1    | index | PRIMARY       | PRIMARY | 4       | NULL         |    5 | Using index              |
|  1 | SIMPLE      | t2    | ref   | t1_id,t1_id_2 | t1_id   | 5       | bbr.t1.id    |    1 | Using where; Using index |
|  1 | SIMPLE      | t5    | ref   | t1_id         | t1_id   | 5       | bbr.t1.id    |    1 | Using where              |
|  1 | SIMPLE      | t4    | ref   | t1_id         | t1_id   | 5       | bbr.t5.t1_id |    1 | Using where; Using index |
|  1 | SIMPLE      | t3    | ref   | t1_id         | t1_id   | 5       | bbr.t5.t1_id |    1 | Using where              |
+----+-------------+-------+-------+---------------+---------+---------+--------------+------+--------------------------+

It says:
1 t1 --> 5 rows
2 t2 --> 5 rows
3 t5 --> 5 rows
4 t4 --> 5 rows
5 t3 --> 1 row  (where clause is executed here, in last step)

==> a more efficient plan would be
1 t1 --> 5 rows
2 t2 --> 5 rows
3 t3 --> 1 row  (where clause is executed here, as third step)
4 t4 --> 1 row
5 t5 --> 1 row

==> an even better plan
1 t3 --> 1 row (where clause is executed here, as first step)
2 t1 --> 1 row
3 t2 --> 1 row
4 t3 --> 1 row
5 t4 --> 1 row

Thank you for a detailed test case. Here is what I got for your test case on 5.0.21-BK:

mysql> EXPLAIN EXTENDED select t1.id, name from   t1, t2, t3, t4, t5 where  t1.id = t2.t1_id and    t1.id = t3.t1_id  and    t1.id = t4.t1_id  and    t1.id =
t5.t1_id  and    t3.zipcode = 1234;
+----+-------------+-------+-------+---------------+---------+---------+---------------+------+--------------------------+
| id | select_type | table | type  | possible_keys | key     | key_len | ref       | rows | Extra                    |
+----+-------------+-------+-------+---------------+---------+---------+---------------+------+--------------------------+
| 1  | SIMPLE      | t1    | index | PRIMARY       | PRIMARY | 4       |       | 5    | Using index              |
| 1  | SIMPLE      | t5    | ref   | t1_id         | t1_id   | 5       | test.t1.id    | 1    | Using where              |
| 1  | SIMPLE      | t3    | ref   | t1_id         | t1_id   | 5       | test.t5.t1_id | 1    | Using where              |
| 1  | SIMPLE      | t4    | ref   | t1_id         | t1_id   | 5       | test.t5.t1_id | 1    | Using where; Using index |
| 1  | SIMPLE      | t2    | ref   | t1_id         | t1_id   | 5       | test.t5.t1_id | 1    | Using where; Using index |
+----+-------------+-------+-------+---------------+---------+---------+---------------+------+--------------------------+
5 rows in set, 1 warning (0.01 sec)

mysql> show warnings \G
*************************** 1. row ***************************
  Level: Note
   Code: 1003
Message: select `test`.`t1`.`id` AS `id`,`test`.`t5`.`name` AS `name` from `test
`.`t1` join `test`.`t2` join `test`.`t3` join `test`.`t4` join `test`.`t5` where
 ((`test`.`t3`.`zipcode` = 1234) and (`test`.`t5`.`t1_id` = `test`.`t1`.`id`) an
d (`test`.`t3`.`t1_id` = `test`.`t1`.`id`) and (`test`.`t4`.`t1_id` = `test`.`t1
`.`id`) and (`test`.`t2`.`t1_id` = `test`.`t1`.`id`))
1 row in set (0.01 sec)

mysql> select version();
+-----------+
| version() |
+-----------+
| 5.0.21    |
+-----------+
1 row in set (0.00 sec)

So, according to "rows" column (not "key_len"), you'll have to select 5 primary key values from t1, and then 1 row from each of other tables, by ref, using index. It is absolutely optimal. You have no index on t3.zipcode, so you will need to read 5 rows from this table, according to your idea to use it first!

If you add the index, it will be used:

mysql> alter table t3 add index(zipcode);
Query OK, 5 rows affected (0.03 sec)
Records: 5  Duplicates: 0  Warnings: 0

mysql> EXPLAIN EXTENDED select t1.id, name from   t1, t2, t3, t4, t5 where  t1.id = t2.t1_id and    t1.id = t3.t1_id  and    t1.id = t4.t1_id  and    t1.id =
t5.t1_id  and    t3.zipcode = 1234;
+----+-------------+-------+--------+---------------+---------+---------+---------------+------+--------------------------+
| id | select_type | table | type   | possible_keys | key     | key_len | ref        | rows | Extra                    |
+----+-------------+-------+--------+---------------+---------+---------+---------------+------+--------------------------+
| 1  | SIMPLE      | t3    | ref    | t1_id,zipcode | zipcode | 5       | const        | 1    | Using where              |
| 1  | SIMPLE      | t1    | eq_ref | PRIMARY       | PRIMARY | 4       | test.t3.t1_id | 2    | Using index              |
| 1  | SIMPLE      | t5    | ref    | t1_id         | t1_id   | 5       | test.t1.id    | 1    | Using where              |
| 1  | SIMPLE      | t2    | ref    | t1_id         | t1_id   | 5       | test.t5.t1_id | 1    | Using where; Using index |
| 1  | SIMPLE      | t4    | ref    | t1_id         | t1_id   | 5       | test.t2.t1_id | 1    | Using where; Using index |
+----+-------------+-------+--------+---------------+---------+---------+---------------+------+--------------------------+
5 rows in set, 1 warning (0.01 sec)

So, I do not agree that your test case demonstrates any optimizer bug. Please, check.

Yes the example is not correct, sorry.

Here a better test query:
explain
select count(*) 
from   t1, t2, t3, t4
where  t1.id = t2.t1_id
and    t1.id = t3.t1_id 
and    t1.id = t4.t1_id 
and    exists (
              select ''
              from   t5
              where  t1.id = t5.t1_id 
              );

+----+--------------------+-------+-------+---------------+---------+---------+---------------------+------+--------------------------+
| id | select_type        | table | type  | possible_keys | key     | key_len | ref                 | rows | Extra                    |
+----+--------------------+-------+-------+---------------+---------+---------+---------------------+------+--------------------------+
|  1 | PRIMARY            | t1    | index | PRIMARY       | PRIMARY | 4       | NULL                |    5 | Using where; Using index |
|  1 | PRIMARY            | t2    | ref   | t1_id         | t1_id   | 5       | suwinetinkijk.t1.id |    1 | Using where; Using index |
|  1 | PRIMARY            | t4    | ref   | t1_id         | t1_id   | 5       | suwinetinkijk.t1.id |    1 | Using where; Using index |
|  1 | PRIMARY            | t3    | index | t1_id         | t1_id   | 5       | NULL                |    3 | Using where; Using index |
|  2 | DEPENDENT SUBQUERY | t5    | ref   | t1_id         | t1_id   | 5       | suwinetinkijk.t1.id |    1 | Using where; Using index |
+----+--------------------+-------+-------+---------------+---------+---------+---------------------+------+--------------------------+

The DEPENDENT SUBQUERY is used last and in this example it should be executed second.

So with this better example my case still stands.

Why do you think the DEPENDENT SUBQUERY (EXISTS!) should be executed second? Only 1 row is selected from t5, anyway, by ref. Why optimizer should think that there are more chances to (NOT) find it than to (NOT) find the approipriate rows to JOIN in other tables? Sorry, I do not get it. Please, explain.

New testdata:
drop table t1;
drop table t2;
drop table t3;

create table t1 (id integer);
create table t2 (id integer, t1_id integer);
create table t3 (id integer, t2_id integer, zipcode integer);

alter table t1 add primary key (id);
alter table t2 add primary key (id);
alter table t3 add primary key (id);

alter table t2 add index (t1_id);
alter table t3 add index (t2_id);

insert into t1 (id) values (1);
insert into t1 (id) values (2);

insert into t2 (id, t1_id) values (1, 1);
insert into t2 (id, t1_id) values (2, 1);
insert into t2 (id, t1_id) values (3, 1);
insert into t2 (id, t1_id) values (4, 1);
insert into t2 (id, t1_id) values (5, 2);

insert into t3 (id, t2_id, zipcode) values (1, 2, 3);
insert into t3 (id, t2_id, zipcode) values (2, 1, 7);
insert into t3 (id, t2_id, zipcode) values (3, 1, 7);
insert into t3 (id, t2_id, zipcode) values (4, 1, 7);
insert into t3 (id, t2_id, zipcode) values (6, 1, 7);
insert into t3 (id, t2_id, zipcode) values (7, 1, 7);
insert into t3 (id, t2_id, zipcode) values (8, 1, 7);
insert into t3 (id, t2_id, zipcode) values (9, 1, 7);

commit;

analyze table t1;
analyze table t2;
analyze table t3;

New query:
select count(*) 
from   t1, t2, t3
where  t1.id    = t2.t1_id
and    t2.t1_id = t3.t2_id
and    t3.zipcode = 3;

+----+-------------+-------+-------+---------------+---------+---------+---------------------+------+--------------------------+
| id | select_type | table | type  | possible_keys | key     | key_len | ref                 | rows | Extra                    |
+----+-------------+-------+-------+---------------+---------+---------+---------------------+------+--------------------------+
|  1 | SIMPLE      | t1    | index | PRIMARY       | PRIMARY | 4       | NULL                |    2 | Using index              |
|  1 | SIMPLE      | t2    | ref   | t1_id         | t1_id   | 5       | suwinetinkijk.t1.id |    3 | Using where; Using index |
|  1 | SIMPLE      | t3    | ALL   | t2_id         | NULL    | NULL    | NULL                |    6 | Using where              |
+----+-------------+-------+-------+---------------+---------+---------+---------------------+------+--------------------------+

The result is 1 row.

What MySQL does is join t1 with t2 with t3 and then evaluates the where clause.

More efficient is to evaluate a where clause as soon as possible because less rows have to be joined. The intermediate resultset which is then filtered can be joined with the tables less.

In this case:
- MySQL detects a filter on t3
- knows it can be executed in this query in the first step:
  1) get row from t3:
    2) do where clause on this one row
      3) if row is OK then join with t2 and t1
          - goto 1) and get the next row 
      4) if row is NOT OK then don't join 
          - goto 1) and get the next row 

Following this plan:
- 9 rows would be fetched from t3
- only 1 applies and has to be joined with t2 and t3

The sooner in the execution plan a where condition is applied the better it is.
Especially true when joining multiple large tables and a there is a where clause  which really limits the resultset.

Still not a bug. In your last test case:

mysql> explain select count(*)  from   t1, t2, t3 where  t1.id    = t2.t1_id and    t2.t1_id = t3.t2_id and    t3.zipcode = 3\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: t1
         type: index
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: NULL
         rows: 2
        Extra: Using index
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: t2
         type: ref
possible_keys: t1_id
          key: t1_id
      key_len: 5
          ref: test.t1.id
         rows: 3
        Extra: Using where; Using index
*************************** 3. row ***************************
           id: 1
  select_type: SIMPLE
        table: t3
         type: ALL
possible_keys: t2_id
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 6
        Extra: Using where
3 rows in set (0.00 sec)

You got the plan above (t3 in NOT first table to start with) because there is no index on zipcode, so, if you'll start with t3 you'll have to read it ALL, and what is the difference then? If you add a proper index, plan will change according to your idea:

mysql> alter table t3 add index zip(zipcode);
Query OK, 8 rows affected (0.02 sec)
Records: 8  Duplicates: 0  Warnings: 0

mysql> explain select count(*)  from   t1, t2, t3 where  t1.id    = t2.t1_id an
d    t2.t1_id = t3.t2_id and    t3.zipcode = 3\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: t3
         type: ref
possible_keys: t2_id,zip
          key: zip
      key_len: 5
          ref: const
         rows: 1
        Extra: Using where
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: t1
         type: eq_ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: test.t3.t2_id
         rows: 2
        Extra: Using index
*************************** 3. row ***************************
           id: 1
  select_type: SIMPLE
        table: t2
         type: ref
possible_keys: t1_id
          key: t1_id
      key_len: 5
          ref: test.t1.id
         rows: 3
        Extra: Using where; Using index
3 rows in set (0.00 sec)

I'm glad we finally agree that with this table and indexes this query could be executed more efficiently.

Your suggestion is to add an index.
I or any other developer won't create an index on every column of a table to support variantions of this query!!! MySQL should be able to solve these kind of very COMMON queries more efficiently.

It should query as shown in your explain plan. First read table t3 (if no index use a full table scan). Suppose only 1% of the rows in t3 has the zipcode we are looking for the query could win 99% in speed.

If this is not a bug then it certainly is a performance enhancement request.

Thank you for persistence. You a right (I had to think for a while to agree). This:

"First read table t3 (if no index use a full table scan). Suppose only 1% of the rows in t3 has the zipcode we are looking for the query could win 99% in speed."

is true in case of _nested loops_ join is performed and _more than one row_ from other tables satisfies query conditions (as in your example). When performing query that way you'll have to scan table t3 only once. When t3 is the last, you'll have to scan it more than once! Optimizer should be able to find a best way even without indexes! It is a (serious) bug.

Thanks for your patience!
Ideally the optimizer should recognize this situation.
In cases where this is not possible for the optimizer todo so or the developer has a different opinion it should be possible to force MySQL, hopefully the STRAIGHT_JOIN hint will do the job.

This performance problem will be fixed in the frame of works for subquery optimizations currently planned for 5.2.

In a related situation, the optimizer (5.0.41) chooses an efficient query plan when LIKE is used with no wildcards or a wildcard at the right end of the string in a WHERE clause, but a very inefficient plan when the wildcard is at the left end.

Suppose there is the following query, where tables 1 through N are large:

SELECT field1, field2 -- etc.
FROM table1
INNER JOIN table2 ON table1.id = table2.id -- etc.
INNER JOIN tableN ON tableN_1.id = tableN.id
WHERE tableN.field1 LIKE 'value%' -- or 'value'

In this case the JOIN for tableN, and with it the WHERE, is evaluated first, as I was expecting, which results in a very fast query.  When the value to be selected is '%value' or '%value%', however, the the JOIN is moved to the end of the query plan, even though I assume it would have saved time in this instance to do a table scan on tableN if necessary, and that the value could be found in the index at any rate without one.

I hope this is not too different from the current bug to mention here.

I think this problem has been fixed with the semi-join optimizations delivered in MySQL 5.6.
If this is not the case, please re-open the bug report.

The problematic query uses the syntax

... WHERE EXISTS (SELECT * FROM ...)

This syntax is unfortunately not supported in our semi-join enhancements yet, but you may rewrite it into an equivalent IN subquery like this:

... WHERE 1 IN (SELECT 1 FROM ...)

The report also mentions lack of predicate selectivity. If you still have a problem with this, please file a new bug report. Join buffering may also sometimes be used to hide problems with wrong selectivity estimates.