MySQL Bugs: #20199: Join slower than exists

Bug #20199	Join slower than exists
Submitted:	1 Jun 2006 13:11	Modified:	8 Dec 2009 16:04
Reporter:	Andre Timmer	Email Updates:
Status:	Duplicate	Impact on me:	None
Category:	MySQL Server: Optimizer	Severity:	S5 (Performance)
Version:	5.0.18, 6.0.14	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
Join slower than exists.
It gets really worth for tables with many rows and many columns.

The MySQL server could execute queries like this more efficient.

How to repeat:
delimiter ;

drop table if exists tab1;
drop table if exists tab2;

create table tab1 (id integer, col2 varchar(200), primary key (id));
create table tab2 (id integer, col2 varchar(200), primary key (id));

drop procedure if exists fill_tab;

delimiter //

create procedure fill_tab()
begin
  declare v_max    int default 1000 * 1000;
  declare v_commit int default 1000;
  declare v_cnt    int default 0;
  
  l:loop
    set v_cnt = v_cnt + 1;
    if (v_cnt > v_max) then leave l; end if;
    
    if (v_cnt/1000 = v_commit) then  commit; end if;
    
    insert into tab1 (id, col2) values (v_cnt, 'ddddddddddddddkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkeeekkkkkkkkkkkkkk');
    insert into tab2 (id, col2) values (v_cnt, 'ddddddddddddddkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkeeekkkkkkkkkkkkkk');
  end loop;
  
  commit;
  
end;
//

delimiter ;

call fill_tab();

analyze table tab1;
analyze table tab2;

select count(*) from tab1;
select count(*) from tab2;

explain
select count(*)
from   tab1 aa
,      tab2 bb
where  aa.id = bb.id;

explain
select count(*)
from   tab1 aa
where  exists (
              select '' 
              from   tab2 bb
              where  aa.id = bb.id
              );         
              
select count(*)
from   tab1 aa
,      tab2 bb
where  aa.id = bb.id;

select count(*)
from   tab1 aa
where  exists (
              select '' 
              from   tab2 bb
              where  aa.id = bb.id
              );         

------

The first query takes 30s, second 35s.
With the same kind of tables but more data the difference is 4 minutes and 10 minutes on our system!!!

It should not be necessairy to rewrite the queries.

Suggested fix:
Optimizer should know, based on meta data, that a joining one row will result in at most 1 row.

So the plan could be:
- recognizing it's a special join, at most one row
- recognizing count(*) is the same as count(tab1.id) because this is a NOT null column

  - read rows from tab1
    - search if there is a corresponding row in tab2 BUT don't add the row to the internal result, this is not necessairy

Thanks for your bug report.  I tested this under FC5:

mysql> select count(*) from tab1;
+----------+
| count(*) |
+----------+
|   529911 | 
+----------+
1 row in set (0.00 sec)

mysql> select count(*) from tab2;
+----------+
| count(*) |
+----------+
|   529910 | 
+----------+
1 row in set (0.00 sec)

mysql> 
mysql> explain
    -> select count(*)
    -> from   tab1 aa
    -> ,      tab2 bb
    -> where  aa.id = bb.id;
+----+-------------+-------+--------+---------------+---------+---------+------------+--------+-------------+
| id | select_type | table | type   | possible_keys | key     | key_len | ref        | rows   | Extra       |
+----+-------------+-------+--------+---------------+---------+---------+------------+--------+-------------+
|  1 | SIMPLE      | bb    | index  | PRIMARY       | PRIMARY | 4       | NULL       | 529910 | Using index | 
|  1 | SIMPLE      | aa    | eq_ref | PRIMARY       | PRIMARY | 4       | test.bb.id |      1 | Using index | 
+----+-------------+-------+--------+---------------+---------+---------+------------+--------+-------------+
2 rows in set (0.00 sec)

mysql> explain
    -> select count(*)
    -> from   tab1 aa
    -> where  exists (
    ->               select '' 
    ->               from   tab2 bb
    ->               where  aa.id = bb.id
    ->               );       
+----+--------------------+-------+--------+---------------+---------+---------+------------+--------+--------------------------+
| id | select_type        | table | type   | possible_keys | key     | key_len | ref        | rows   | Extra                    |
+----+--------------------+-------+--------+---------------+---------+---------+------------+--------+--------------------------+
|  1 | PRIMARY            | aa    | index  | NULL          | PRIMARY | 4       | NULL       | 613689 | Using where; Using index | 
|  2 | DEPENDENT SUBQUERY | bb    | eq_ref | PRIMARY       | PRIMARY | 4       | test.aa.id |      1 | Using index              | 
+----+--------------------+-------+--------+---------------+---------+---------+------------+--------+--------------------------+
2 rows in set (0.01 sec)

mysql>               
mysql> select count(*)
    -> from   tab1 aa
    -> ,      tab2 bb
    -> where  aa.id = bb.id;
+----------+
| count(*) |
+----------+
|   621794 | 
+----------+
1 row in set (7.06 sec)

mysql> select count(*)
    -> from   tab1 aa
    -> where  exists (
    ->               select '' 
    ->               from   tab2 bb
    ->               where  aa.id = bb.id
    ->               );         

+----------+
| count(*) |
+----------+
|   718545 | 
+----------+
1 row in set (8.51 sec)

mysql>

I reproduced this using 5.0.23bk

my results have a ~1.5 sec difference, but the difference is still considerable between both queries.

This performance problem will be fixed in the frame of WL #2980 "Subquery optimizations".

With 6.0.14 I've got the same results as Jorge long time ago:

mysql> select count(*) from tab1;
+----------+
| count(*) |
+----------+
|  1000000 |
+----------+
1 row in set (1.00 sec)

mysql> select count(*) from tab2;
+----------+
| count(*) |
+----------+
|  1000000 |
+----------+
1 row in set (0.19 sec)

mysql> 
mysql> explain
    -> select count(*)
    -> from   tab1 aa
    -> ,      tab2 bb
    -> where  aa.id = bb.id;
+----+-------------+-------+--------+---------------+---------+---------+------------+---------+-------------+
| id | select_type | table | type   | possible_keys | key     | key_len | ref        | rows    | Extra       |
+----+-------------+-------+--------+---------------+---------+---------+------------+---------+-------------+
|  1 | SIMPLE      | aa    | index  | PRIMARY       | PRIMARY | 4       | NULL       | 1000000 | Using index |
|  1 | SIMPLE      | bb    | eq_ref | PRIMARY       | PRIMARY | 4       | test.aa.id |       1 | Using index |
+----+-------------+-------+--------+---------------+---------+---------+------------+---------+-------------+
2 rows in set (0.00 sec)

mysql> explain
    -> select count(*)
    -> from   tab1 aa
    -> where  exists (
    ->               select '' 
    ->               from   tab2 bb
    ->               where  aa.id = bb.id
    ->               );    
+----+--------------------+-------+--------+---------------+---------+---------+------------+---------+--------------------------+
| id | select_type        | table | type   | possible_keys | key     | key_len | ref        | rows    | Extra                    |
+----+--------------------+-------+--------+---------------+---------+---------+------------+---------+--------------------------+
|  1 | PRIMARY            | aa    | index  | NULL          | PRIMARY | 4       | NULL       | 1000000 | Using where; Using index |
|  2 | DEPENDENT SUBQUERY | bb    | eq_ref | PRIMARY       | PRIMARY | 4       | test.aa.id |       1 | Using index              |
+----+--------------------+-------+--------+---------------+---------+---------+------------+---------+--------------------------+
2 rows in set (0.00 sec)

mysql> select count(*)
    -> from   tab1 aa
    -> ,      tab2 bb
    -> where  aa.id = bb.id;
+----------+
| count(*) |
+----------+
|  1000000 |
+----------+
1 row in set (22.04 sec)

mysql> select count(*)
    -> from   tab1 aa
    -> where  exists (
    ->               select '' 
    ->               from   tab2 bb
    ->               where  aa.id = bb.id
    ->               );         
+----------+
| count(*) |
+----------+
|  1000000 |
+----------+
1 row in set (27.63 sec)

mysql> select version();
+--------------------+
| version()          |
+--------------------+
| 6.0.14-alpha-debug |
+--------------------+
1 row in set (0.00 sec)

So, join is faster(!). What exactly is a problem here, other than a request for further, more exotic optimization maybe?

Thank you for testing.
Let's say it's indeed a wish for an exotic tuning optimization.
Maybe good idea to close this call.

As join is faster than subquery with exists now, I'd say that optimizer should be made clever enough, eventually, to rewrite the query using join (or, at least, consider this way of optimization among others).

Duplicate of bug#24972 (based on Valeriy's comment above