Bug #41220 LEFT JOIN with compound clause uses table scan, while INNER JOIN uses indexes
Submitted: 4 Dec 2008 9:59 Modified: 18 Aug 2009 8:43
Reporter: Geoff Winkless Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: Optimizer Severity:S5 (Performance)
Version:5.1.30, 5.0, 5.1, azalea bzr OS:Any
Assigned to: CPU Architecture:Any

[4 Dec 2008 9:59] Geoff Winkless
Description:
Attempting to run a LEFT JOIN against two tables with a compound clause (is that the right term?) causes MySQL to run a full cross-join and then a table scan whereas an INNER JOIN copes fine.

I've marked this as "serious" because although technically the query does eventually return the length of time taken makes it effectively unusable and (as far as I can see) there's no workaround.

Exhibited in 5.0.15 (used on a legacy server running Redhat 9) but I ran the same data and query on Windows on 5.1.30 to make sure the bug wasn't already fixed in a later version (or a Linux-only problem!) and got the same result.

What surprised me is even if the indexes can't be used because of the compound JOIN, the WHERE on t1.dt could be used as a selector to bring down the number of records before the cross-join.

mysql> explain SELECT t1id FROM t1 INNER JOIN t2 ON t1.f1=t2.f1 OR t1.f2=t2.f1 WHERE t1.dt=20070627;
+----+-------------+-------+------+-----------------------------------------+---------------------+---------+-------+--------+------------------------------------------------+
| id | select_type | table | type | possible_keys         | key        | key_len | ref   | rows   | Extra                                          |
+----+-------------+-------+------+-----------------------+------------+-------+-------+--------+------------------------------------------------+
|  1 | SIMPLE      | t1    | ref  | idx_t1dtf1,idx_t1dtf2 | idx_t1dtf1 | 5       | const |   1724 | Using where                                    |
|  1 | SIMPLE      | t2    | ALL  | idx_t2f1              | NULL       | NULL    | NULL  | 144547 | Range checked for each record (index map: 0x2) |
+----+-------------+-------+------+-----------------------+------------+-------+-------+--------+------------------------------------------------+
2 rows in set (0.00 sec)
mysql> explain SELECT t1id FROM t1 LEFT JOIN t2 ON t1.f1=t2.f1 OR t1.f2=t2.f1 WHERE t1.dt=20070627;
+----+-------------+-------+-------+-----------------------+------------+---------+-------+--------+-------------+
| id | select_type | table | type  | possible_keys         | key        | key_len | ref   | rows   | Extra       |
+----+-------------+-------+-------+-----------------------+------------+---------+-------+--------+-------------+
|  1 | SIMPLE      | t1    | ref   | idx_t1dtf1,idx_t1dtf2 | idx_t1dtf1 | 5       | const |   1724 | Using where |
|  1 | SIMPLE      | t2    | index | idx_t2f1              | idx_t2f1   | 5       | NULL  | 144547 | Using index |
+----+-------------+-------+-------+-----------------------+------------+---------+-------+--------+-------------+
2 rows in set (0.00 sec)

Apologies if this is a dupe; I did search for the terms I could think of but I'm not 100% sure whether I used the right search terms (is it a compound join?)

How to repeat:
Create two tables

CREATE TABLE `t1` (
  `t1id` int(11) NOT NULL DEFAULT '0',
  `dt` int(11) DEFAULT NULL,
  `f1` int(11) DEFAULT NULL,
  `f2` int(11) DEFAULT NULL,
  PRIMARY KEY (`t1id`),
  KEY `idx_t1dtf1` (`dt`,`f1`),
  KEY `idx_t1dtf2` (`dt`,`f2`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1
CREATE TABLE `t2` (
  `t2id` int(11) NOT NULL DEFAULT '0',
  `f1` int(11) DEFAULT NULL,
  PRIMARY KEY (`t2id`),
  KEY `idx_t2f1` (`f1`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1

Fill t1.f1, t2.f1 and t2.f2 with a medium number of records (in my case the set was 144547 and 196093). They can be random values, although t1.dt should have a couple of thousand with the value 1000 (don't care about the rest). Then run

SELECT t1id FROM t1 LEFT JOIN t2 ON t1.f1=t2.f1 OR t1.f2=t2.f1 WHERE t1.dt=1000

You should see something like this in the slow query log:

# Query_time: 227.989358  Lock_time: 0.000000 Rows_sent: 2878  Rows_examined: 285771396

Note that Rows_examined is count(t1) * count(t2).

Whereas running 

SELECT t1id FROM t1 INNER JOIN t2 ON t1.f1=t2.f1 OR t1.f2=t2.f1 WHERE t1.dt=1000

takes under a second.
[4 Dec 2008 10:38] Geoff Winkless
For what it's worth, there is a workaround but it's tortuous for any query more complex than this one (as my real-world query will be!):

SELECT t1id FROM t1 LEFT JOIN t2 ON t1.f1=t2.f1 WHERE t1.dt=20070627 UNION DISTINCT SELECT t1id FROM t1 LEFT JOIN t2 ON t1.f2=t2.f1 WHERE t1.dt=20070627;

Ideally I'd like the optimizer to do that for me.
[4 Dec 2008 10:39] Giuseppe Maxia
I can reproduce this problem using the test below on 5.0.67 and 5.1.30.

drop table if exists t1,t2;

CREATE TABLE `t1` (
          `t1id` int(11) NOT NULL DEFAULT '0',
          `dt` int(11) DEFAULT NULL,
          `f1` int(11) DEFAULT NULL,
          `f2` int(11) DEFAULT NULL,
          PRIMARY KEY (`t1id`),
          KEY `idx_t1dtf1` (`dt`,`f1`),
          KEY `idx_t1dtf2` (`dt`,`f2`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
CREATE TABLE `t2` (
          `t2id` int(11) NOT NULL DEFAULT '0',
          `f1` int(11) DEFAULT NULL,
          PRIMARY KEY (`t2id`),
          KEY `idx_t2f1` (`f1`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

drop procedure if exists p1;

delimiter //

create procedure p1()
deterministic
begin
    declare i int default 0;
    while i < 30000 do
        insert into t1 values (i, if(mod(i,1000) = 0,1000,rand()*i*100), rand() *100, rand()*100);
        insert into t2 values (i, i);
        set i = i + 1;
    end while;
end//

delimiter ;

call p1();

explain SELECT count(t1id) FROM t1 LEFT JOIN t2 ON t1.f1=t2.f1 OR t1.f2=t2.f1 WHERE t1.dt=1000\G
explain SELECT count(t1id) FROM t1 inner JOIN t2 ON t1.f1=t2.f1 OR t1.f2=t2.f1 WHERE t1.dt=1000\G

SELECT count(t1id) FROM t1 inner JOIN t2 ON t1.f1=t2.f1 OR t1.f2=t2.f1 WHERE t1.dt=1000;
SELECT count(t1id) FROM t1 LEFT JOIN t2 ON t1.f1=t2.f1 OR t1.f2=t2.f1 WHERE t1.dt=1000;
[4 Dec 2008 11:00] Geoff Winkless
Ignore my last comment - the UNION DISTINCT workaround doesn't work properly because you end up with more records from the UNION than the original JOIN: it only works if you don't select any data from t2.
[4 Dec 2008 14:23] Valeriy Kravchuk
Thank you for a problem report. Please, send the results of

SELECT count(*) FROM t1 WHERE t1.dt=1000;

It looks like server just selects all rows from t1 where dt=1000 (a couple of thousands) using index(!) and for each of these rows any row from t2 fits actually because LEFT JOIN (so t2 is scanned entirely). Hence the number of rows examined.

It looks like NOT a bug for me.
[4 Dec 2008 14:36] Geoff Winkless
Valeriy,

Without wishing to be rude I think you've misread the query.

It's a _left_ join, so it's "2000 rows from table 1 joined to any _matching_ rows (or NULL) from table 2". 

The query you describe would be

SELECT count(t2id) FROM t2 LEFT JOIN t1 ON (t1.f1=t2.f1 OR t1.f2=t2.f1) AND t1.dt=1000

- which would be somewhat bizarre.

Hope I haven't misunderstood you.

Geoff
[4 Dec 2008 15:02] Valeriy Kravchuk
Giuseppe,

Maybe you can explain me where is the bug here? How the LEFT JOIN query should be executed optimally? Can you re-write query or add FORCE INDEX or other hint and get less rows accessed and faster execution?
[4 Dec 2008 16:05] Geoff Winkless
I don't know about the internals of how MySQL works but (from a layman's perspective) what I want it to do is exactly the same as it does for the INNER JOIN (since the INNER JOIN works perfectly) but without removing those rows from t1 that don't match t2.

For what it's worth, using

SELECT t1.t1id, FROM t1 LEFT JOIN (select DISTINCT t2.t2id, t1id FROM t1 INNER JOIN t2 ON t1.f1=t2.f1 OR t1.f2=t2.f1 WHERE t1.dt=1000) AS tmptab ON t1.t1id=tmptab.t1id WHERE t1.dt=1000;

seems to do the trick, although I'm still checking my logic to make sure I haven't messed that up.
[17 Apr 2009 9:18] Geoff Winkless
Any update on this? The workaround in my last comment does work (LEFT joining against an INNER JOIN'd subquery) but ideally we shouldn't have to do this (it becomes very messy with any query more complex than the one listed). 

Is it not an optimizer problem - should it be reassigned to "General"? I couldn't really see any other category to which it fits.

Thanks!
[14 Aug 2009 9:48] Sveta Smirnova
Thank you for the feedback.

Verified as described. Performance degradation more noticeable if modify test so there is no fields with possible NULL:

drop table if exists t1,t2;

CREATE TABLE `t1` (
          `t1id` int(11) NOT NULL DEFAULT '0',
          `dt` int(11) DEFAULT NULL,
          `f1` int(11) NOT NULL,
          `f2` int(11) NOT NULL,
          PRIMARY KEY (`t1id`),
          KEY `idx_t1dtf1` (`dt`,`f1`),
          KEY `idx_t1dtf2` (`dt`,`f2`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
CREATE TABLE `t2` (
          `t2id` int(11) NOT NULL DEFAULT '0',
          `f1` int(11) NOT NULL,
          PRIMARY KEY (`t2id`),
          KEY `idx_t2f1` (`f1`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

drop procedure if exists p1;

delimiter //;

create procedure p1()
deterministic
begin
    declare i int default 0;
    while i < 30000 do
        insert into t1 values (i, if(mod(i,1000) = 0,1000,rand()*i*100), rand() *100,rand()*100);
        insert into t2 values (i, i);
        set i = i + 1;
    end while;
end//

delimiter ;//

call p1();

--vertical_results
explain SELECT count(t1id) FROM t1 LEFT JOIN t2 ON t1.f1=t2.f1 OR t1.f2=t2.f1 WHERE
t1.dt=1000;
explain SELECT count(t1id) FROM t1 INNER JOIN t2 ON t1.f1=t2.f1 OR t1.f2=t2.f1 WHERE
t1.dt=1000;
explain SELECT count(t1id) FROM t1 LEFT JOIN t2 FORCE INDEX(idx_t2f1) ON t1.f1=t2.f1 OR t1.f2=t2.f1 WHERE t1.dt=1000;
explain SELECT count(t1id) FROM t1 LEFT JOIN t2 IGNORE INDEX(idx_t2f1) ON t1.f1=t2.f1 OR t1.f2=t2.f1 WHERE t1.dt=1000;

select now();
SELECT count(t1id) FROM t1 inner JOIN t2 ON t1.f1=t2.f1 OR t1.f2=t2.f1 WHERE t1.dt=1000;
select now();
SELECT count(t1id) FROM t1 LEFT JOIN t2 ON t1.f1=t2.f1 OR t1.f2=t2.f1 WHERE t1.dt=1000;
select now();
SELECT count(t1id) FROM t1 LEFT JOIN t2 FORCE INDEX(idx_t2f1) ON t1.f1=t2.f1 OR t1.f2=t2.f1 WHERE t1.dt=1000;
select now();
SELECT count(t1id) FROM t1 LEFT JOIN t2 IGNORE INDEX(idx_t2f1) ON t1.f1=t2.f1 OR t1.f2=t2.f1 WHERE t1.dt=1000;
select now();

Result:

...
select now();
now()   2009-08-14 12:47:07
SELECT count(t1id) FROM t1 inner JOIN t2 ON t1.f1=t2.f1 OR t1.f2=t2.f1 WHERE t1.dt=1000;
count(t1id)     61
select now();
now()   2009-08-14 12:47:07
SELECT count(t1id) FROM t1 LEFT JOIN t2 ON t1.f1=t2.f1 OR t1.f2=t2.f1 WHERE t1.dt=1000;
count(t1id)     61
select now();
now()   2009-08-14 12:47:10
SELECT count(t1id) FROM t1 LEFT JOIN t2 FORCE INDEX(idx_t2f1) ON t1.f1=t2.f1 OR t1.f2=t2.f1 WHERE t1.dt=1000;
count(t1id)     61
select now();
now()   2009-08-14 12:47:14
SELECT count(t1id) FROM t1 LEFT JOIN t2 IGNORE INDEX(idx_t2f1) ON t1.f1=t2.f1 OR t1.f2=t2.f1 WHERE t1.dt=1000;
count(t1id)     61
select now();
now()   2009-08-14 12:47:26
[18 Aug 2009 8:43] Geoff Winkless
How is it a "feature request" to expect the server to be able to handle a left join on a few decent-sized tables without a >200s wait?

I also can't see how it's a "performance" issue; the speed is slow enough that it might as well not return.

Since I'm working on a legacy system, which is unlikely to receive the benefit of any bugfix in a later version, finding a solution doesn't really affect me; I just thought, as a software developer myself, that you guys might be proud enough of your product to want to make it better. If that's not the case, then good luck to you.

I've reached the point where I feel like I'm flogging a dead horse; as such this will be the last comment from me on this bug, unless I'm asked a specific question.