MySQL Bugs: #99244: bad performance in hash join because join with no join condition between tables

Bug #99244	bad performance in hash join because join with no join condition between tables
Submitted:	14 Apr 2020 3:22	Modified:	21 Apr 2020 16:24
Reporter:	chunyang xu	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: Optimizer	Severity:	S1 (Critical)
Version:	8.0.19	OS:	Any
Assigned to:		CPU Architecture:	Any
Tags:	bad performance, Cartesian Product temp result., hash join

Description:
low  performance in hash join because join with no join condition between tables, Temporary rows of Cartesian product quantity causes disk space to run out。 

How to repeat:
create table  t1 ( t1_id varchar(20),t1_name varchar(20) ,t1_addr varchar(20));

create table  t2 ( t2_id varchar(20),t2_name varchar(20) ,t2_addr varchar(20));

create table  t3 ( t3_id varchar(20),t3_name varchar(20) ,t3_addr varchar(20));

insert into t1  values('a','a','a');

insert into t2  values('a','a','a');

insert into t3  values('a','a','a');

then execute the following SQL:
select * from t1,t2,t3 where t1_id=t2_id and t2_addr=t3_addr ;

the SQL plan as bollowed
 -> Inner hash join (t3.t3_addr = t2.t2_addr)  (cost=1.05 rows=1)
    -> Table scan on t3  (cost=0.35 rows=1)
    -> Hash
        -> Inner hash join (t2.t2_id = t1.t1_id)  (cost=0.70 rows=1)
            -> Table scan on t2  (cost=0.35 rows=1)
            -> Hash
                -> Table scan on t1  (cost=0.35 rows=1)
 

the hash join order of SQL plan is "T1->T2->T3" ,it normal join order.

and then , insert  more rows into T2,   and the SQL plan is changed. 

insert into t2  values('a','a','a');

insert into t2  values('a','a','a');

insert into t2  values('a','a','a');

insert into t2  values('a','a','a');

insert into t2  values('a','a','a');

insert into t2  values('a','a','a');

insert into t2  values('a','a','a');

insert into t2  values('a','a','a');

insert into t2  values('a','a','a');

insert into t2  values('a','a','a');

-> Inner hash join (t2.t2_addr = t3.t3_addr), (t2.t2_id = t1.t1_id)  (cost=2.05 rows=1)
    -> Table scan on t2  (cost=0.35 rows=11)
    -> Hash
        -> Inner hash join  (cost=0.70 rows=1)
            -> Table scan on t3  (cost=0.35 rows=1)
            -> Hash
                -> Table scan on t1  (cost=0.35 rows=1)

the join order is "T1->T3->T2", but no join condition between T1  and T3, it will product lots of Temporary rows  , if t1 and t3  both have much rows , it causes disk space to run out。

Suggested fix:
Avoid joins without join conditions between more that two tables join

Hi Mr. xu,

Thank you for your bug report.

I have run your test case, in several different EXPLAIN formats. It is clear to me that you are not actually correct.

In the first case, we have a linear relationship T1 -> T2 -> T3.

This relationship had to change, since you have added rows to T2, which has no indices. That has changed a relationship from a linear one to the hierarchical one. Hence, what you have now is T2 ->T3 and T2->T1, without any interaction between T1 and T3. Looks something like this:

                                              T2
                                             /  \
                                          T3  T1

So, everything is fine and this is not a bug.

hell ,   if you will, you can do a lab like this .

insert into t1 values('a0','name0','addr0'); 
insert into t1 values('a1','name1','addr1'); 
insert into t1 values('a2','name2','addr2'); 
.........

insert into t1 values('a998','name998','addr998'); 
insert into t1 values('a999','name999','addr999'); 

T1 table has  1000  records.

then  insert recort into t2,  50000 records.

insert into t2 values('a0','name0','addr0'); 
insert into t2 values('a1','name1','addr1'); 
insert into t2 values('a2','name2','addr2'); 
insert into t2 values('a3','name3','addr3'); 
insert into t2 values('a4','name4','addr4'); 
....................................
......................
insert into t2 values('a49997','name49997','addr49997'); 
insert into t2 values('a49998','name49998','addr49998'); 
insert into t2 values('a49999','name49999','addr49999'); 

and then  for T3,  total 4000 record.

insert into t3 values('a0','name0','addr0'); 
insert into t3 values('a1','name1','addr1'); 
insert into t3 values('a2','name2','addr2'); 
insert into t3 values('a3','name3','addr3'); 
insert into t3 values('a4','name4','addr4'); 
insert into t3 values('a5','name5','addr5'); 
..................................
............................
insert into t3 values('a3997','name3997','addr3997'); 
insert into t3 values('a3998','name3998','addr3998'); 
insert into t3 values('a3999','name3999','addr3999');

then   test the following SQL

first SQL
select count(*) from  t1 straight_join t2 straight_join t3 where t1_id=t2_id and t2_addr=t3_addr ;

mysql> select count(*) from  t1 straight_join t2 straight_join t3 where t1_id=t2_id and t2_addr=t3_addr ;
+----------+
| count(*) |
+----------+
|     1021 |
+----------+
1 row in set (1.57 sec)

secord SQL

select count(*) from  t1 , t2 ,t3 where t1_id=t2_id and t2_addr=t3_addr ;

must wait a long time before the result come out.

Here is a testcase:
--
drop table if exists t1,t2,t3;
create table  t1 ( t1_id varchar(20),t1_name varchar(20) ,t1_addr varchar(20));
create table  t2 ( t2_id varchar(20),t2_name varchar(20) ,t2_addr varchar(20));
create table  t3 ( t3_id varchar(20),t3_name varchar(20) ,t3_addr varchar(20));
insert into t1 values('a0','name0','addr0'),('a1','name1','addr1'),('a2','name2','addr2'); 
set @n:=3;
insert into t1 select concat('a',@n:=@n+1),concat('name',@n),concat('addr',@n) from t1 a,t1 b,t1 c,t1 d,t1 e;
insert into t1 select concat('a',@n:=@n+1),concat('name',@n),concat('addr',@n) from t1 a,t1 b,t1 c limit 9150;
select count(*) from t1;

insert into t2 values('a0','name0','addr0'),('a1','name1','addr1'),('a2','name2','addr2'); 
set @n:=3;
insert into t2 select concat('a',@n:=@n+1),concat('name',@n),concat('addr',@n) from t2 a,t2 b,t2 c,t2 d,t2 e;
insert into t2 select concat('a',@n:=@n+1),concat('name',@n),concat('addr',@n) from t2 a,t2 b,t2 c limit 497540;
select count(*) from t2;

insert into t3 values('a0','name0','addr0'),('a1','name1','addr1'),('a2','name2','addr2'); 
set @n:=3;
insert into t3 select concat('a',@n:=@n+1),concat('name',@n),concat('addr',@n) from t3 a,t3 b,t3 c,t3 d,t3 e;
insert into t3 select concat('a',@n:=@n+1),concat('name',@n),concat('addr',@n) from t3 a,t3 b,t3 c limit 37540;
select count(*) from t3;

analyze table t1,t2,t3;

explain select count(*) from  t1 straight_join t2 straight_join t3 where t1_id=t2_id and t2_addr=t3_addr ;
explain analyze select count(*) from  t1 straight_join t2 straight_join t3 where t1_id=t2_id and t2_addr=t3_addr ;
flush status;
select count(*) from  t1 straight_join t2 straight_join t3 where t1_id=t2_id and t2_addr=t3_addr ;
show status like 'handler%';

explain select count(*) from  t1 , t2 ,t3 where t1_id=t2_id and t2_addr=t3_addr ;
explain analyze select count(*) from  t1 , t2 ,t3 where t1_id=t2_id and t2_addr=t3_addr ;
flush status;
select count(*) from  t1 , t2 ,t3 where t1_id=t2_id and t2_addr=t3_addr ;
show status like 'handler%';

---

I will attach the outputs shortly.

output of testcase on 8.0.19

Attachment: o.txt (text/plain), 8.61 KiB.

what the "perf" looks like for the explain analyze and select queries

Attachment: perf_profile_8.0.19.txt (text/plain), 18.31 KiB.