Bug #101272 Join cost in explain format tree is wrong
Submitted: 22 Oct 2020 9:01 Modified: 2 Nov 2020 14:18
Reporter: Xuming Zhang Email Updates:
Status: Won't fix Impact on me:
None 
Category:MySQL Server: Optimizer Severity:S3 (Non-critical)
Version:8.0 OS:Any
Assigned to: CPU Architecture:Any

[22 Oct 2020 9:01] Xuming Zhang
Description:
if hash join inner table has single table filter, it will cause hash join has wrong cost.

The right table filter_effect multiplied 2 times, one is in right table filter iterator, the other is in hash join iterator.

How to repeat:
create table t2(id int, val int);

insert into t2 values(rand() * 1000, rand() * 1000);

insert into t2 select rand() * 1000, rand() * 1000 from t2;
insert into t2 select rand() * 1000, rand() * 1000 from t2;
insert into t2 select rand() * 1000, rand() * 1000 from t2;
insert into t2 select rand() * 1000, rand() * 1000 from t2;
insert into t2 select rand() * 1000, rand() * 1000 from t2;
insert into t2 select rand() * 1000, rand() * 1000 from t2;
insert into t2 select rand() * 1000, rand() * 1000 from t2;
insert into t2 select rand() * 1000, rand() * 1000 from t2;
insert into t2 select rand() * 1000, rand() * 1000 from t2;
insert into t2 select rand() * 1000, rand() * 1000 from t2;

explain format=tree select * from t2 straight_join t2 tt2 where t2.id=tt2.id and tt2.val between 1 and 100 \G

EXPLAIN: -> Inner hash join (tt2.id = t2.id)  (cost=1362.31 rows=129)
    -> Filter: (tt2.val between 1 and 100)  (cost=0.09 rows=11)
        -> Table scan on tt2  (cost=0.09 rows=1024)
    -> Hash
        -> Table scan on t2  (cost=103.40 rows=1024)

explain format=tree select * from t2 join t2 tt2 where t2.id=tt2.id and tt2.val between 1 and 100 \G

EXPLAIN: -> Inner hash join (t2.id = tt2.id)  (cost=11754.08 rows=11650)
    -> Table scan on t2  (cost=0.10 rows=1024)
    -> Hash
        -> Filter: (tt2.val between 1 and 100)  (cost=103.40 rows=114)
            -> Table scan on tt2  (cost=103.40 rows=1024)

explain format=tree select * from t2 straight_join t2 tt2 where t2.id=tt2.id and tt2.val \G

EXPLAIN: -> Inner hash join (tt2.id = t2.id)  (cost=104962.02 rows=104858)
    -> Table scan on tt2  (cost=0.01 rows=1024)
    -> Hash
        -> Table scan on t2  (cost=103.40 rows=1024)
[22 Oct 2020 13:56] MySQL Verification Team
Hi Mr. Zhang,

Thank you for your bug report.

I have repeated your test case. My numbers do differ from yours, since I have used 8.0.22, but the overall conclusion is the same:

*************************** 1. row ***************************
EXPLAIN: -> Inner hash join (tt2.id = t2.id)  (cost=1362.10 rows=129)
    -> Filter: (tt2.val between 1 and 100)  (cost=0.09 rows=11)
        -> Table scan on tt2  (cost=0.09 rows=1024)
    -> Hash
        -> Table scan on t2  (cost=102.65 rows=1024)

*************************** 1. row ***************************
EXPLAIN: -> Inner hash join (t2.id = tt2.id)  (cost=11752.58 rows=11650)
    -> Table scan on t2  (cost=0.09 rows=1024)
    -> Hash
        -> Filter: (tt2.val between 1 and 100)  (cost=102.65 rows=114)
            -> Table scan on tt2  (cost=102.65 rows=1024)

*************************** 1. row ***************************
EXPLAIN: -> Inner hash join (tt2.id = t2.id)  (cost=9550.69 rows=8493)
    -> Filter: (0 <> tt2.val)  (cost=0.02 rows=92)
        -> Table scan on tt2  (cost=0.02 rows=1024)
    -> Hash
        -> Table scan on t2  (cost=102.65 rows=1024)

Verified as reported.
[2 Nov 2020 14:18] Erlend Dahl
Comment from hash join developer:

[30 Oct 2020 6:38] Steinar Gunderson

This is true, but it's also not fixable. The cost estimates come from the old join optimizer, which doesn't understand hash join, and doesn't understand that we can push the filter down into the hash table. Thus, the costs will be wrong indeed.