Description:
Cost estimates for duplicate weed-out strategy for semijoin seems to be wrong.
The code looks as follows:
double write_cost= join->positions[first_tab].prefix_record_count*
sj_outer_fanout * one_lookup_cost;
double full_lookup_cost= join->positions[first_tab].prefix_record_count*
sj_outer_fanout* sj_inner_fanout *
one_lookup_cost;
The problem is that the number of rows in first_tab is included in both
join->positions[first_tab].prefix_record_count and in sj_inner_fanout.
Hence the number of rows becomes too high. What we want is the number of
rows from the the prefix before first_tab.
This run inidicates that the cost for dup-weedout is too high:
create table t0 (a int);
insert into t0 values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);
create table t1 (a int, b int);
insert into t1 select a,a from t0;
create table t2 (a int, b int);
insert into t2 select A.a + 10*B.a, A.a + 10*B.a from t0 A, t0 B;
explain extended select * from t1 where (a,b) in (select a,b from t2);
id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY t1 ALL NULL NULL NULL NULL 10 100.00
1 PRIMARY t2 ALL NULL NULL NULL NULL 100 100.00 Using where; FirstMatch(t1)
Warnings:
Note 1003 select `test`.`t1`.`a` AS `a`,`test`.`t1`.`b` AS `b` from `test`.`t1` semi join (`test`.`t2`) where ((`test`.`t2`.`b` = `test`.`t1`.`b`) and (`test`.`t2`.`a` = `test`.`t1`.`a`))
show session status like 'Last_query_cost';
Variable_name Value
Last_query_cost 6.240699
set @@optimizer_switch='firstmatch=off';
explain extended select * from t1 where (a,b) in (select a,b from t2);
id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY t1 ALL NULL NULL NULL NULL 10 100.00
1 PRIMARY t2 ALL NULL NULL NULL NULL 100 100.00 Materialize
Warnings:
Note 1003 select `test`.`t1`.`a` AS `a`,`test`.`t1`.`b` AS `b` from `test`.`t1` semi join (`test`.`t2`) where ((`test`.`t2`.`b` = `test`.`t1`.`b`) and (`test`.`t2`.`a` = `test`.`t1`.`a`))
show session status like 'Last_query_cost';
Variable_name Value
Last_query_cost 11.740699
set @@optimizer_switch='materialization=off';
explain extended select * from t1 where (a,b) in (select a,b from t2);
id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY t1 ALL NULL NULL NULL NULL 10 100.00 Start temporary
1 PRIMARY t2 ALL NULL NULL NULL NULL 100 100.00 Using where; End temporary; Using join buffer
Warnings:
Note 1003 select `test`.`t1`.`a` AS `a`,`test`.`t1`.`b` AS `b` from `test`.`t1` semi join (`test`.`t2`) where ((`test`.`t2`.`b` = `test`.`t1`.`b`) and (`test`.`t2`.`a` = `test`.`t1`.`a`))
show session status like 'Last_query_cost';
Variable_name Value
Last_query_cost 5056.240699
drop table t0, t1, t2;
How to repeat:
Run the following, and observer cost estimations:
create table t0 (a int);
insert into t0 values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);
create table t1 (a int, b int);
insert into t1 select a,a from t0;
create table t2 (a int, b int);
insert into t2 select A.a + 10*B.a, A.a + 10*B.a from t0 A, t0 B;
explain extended select * from t1 where (a,b) in (select a,b from t2);
show session status like 'Last_query_cost';
set @@optimizer_switch='firstmatch=off';
explain extended select * from t1 where (a,b) in (select a,b from t2);
show session status like 'Last_query_cost';
set @@optimizer_switch='materialization=off';
explain extended select * from t1 where (a,b) in (select a,b from t2);
show session status like 'Last_query_cost';
Suggested fix:
Change the above code lines to:
double write_cost= prefix_rec_count * sj_outer_fanout * one_lookup_cost;
double full_lookup_cost= write_cost * sj_inner_fanout;
The cost for dup-weedout in the above example will then be 56.740699, which
seems more correct.