Description:
Cost estimates for duplicate weed-out strategy for semijoin seems to be wrong.
The code looks as follows:
double write_cost= join->positions[first_tab].prefix_record_count*
sj_outer_fanout * one_lookup_cost;
double full_lookup_cost= join->positions[first_tab].prefix_record_count*
sj_outer_fanout* sj_inner_fanout *
one_lookup_cost;
The problem is that the number of rows in first_tab is included in both
join->positions[first_tab].prefix_record_count and in sj_inner_fanout.
Hence the number of rows becomes too high. What we want is the number of
rows from the the prefix before first_tab.
This run inidicates that the cost for dup-weedout is too high:
create table t0 (a int);
insert into t0 values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);
create table t1 (a int, b int);
insert into t1 select a,a from t0;
create table t2 (a int, b int);
insert into t2 select A.a + 10*B.a, A.a + 10*B.a from t0 A, t0 B;
explain extended select * from t1 where (a,b) in (select a,b from t2);
id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY t1 ALL NULL NULL NULL NULL 10 100.00
1 PRIMARY t2 ALL NULL NULL NULL NULL 100 100.00 Using where; FirstMatch(t1)
Warnings:
Note 1003 select `test`.`t1`.`a` AS `a`,`test`.`t1`.`b` AS `b` from `test`.`t1` semi join (`test`.`t2`) where ((`test`.`t2`.`b` = `test`.`t1`.`b`) and (`test`.`t2`.`a` = `test`.`t1`.`a`))
show session status like 'Last_query_cost';
Variable_name Value
Last_query_cost 6.240699
set @@optimizer_switch='firstmatch=off';
explain extended select * from t1 where (a,b) in (select a,b from t2);
id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY t1 ALL NULL NULL NULL NULL 10 100.00
1 PRIMARY t2 ALL NULL NULL NULL NULL 100 100.00 Materialize
Warnings:
Note 1003 select `test`.`t1`.`a` AS `a`,`test`.`t1`.`b` AS `b` from `test`.`t1` semi join (`test`.`t2`) where ((`test`.`t2`.`b` = `test`.`t1`.`b`) and (`test`.`t2`.`a` = `test`.`t1`.`a`))
show session status like 'Last_query_cost';
Variable_name Value
Last_query_cost 11.740699
set @@optimizer_switch='materialization=off';
explain extended select * from t1 where (a,b) in (select a,b from t2);
id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY t1 ALL NULL NULL NULL NULL 10 100.00 Start temporary
1 PRIMARY t2 ALL NULL NULL NULL NULL 100 100.00 Using where; End temporary; Using join buffer
Warnings:
Note 1003 select `test`.`t1`.`a` AS `a`,`test`.`t1`.`b` AS `b` from `test`.`t1` semi join (`test`.`t2`) where ((`test`.`t2`.`b` = `test`.`t1`.`b`) and (`test`.`t2`.`a` = `test`.`t1`.`a`))
show session status like 'Last_query_cost';
Variable_name Value
Last_query_cost 5056.240699
drop table t0, t1, t2;
How to repeat:
Run the following, and observer cost estimations:
create table t0 (a int);
insert into t0 values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);
create table t1 (a int, b int);
insert into t1 select a,a from t0;
create table t2 (a int, b int);
insert into t2 select A.a + 10*B.a, A.a + 10*B.a from t0 A, t0 B;
explain extended select * from t1 where (a,b) in (select a,b from t2);
show session status like 'Last_query_cost';
set @@optimizer_switch='firstmatch=off';
explain extended select * from t1 where (a,b) in (select a,b from t2);
show session status like 'Last_query_cost';
set @@optimizer_switch='materialization=off';
explain extended select * from t1 where (a,b) in (select a,b from t2);
show session status like 'Last_query_cost';
Suggested fix:
Change the above code lines to:
double write_cost= prefix_rec_count * sj_outer_fanout * one_lookup_cost;
double full_lookup_cost= write_cost * sj_inner_fanout;
The cost for dup-weedout in the above example will then be 56.740699, which
seems more correct.
Description: Cost estimates for duplicate weed-out strategy for semijoin seems to be wrong. The code looks as follows: double write_cost= join->positions[first_tab].prefix_record_count* sj_outer_fanout * one_lookup_cost; double full_lookup_cost= join->positions[first_tab].prefix_record_count* sj_outer_fanout* sj_inner_fanout * one_lookup_cost; The problem is that the number of rows in first_tab is included in both join->positions[first_tab].prefix_record_count and in sj_inner_fanout. Hence the number of rows becomes too high. What we want is the number of rows from the the prefix before first_tab. This run inidicates that the cost for dup-weedout is too high: create table t0 (a int); insert into t0 values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9); create table t1 (a int, b int); insert into t1 select a,a from t0; create table t2 (a int, b int); insert into t2 select A.a + 10*B.a, A.a + 10*B.a from t0 A, t0 B; explain extended select * from t1 where (a,b) in (select a,b from t2); id select_type table type possible_keys key key_len ref rows filtered Extra 1 PRIMARY t1 ALL NULL NULL NULL NULL 10 100.00 1 PRIMARY t2 ALL NULL NULL NULL NULL 100 100.00 Using where; FirstMatch(t1) Warnings: Note 1003 select `test`.`t1`.`a` AS `a`,`test`.`t1`.`b` AS `b` from `test`.`t1` semi join (`test`.`t2`) where ((`test`.`t2`.`b` = `test`.`t1`.`b`) and (`test`.`t2`.`a` = `test`.`t1`.`a`)) show session status like 'Last_query_cost'; Variable_name Value Last_query_cost 6.240699 set @@optimizer_switch='firstmatch=off'; explain extended select * from t1 where (a,b) in (select a,b from t2); id select_type table type possible_keys key key_len ref rows filtered Extra 1 PRIMARY t1 ALL NULL NULL NULL NULL 10 100.00 1 PRIMARY t2 ALL NULL NULL NULL NULL 100 100.00 Materialize Warnings: Note 1003 select `test`.`t1`.`a` AS `a`,`test`.`t1`.`b` AS `b` from `test`.`t1` semi join (`test`.`t2`) where ((`test`.`t2`.`b` = `test`.`t1`.`b`) and (`test`.`t2`.`a` = `test`.`t1`.`a`)) show session status like 'Last_query_cost'; Variable_name Value Last_query_cost 11.740699 set @@optimizer_switch='materialization=off'; explain extended select * from t1 where (a,b) in (select a,b from t2); id select_type table type possible_keys key key_len ref rows filtered Extra 1 PRIMARY t1 ALL NULL NULL NULL NULL 10 100.00 Start temporary 1 PRIMARY t2 ALL NULL NULL NULL NULL 100 100.00 Using where; End temporary; Using join buffer Warnings: Note 1003 select `test`.`t1`.`a` AS `a`,`test`.`t1`.`b` AS `b` from `test`.`t1` semi join (`test`.`t2`) where ((`test`.`t2`.`b` = `test`.`t1`.`b`) and (`test`.`t2`.`a` = `test`.`t1`.`a`)) show session status like 'Last_query_cost'; Variable_name Value Last_query_cost 5056.240699 drop table t0, t1, t2; How to repeat: Run the following, and observer cost estimations: create table t0 (a int); insert into t0 values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9); create table t1 (a int, b int); insert into t1 select a,a from t0; create table t2 (a int, b int); insert into t2 select A.a + 10*B.a, A.a + 10*B.a from t0 A, t0 B; explain extended select * from t1 where (a,b) in (select a,b from t2); show session status like 'Last_query_cost'; set @@optimizer_switch='firstmatch=off'; explain extended select * from t1 where (a,b) in (select a,b from t2); show session status like 'Last_query_cost'; set @@optimizer_switch='materialization=off'; explain extended select * from t1 where (a,b) in (select a,b from t2); show session status like 'Last_query_cost'; Suggested fix: Change the above code lines to: double write_cost= prefix_rec_count * sj_outer_fanout * one_lookup_cost; double full_lookup_cost= write_cost * sj_inner_fanout; The cost for dup-weedout in the above example will then be 56.740699, which seems more correct.