MySQL Bugs: #49349: Wrong cost calculation for duplicate weed-out

Bug #49349	Wrong cost calculation for duplicate weed-out
Submitted:	2 Dec 2009 13:29	Modified:	24 Dec 2012 8:47
Reporter:	Øystein Grøvlen	Email Updates:
Status:	Duplicate	Impact on me:	None
Category:	MySQL Server: Optimizer	Severity:	S3 (Non-critical)
Version:	6.0	OS:	Any
Assigned to:	Assigned Account	CPU Architecture:	Any
Tags:	optimizer_switch, semijoin, subquery

Description:
Cost estimates for duplicate weed-out strategy for semijoin seems to be wrong.
The code looks as follows:

      double write_cost= join->positions[first_tab].prefix_record_count* 
                         sj_outer_fanout * one_lookup_cost;
      double full_lookup_cost= join->positions[first_tab].prefix_record_count* 
                               sj_outer_fanout* sj_inner_fanout * 
                               one_lookup_cost;

The problem is that the number of rows in first_tab is included in both
join->positions[first_tab].prefix_record_count and in sj_inner_fanout.
Hence the number of rows becomes too high. What we want is the number of
rows from the the prefix before first_tab.

This run inidicates that the cost for dup-weedout is too high:

create table t0 (a int);
insert into t0 values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);
create table t1 (a int, b int);
insert into t1 select a,a from t0;
create table t2 (a int, b int);
insert into t2 select A.a + 10*B.a, A.a + 10*B.a from t0 A, t0 B;
explain extended select * from t1 where (a,b) in (select a,b from t2);
id	select_type	table	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	PRIMARY	t1	ALL	NULL	NULL	NULL	NULL	10	100.00	
1	PRIMARY	t2	ALL	NULL	NULL	NULL	NULL	100	100.00	Using where; FirstMatch(t1)
Warnings:
Note	1003	select `test`.`t1`.`a` AS `a`,`test`.`t1`.`b` AS `b` from `test`.`t1` semi join (`test`.`t2`) where ((`test`.`t2`.`b` = `test`.`t1`.`b`) and (`test`.`t2`.`a` = `test`.`t1`.`a`))
show session status like 'Last_query_cost';
Variable_name	Value
Last_query_cost	6.240699
set @@optimizer_switch='firstmatch=off';
explain extended select * from t1 where (a,b) in (select a,b from t2);
id	select_type	table	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	PRIMARY	t1	ALL	NULL	NULL	NULL	NULL	10	100.00	
1	PRIMARY	t2	ALL	NULL	NULL	NULL	NULL	100	100.00	Materialize
Warnings:
Note	1003	select `test`.`t1`.`a` AS `a`,`test`.`t1`.`b` AS `b` from `test`.`t1` semi join (`test`.`t2`) where ((`test`.`t2`.`b` = `test`.`t1`.`b`) and (`test`.`t2`.`a` = `test`.`t1`.`a`))
show session status like 'Last_query_cost';
Variable_name	Value
Last_query_cost	11.740699
set @@optimizer_switch='materialization=off';
explain extended select * from t1 where (a,b) in (select a,b from t2);
id	select_type	table	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	PRIMARY	t1	ALL	NULL	NULL	NULL	NULL	10	100.00	Start temporary
1	PRIMARY	t2	ALL	NULL	NULL	NULL	NULL	100	100.00	Using where; End temporary; Using join buffer
Warnings:
Note	1003	select `test`.`t1`.`a` AS `a`,`test`.`t1`.`b` AS `b` from `test`.`t1` semi join (`test`.`t2`) where ((`test`.`t2`.`b` = `test`.`t1`.`b`) and (`test`.`t2`.`a` = `test`.`t1`.`a`))
show session status like 'Last_query_cost';
Variable_name	Value
Last_query_cost	5056.240699
drop table t0, t1, t2;

How to repeat:
Run the following, and observer cost estimations:

create table t0 (a int);
insert into t0 values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);
create table t1 (a int, b int);
insert into t1 select a,a from t0;
create table t2 (a int, b int); 
insert into t2 select A.a + 10*B.a, A.a + 10*B.a from t0 A, t0 B;

explain extended select * from t1 where (a,b) in (select a,b from t2);
show session status like 'Last_query_cost';

set @@optimizer_switch='firstmatch=off';
explain extended select * from t1 where (a,b) in (select a,b from t2);
show session status like 'Last_query_cost';

set @@optimizer_switch='materialization=off';
explain extended select * from t1 where (a,b) in (select a,b from t2);
show session status like 'Last_query_cost';

Suggested fix:
Change the above code lines to:

      double write_cost= prefix_rec_count * sj_outer_fanout * one_lookup_cost;
      double full_lookup_cost= write_cost * sj_inner_fanout;

The cost for dup-weedout in the above example will then be 56.740699, which
seems more correct.

I think there is another issue with duplicate-weedout cost.  Fan out for inner tables that come before outer tables in the weedout range, need to be taken 
into account when computing the number of accesses to the outer table. I suggest the following change:

@@ -13478,7 +13478,8 @@
         }
         else
         {
-          sj_outer_fanout *= p->records_read;
+          sj_outer_fanout*= p->records_read * sj_inner_fanout;
+          sj_inner_fanout= 1.0;
           temptable_rec_size += p->table->table->file->ref_length;
         }

Justifying example:
Inner table is scanned and an index is used to look up matching records in the
outer table.  p->records_read for the outer table will be 1 if there is on average on row per key.  However, the total number of records that need to be
stored is number of rows in the preceding inner table (assuming 1-1 matching).

This was fixed in 5.6.4.