MySQL Bugs: #42620: Wrong semi-join execution strategy for subquery

Bug #42620	Wrong semi-join execution strategy for subquery
Submitted:	5 Feb 2009 12:36	Modified:	17 Sep 2014 4:41
Reporter:	Alexey Stroganov	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Optimizer	Severity:	S3 (Non-critical)
Version:	6.0.10-pre	OS:	Any
Assigned to:	Guilhem Bichot	CPU Architecture:	Any
Tags:	firstmatch, materialization, optimizer_switch, semijoin

Description:
Execution plan for query below has changed in 6.0.10:

EXPLAIN select count(expr_key) from ot where expr_key in
            ( SELECT expr_nokey FROM it ) and ot.expr_key<100000000

6.0.7:

EXPLAIN select count(expr_key) from ot where expr_key in             ( SELECT expr_nokey FROM it ) and ot.expr_key<100000000;
+----+-------------+-------+------+---------------+----------+---------+--------------------+--------+------------------------------+
| id | select_type | table | type | possible_keys | key      | key_len | ref                | rows   | Extra                        |
+----+-------------+-------+------+---------------+----------+---------+--------------------+--------+------------------------------+
|  1 | PRIMARY     | it    | ALL  | NULL          | NULL     | NULL    | NULL               | 320000 | Using where; Start temporary |
|  1 | PRIMARY     | ot    | ref  | expr_key      | expr_key | 4       | test.it.expr_nokey |      1 | Using index; End temporary   |
+----+-------------+-------+------+---------------+----------+---------+--------------------+--------+------------------------------+
2 rows in set (0.00 sec)

Execution time: 

select count(expr_key) from ot where expr_key in             ( SELECT expr_nokey FROM it ) and ot.expr_key<100000000;
+-----------------+
| count(expr_key) |
+-----------------+
|             991 |
+-----------------+
1 row in set (0.13 sec)

6.0.10:
EXPLAIN select count(expr_key) from ot where expr_key in             ( SELECT expr_nokey FROM it ) and ot.expr_key<100000000;
+----+-------------+-------+-------+---------------+----------+---------+------+--------+-----------------------------+
| id | select_type | table | type  | possible_keys | key      | key_len | ref  | rows   | Extra                       |
+----+-------------+-------+-------+---------------+----------+---------+------+--------+-----------------------------+
|  1 | PRIMARY     | ot    | range | expr_key      | expr_key | 4       | NULL |    726 | Using where; Using index    |
|  1 | PRIMARY     | it    | ALL   | NULL          | NULL     | NULL    | NULL | 320000 | Using where; FirstMatch(ot) |
+----+-------------+-------+-------+---------------+----------+---------+------+--------+-----------------------------+
2 rows in set (0.00 sec)

Execution time:
select count(expr_key) from ot where expr_key in             ( SELECT expr_nokey FROM it ) and ot.expr_key<100000000;
+-----------------+
| count(expr_key) |
+-----------------+
|             991 |
+-----------------+
1 row in set (2.91 sec)

How to repeat:
1. Run attached datagen tool to generate and populate database:

perl datagen.pl --generate --ot-rows=10000 --fanout=32  --load-data --database=test --user=root --socket=/tmp/mysql.sock

2. Use following query to check described issue:

select count(expr_key) from ot where expr_key in             ( SELECT expr_nokey FROM it ) and ot.expr_key<100000000;

Datagen tool

Attachment: datagen.pl (application/octet-stream, text), 8.95 KiB.

Not affected by optimizer_switch, optimizer_use_mrr, or engine_condition_pushdown values.

With 5.1 (5.1.38 jperkin@sun.com-20090824091334-6ktgrhq218vl7zq1), time is 7.61 sec and plan is
+----+--------------------+-------+-------+---------------+----------+---------+------+--------+--------------------------+
| id | select_type        | table | type  | possible_keys | key      | key_len | ref  | rows   | Extra                    |
+----+--------------------+-------+-------+---------------+----------+---------+------+--------+--------------------------+
|  1 | PRIMARY            | ot    | range | expr_key      | expr_key | 4       | NULL |    726 | Using where; Using index |
|  2 | DEPENDENT SUBQUERY | it    | ALL   | NULL          | NULL     | NULL    | NULL | 320000 | Using where              |
+----+--------------------+-------+-------+---------------+----------+---------+------+--------+--------------------------+

Now with mysql-next-bugfixing (6.0 code base), revision
jon.hauglid@sun.com-20090826115923-d6516e54fmyxq3nn .

mysql> explain select count(expr_key) from ot where expr_key in             ( SELECT expr_nokey FROM it ) and ot.expr_key<100000000;
+----+-------------+-------+-------+---------------+----------+---------+------+--------+-----------------------------+
| id | select_type | table | type  | possible_keys | key      | key_len | ref  | rows   | Extra                       |
+----+-------------+-------+-------+---------------+----------+---------+------+--------+-----------------------------+
|  1 | PRIMARY     | ot    | range | expr_key      | expr_key | 4       | NULL |    726 | Using where; Using index    |
|  1 | PRIMARY     | it    | ALL   | NULL          | NULL     | NULL    | NULL | 320000 | Using where; FirstMatch(ot) |
+----+-------------+-------+-------+---------------+----------+---------+------+--------+-----------------------------+

mysql> select count(expr_key) from ot where expr_key in             ( SELECT expr_nokey FROM it ) and ot.expr_key<100000000;
+-----------------+
| count(expr_key) |
+-----------------+
|             991 |
+-----------------+
1 row in set (8.91 sec)

I turn off all sub-switches of optimizer_switch:
set optimizer_switch="firstmatch=off,index_merge=off,index_merge_union=off,index_merge_sort_union=off,index_merge_intersection=off,loosescan=off,materialization=off,semijoin=off";
then time is 8.69 sec, and
mysql> explain select count(expr_key) from ot where expr_key in             ( SELECT expr_nokey FROM it ) and ot.expr_key<100000000;+----+--------------------+-------+-------+---------------+----------+---------+------+--------+--------------------------+
| id | select_type        | table | type  | possible_keys | key      | key_len | ref  | rows   | Extra                    |
+----+--------------------+-------+-------+---------------+----------+---------+------+--------+--------------------------+
|  1 | PRIMARY            | ot    | range | expr_key      | expr_key | 4       | NULL |    726 | Using where; Using index |
|  2 | DEPENDENT SUBQUERY | it    | ALL   | NULL          | NULL     | NULL    | NULL | 320000 | Using where              |
+----+--------------------+-------+-------+---------------+----------+---------+------+--------+--------------------------+
(same plan as 5.1).

I turn on all sub-switches of optimizer_switch:
set optimizer_switch="firstmatch=on,index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,loosescan=on,materialization=on,semijoin=on";
and turn them off one by one (all on and one off): only semijoin=off or firstmatch=off make a difference, and firstmatch=off,materialization=off gives the best result and the same plan as 6.0.7; see below.

After doing
set optimizer_switch="semijoin=off";
time goes down to 1.26 sec and
mysql> explain select count(expr_key) from ot where expr_key in             ( SELECT expr_nokey FROM it ) and ot.expr_key<100000000;
+----+-------------+-------+-------+---------------+----------+---------+------+--------+--------------------------+
| id | select_type | table | type  | possible_keys | key      | key_len | ref  | rows   | Extra                    |
+----+-------------+-------+-------+---------------+----------+---------+------+--------+--------------------------+
|  1 | PRIMARY     | ot    | range | expr_key      | expr_key | 4       | NULL |    726 | Using where; Using index |
|  2 | SUBQUERY    | it    | ALL   | NULL          | NULL     | NULL    | NULL | 320000 |                          |
+----+-------------+-------+-------+---------------+----------+---------+------+--------+--------------------------+

After doing
set optimizer_switch="semijoin=on,firstmatch=off";
time is 1.28 sec and
mysql> explain select count(expr_key) from ot where expr_key in             ( SELECT expr_nokey FROM it ) and ot.expr_key<100000000;+----+-------------+-------+-------+---------------+----------+---------+------+--------+--------------------------+
| id | select_type | table | type  | possible_keys | key      | key_len | ref  | rows   | Extra                    |
+----+-------------+-------+-------+---------------+----------+---------+------+--------+--------------------------+
|  1 | PRIMARY     | ot    | range | expr_key      | expr_key | 4       | NULL |    726 | Using where; Using index |
|  1 | PRIMARY     | it    | ALL   | NULL          | NULL     | NULL    | NULL | 320000 | Materialize              |
+----+-------------+-------+-------+---------------+----------+---------+------+--------+--------------------------+

After doing
set optimizer_switch="semijoin=on,firstmatch=off,materialization=off";
time goes to 0.76 sec and
mysql> explain select count(expr_key) from ot where expr_key in             ( SELECT expr_nokey FROM it ) and ot.expr_key<100000000;+----+-------------+-------+------+---------------+----------+---------+--------------------+--------+------------------------------+
| id | select_type | table | type | possible_keys | key      | key_len | ref                | rows   | Extra                        |
+----+-------------+-------+------+---------------+----------+---------+--------------------+--------+------------------------------+
|  1 | PRIMARY     | it    | ALL  | NULL          | NULL     | NULL    | NULL               | 320000 | Using where; Start temporary |
|  1 | PRIMARY     | ot    | ref  | expr_key      | expr_key | 4       | test.it.expr_nokey |      1 | Using index; End temporary   |
+----+-------------+-------+------+---------------+----------+---------+--------------------+--------+------------------------------+
which is the plan of 6.0.7.

In mysql-next-trunk, the 3 parameters seem to be firstmatch, semijoin, materialization:
firstmatch=on, semijoin=on, materialization=on = 8.93 sec
firstmatch=off, semijoin=off, materialization=off = 8.70 sec
firstmatch=off, semijoin=on, materialization=off = 0.76 sec
firstmatch=off, semijoin=off, materialization=on = 1.17 sec
So, note one important fact:
just disabling all 3 parameters makes the query take almost as long as when they are all enabled, and which is 14% slower than 5.1.
So if we released with all parameters off, it would still be a regression.

sorry, I meant "mysql-next-bugfixing" in the last post, not "mysql-next-trunk".

We will look into what causes the performance regression. Then we may fix the problem, or request a re-triage if we decide that the fix is impossible, too risky or too large effort.

plan change happened between 6.0.9 and 6.0.10. Now refining the search.

plan change introduced by sergefp@mysql.com-20080712192605-8ctivu0kz6ih6owu :
  WL#3985: Subquery optimization: smart choice between semi-join and materialization

got a long WL to review, no progress on this one soon.

revision-id:zhenxing.he@sun.com-20091208073923-e8ond3akoolvw6b2

optimizer_switch     chosen strategy                            time(sec)
-------------------------------------------------------------------------
default              Firstmatch                                 10
all off              IN->EXISTS (makes subquery correlated)     10
                     (= what 5.1 does)
firstmatch=off       Semijoin Materialization                   1.27
materialization=on,  Subquery Materialization                   1.27
  semijoin=off
materialization=off, Duplicate Weedout                          0.85
  firstmatch=off

I should retry with a non-debug build to check those times.

with a non-debug build (-O3) on different Linux machine, the times are:
5.6 5.6 0.30 0.30 0.20. Ratios are not the same as the debug build on my first machine, but the order is the same.
The server is started by mtr without --mem.

Some guesses (partly backed by debugging and by EXPLAIN EXTENDED), on the number of operations done by each strategy. Table 'it' contains 320,000 rows, and 'ot' 10,000. The "<100000000" condition, if applied to any of those tables, returns only ~10% of rows. The involved column of 'it' is not indexed, the column of 'ot' is.
* Firstmatch: 1000 (index range scan on 'ot') * 320,000/2 (scan 'it' and stop at first match, on average this stops in the middle of 'it')
* IN->EXISTS: same
* Subquery materialization or semijoin materialization: (materialize 320,000 rows of 'it') + 1000 (index range scan on 'ot') * (1 hash lookup in materialized table)
* Duplicate weedout: 320,000 (scan 'it') + 32,000 (after applying propagated inequality it.expr_nokey<100000000) * (1 B-tree lookup in 'ot') + duplicate elimination.
The specificity of Weedout is that:
- it puts 'it' first in the plan (putting the non-indexed table first is good, avoids repeated scans, which is the problem of FirstMatch)
- it propagates the inequality to 'it', which gives only 10% of 'it' rows for joining.
Materialization is good too but probably loses due to time to materialize the table (which is identical to 'it'), especially as it materializes it all (does not propagate the inequality to 'it' when materializing, so materialized table has 320,000 rows).
FirstMatch does too many repeated scans of 'it'.

If the inequality is modified to not be selective at all (multiply boundary by 10), the numbers are:
Firstmatch and IN->EXISTS: 56 secs
Both materializations: 0.30
Weedout: 1.6.
So this time materializations are best.

This is in relation with WL#3985 "smart choice between firstmatch and materialization": this WL introduced the bug (choose of firstmatch over weedout); for both tests (selective conditions and non-selective), materialization is a better choice than firstmatch, so it must be WL#3985 estimating cost wrongly.

I ran the above query and looked at estimated cost and actual handler accesses:

                    FirstMatch      Materialization     Duplicate Weedout

Last_Query_Cost           2106               322831               1025798

Handler_read_first           1                    1                     0 
Handler_read_key             0                  991                 31712
Handler_read_next          991                  991                 31712
Handler_read_rnd_next  4968310               320001                320001 
Handler_write                0               320000                 31712

Thanks Oystein. I think it matches my theory this way:
* firstmatch: index range scan of "ot" which yields 1000 records (handler_read_next=991), then for each such record, full table scan of "it" (handler_read_rnd_next=4968310, though lower than my estimate which was that the matching record would be on average in the middle of the table).
* materializations: q

Thanks Oystein! I think it matches my theory this way:
* firstmatch: index range scan of "ot" which yields ~1000 records (handler_read_next=991), then for each such record, table scan of "it" (handler_read_rnd_next=4968310, though lower than my estimate which was that the matching record would be on average in the middle of the table).
* materializations: index range scan of "ot" which yields ~1000 records (handler_read_next=991), then for each such record, unique index lookup in materialized table (handler_read_key=991). Plus building the materialized table by scanning "it" (handler_read_rnd_next=320001) and writing to the materialized table (handler_write=320000).
* duplicate weedout: table scan of "it" (handler_read_rnd_next=320001), application of the inequality (keeps 10% of rows), for each such record, a first non-unique index lookup in "ot" (handler_read_key=31712), then a second index lookup (handler_read_rnd_next=31712) (which finds no key, because it turns out that all values in "ot" are distinct), result is sent to tmp table (handler_write=31712) for duplicate eliminations.

Comparison of speed of 6.0 using the plan of 5.1 (via set optimizer_switch="default,materialization=off,semijoin=off") versus 5.1:
                                  6.0 (seconds)   5.1
debug build, one machine:         10              7
non-debug build, other machine:   5.6             5.6

More complete comparison of speed of 6.0 using the plan of 5.1 (via set
optimizer_switch="default,materialization=off,semijoin=off") versus 5.1:

                                  6.0 (seconds)   5.1
debug build, one machine          10              7
non-debug build, same machine     4.3             4.0
debug build, other machine        10              7
non-debug build, other machine    5.6             5.6
Conclusion: debug builds are irrelevant for performance; non-debug build show either no regression between 6.0 and 5.1 (on one machine) or a 8% regression (on another machine). The machine with no regression is a server, the other machine is a laptop, so I'd trust the server results (no regression) more.
Firstmatch has the same timings as 6.0-with-the-5.1 plan.
So I'd conclude that:
- there's no regression between 6.0 and 5.1
- 6.0 is currently less efficient than 6.0.7: 6.0.7 used duplicate weedout which is better than nowadays' 6.0 default (firstmatch).
So I propose to remove it from the regressions list, and investigate if it's possible to make 6.0 prefer duplicate weedout or at least materializations.

The "handler_%" stats for 6.0 running the plan of 5.1, and for 5.1, are identical to the stats of firstmatch. Sounds logical. Also suggests that EXISTS stops at first match even in 5.1.

There is a strategy which is even faster than all others (twice faster than duplicate weedout), for this test scenario: it is "semijoin materialization with scan" (scan the materialized table and for each found row, do an index lookup in the outer table 'ot'). I couldn't provoke it with a normal binary, but adding this *very wrong* line:
reopt_rec_count*=sj_inner_fanout;
right after:
        /*
          We don't yet know what are the other strategies, so pick the
          FirstMatch.

          We ought to save the alternate POSITIONs produced by
          optimize_wo_join_buffering but the problem is that providing save
          space uses too much space. Instead, we will re-calculate the
          alternate POSITIONs after we've picked the best QEP.
        */
accidentally provokes usage of this strategy.
So the fastest strategies are, in order:
sjmat-scan, dup weedout, sjmat-lookup/subq-mat, firstmatch.
It looks like there are several bugs.
First, cost of firstmatch is too low. Even without fixing this, by disabling firstmatch with @@optimizer_switch, it should already pick sjmat-scan, but it rather picks sjmat-lookup, which is a second bug.

There are four issues:
1) cost of firstmatch is too low
2) cost of duplicate weedout is too high
3) cost of sjmat-lookup is lower than cost of sjmat-scan
4) even if it had a very small cost, sjmat-scan would not be chosen because:
4a) we first consider plan ("ot","it") which picks sjmat-lookup (assuming (1) is fixed or firstmatch is off)
4b) then we consider table "it" as partial plan (in bext_extension_by_limited_search()): in advance_sj_state() we cannot yet decide on sjmat-scan so we postpone the decision to the next call to advance_sj_state() (which is when we consider ("it","ot"))
4c) but we never consider ("it","ot") because the partial plan "it" is considered already too costly (reaches "prune_by_cost" in sql_select.cc). This partial plan is said to have cost "cost of full table scan"(1799) + #records(320000)/5 = 65799 which is much higher than cost of sjmat-lookup (17997+726/5) found for "ot","it". Those costs are weird: sjmat-lookup does materialize rows so its cost should exceed the table scan cost. Except if it takes into account the inequality propagation, in which case it materializes only 10% of rows. But in that case, sjmat-scan, which is also capable of using inequality propagation, could have a lower cost than the partial "it" plan.

I did a speed comparison on a third machine (4-core 12-GB RAM brand new, Linux 64-bit, BUILD/compile-pentium64 optimized build): the query takes the same time on 5.1 (5.1.43) and 6.0-codebase-bugfixing (guilhem@mysql.com-20100120101129-vs1cd8equni03sh3): 1.90 secs in both cases. When disabling semijoin and materialization, 6.0-codebase-bugfixing still takes 1.90 secs.
So this is not a regression between 5.1 and 6.0-codebase-bugfixing, so I'm removing tag "regression".
Though 6.0-codebase-bugfixing has the capacity of being 10 or 20 times faster if it would pick the best strategy (semijoin materialization scan).

I'm adding optimizer_switch tag in this sense:
if optimizer_switch=semijoin=off is used, the server is only allowed to choose the 5.1 query plan and so cannot be accused of not choosing the best plan.

I spent weeks on this complex bug and it seems I have to understand the entire WL#3985 patch, plus the cost-based optimizer, to fix this, which is too much for now. Besides it's not a regression so it's not urgent. Moving to other tasks now.

This was fixed in 5.6.