Bug #36449 | Significant slowdown for (ANY|SOME) subqueries processed with semijoin | ||
---|---|---|---|
Submitted: | 1 May 2008 10:09 | Modified: | 29 Dec 2008 17:22 |
Reporter: | Alexey Stroganov | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Server: Optimizer | Severity: | S2 (Serious) |
Version: | 6.0.6-pre | OS: | Any |
Assigned to: | Sergey Petrunya | CPU Architecture: | Any |
Tags: | subquery benchmark |
[1 May 2008 10:09]
Alexey Stroganov
[1 May 2008 10:10]
Alexey Stroganov
SQL file for the issue
Attachment: bug.subq.any.some.dump (application/octet-stream, text), 310.46 KiB.
[8 May 2008 5:11]
Sergey Petrunya
More data gathering: MySQL 6.0.2 didn't include any 6.0 subquery optimizations. EXPLAINs and query times: mysql> explain select * from bench5 where c = any (select c from bench6)\G *************************** 1. row *************************** id: 1 select_type: PRIMARY table: bench5 type: ALL possible_keys: NULL key: NULL key_len: NULL ref: NULL rows: 10000 Extra: Using where *************************** 2. row *************************** id: 2 select_type: DEPENDENT SUBQUERY table: bench6 type: ALL possible_keys: NULL key: NULL key_len: NULL ref: NULL rows: 10000 Extra: Using where Query time: 0.12 sec mysql> explain select * from bench5 where c = some (select c from bench6)\G *************************** 1. row *************************** id: 1 select_type: PRIMARY table: bench5 type: ALL possible_keys: NULL key: NULL key_len: NULL ref: NULL rows: 10000 Extra: Using where *************************** 2. row *************************** id: 2 select_type: DEPENDENT SUBQUERY table: bench6 type: ALL possible_keys: NULL key: NULL key_len: NULL ref: NULL rows: 10000 Extra: Using where Query time: 0.12 sec If I replace "select *" with "select sum(b)", query time is 0.08 sec for both queries.
[8 May 2008 20:54]
Sergey Petrunya
More investigations =================== The difference between the two queries is purely syntactical. MySQL converts both x=ANY and x=SOME to x IN (...) when parsing. So we're actually looking at only one query. The outer expression has only one value: bench5.c='GPL' always. The inner expression has two possible values: mysql> select c, count(*) from bench6 group by c ; +------+----------+ | c | count(*) | +------+----------+ | BSD | 5000 | | GPL | 5000 | +------+----------+ Handler counters analysis ------------------------- @@optimizer_switch='no_semijoin,no_materialization': Query time 0.19 sec | Handler_read_rnd_next | 20001 | IN->EXISTS scans each of the table once (can't get any faster) @@optimizer_switch='no_semijoin'; Query time 0.10 sec | Handler_read_rnd_next | 20002 | | Handler_write | 10002 | Materialization scans each table once and also makes 10K writes into temp table. The temporary table actually has only two elements. @@optimizer_switch=''; Query time 2 min 54 sec | Handler_read_rnd_next | 30003 | | Handler_write | 50000003 | One of the tables is read two times (perhaps join buffer is not big enough to make just one sweep), and there are 50M duplicate elimination checks. It seems that they are the problem. Note that if we try the corresponding join it will also be very slow: mysql> select count(*) from bench5,bench6 where bench5.c = bench6.c; +----------+ | count(*) | +----------+ | 50000000 | +----------+ 1 row in set (1 min 4.61 sec)
[8 May 2008 20:56]
Sergey Petrunya
This bug should be addressed by WL#3985.
[25 Nov 2008 16:11]
Sergey Petrunya
With WL#3985code, I get this: mysql> set @@optimizer_switch='no_materialization,no_semijoin'; Query OK, 0 rows affected (0.00 sec) mysql> explain select * from bench5 where c = any (select c from bench6)\G *************************** 1. row *************************** id: 1 select_type: PRIMARY table: bench5 type: ALL possible_keys: NULL key: NULL key_len: NULL ref: NULL rows: 10000 Extra: Using where *************************** 2. row *************************** id: 2 select_type: DEPENDENT SUBQUERY table: bench6 type: ALL possible_keys: NULL key: NULL key_len: NULL ref: NULL rows: 10000 Extra: Using where 2 rows in set (0.02 sec) mysql> set @@optimizer_switch=''; Query OK, 0 rows affected (0.00 sec) mysql> explain select sum(b) from bench5 where c = any (select c from bench6)\G *************************** 1. row *************************** id: 1 select_type: PRIMARY table: bench5 type: ALL possible_keys: NULL key: NULL key_len: NULL ref: NULL rows: 10000 Extra: *************************** 2. row *************************** id: 1 select_type: PRIMARY table: bench6 type: ALL possible_keys: NULL key: NULL key_len: NULL ref: NULL rows: 10000 Extra: Using where; FirstMatch(bench5) 2 rows in set (0.13 sec) Average query run times on my machine: 'no_materialization,no_semijoin': 0.11 sec (all optimizations on) : 0.07 sec
[25 Nov 2008 16:15]
Sergey Petrunya
That is, WL#3985 code makes a cost-based decision to use the FirstMatch strategy. FirstMatch execution is similar to the pre-6.0 execution strategy of this query. I'm not sure why FirstMatch it is faster - most likely running FirstMatch in a semi-join is faster than entering/leaving subquery re-evaluation code.
[25 Nov 2008 17:27]
Sergey Petrunya
The patch is WL#3985 code
[29 Dec 2008 17:21]
Sergey Petrunya
Changing status to closed as WL#3985 has been pushed into main tree.