Bug #47914 Performance Issue with Subquery
Submitted: 8 Oct 2009 9:10 Modified: 26 Mar 2012 19:30
Reporter: Xinhong Pu Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Optimizer Severity:S2 (Serious)
Version:4.1 and 5.1 OS:Windows (Windows Server 2003)
Assigned to: CPU Architecture:Any
Tags: subquery

[8 Oct 2009 9:10] Xinhong Pu
Description:
I encounter a serious performance issue. Let's say I have a table X with more than 10 thousand records which is containing 2 main columns and any other columns: first column ID is an auto_increment column and second column A is an indexing column. Now I want to find out duplicated entries on column A. The first query here "select A from X group by A having count(*) > 1" is very fast and returned me the correct result. But the second query here "select ID, A from X where A in (select A from X group by A having count(*) > 1)" will hang the system. If I rewrite second query as "select ID, A from X where A in (the_result_from_first_query_in_comma_separated)", the query result will also be returned to me very fast. I tried on both MySQL 4.1 and 5.1 on different Windows platform and encountered same performance issues.

How to repeat:
See description above.

Suggested fix:
For such subquery, store the result from 1st query and generate 2nd query.
[8 Oct 2009 10:48] Valeriy Kravchuk
Thank you for the problem report. Please, send exact CREATE TABLE statement(s) and EXPLAIN results for the exact query from recent MySQL 5.1 (5.1.39).
[9 Oct 2009 3:02] Xinhong Pu
To repeat the result by doing the following steps in mysql command line client:

use test
create table X (ID int auto_increment primary key, A int, key A (A)) engine=MyISAM;
delimiter //
create procedure insrecs (n int)
begin
set @i = n;
repeat
insert into X (A) values (@i);
set @i = @i - 1;
until @i <= 0
end repeat;
end
//
delimiter ;
call insrecs(100000);
drop procedure insrecs;
insert into X (A) values (10);
select ID, A from X where A in (select A from X group by A having count(*) > 1);

The last query will hang the system. CPU usage will go up to 100% for single processor machine and no result is returned after waiting for quite a long time. The query "explain extended select ID, A from X where A in (select A from X group by A having count(*) > 1)" will return the following results:

+----+--------------------+-------+-------+---------------+------+---------+------+--------+-------------+-------------+
| id | select_type        | table | type  | possible_keys | key  | key_len | ref  | rows   | filtered    | Extra       |
+----+--------------------+-------+-------+---------------+------+---------+------+--------+-------------+-------------+
|  1 | PRIMARY            | x     | ALL   | NULL          | NULL | NULL    | NULL | 100001 |      100.00 | Using where |
|  2 | DEPENDENT SUBQUERY | x     | index | NULL          | A    | 5       | NULL |      1 | 10000100.00 | Using index |
+----+--------------------+-------+-------+---------------+------+---------+------+--------+-------------+-------------+
2 rows in set, 1 warning (0.00 sec)
[9 Oct 2009 3:51] Valeriy Kravchuk
I've got the same results with recent 5.1.40. It is a known problem of MySQL that considers some subqueries as dependent while they are NOT. The problem was fixed long time ago in 6.0.x and should be fixed in recent 5.4.x. Optimizer feature is called "materialization". Look:

77-52-242-160:6.0-codebase openxs$ bin/mysql -uroot test
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 1
Server version: 6.0.14-alpha-debug Source distribution

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> create table X (ID int auto_increment primary key, A int, key A (A)) engine=MyISAM;
Query OK, 0 rows affected (0.42 sec)

mysql> delimiter //
mysql> create procedure insrecs (n int)
    -> begin
    -> set @i = n;
    -> repeat
    -> insert into X (A) values (@i);
    -> set @i = @i - 1;
    -> until @i <= 0
    -> end repeat;
    -> end
    -> //
Query OK, 0 rows affected (0.01 sec)

mysql> delimiter ;
mysql> call insrecs(100000);
Query OK, 0 rows affected (55.00 sec)

mysql> drop procedure insrecs;
Query OK, 0 rows affected (0.00 sec)

mysql> insert into X (A) values (10);
Query OK, 1 row affected (0.01 sec)

mysql> explain
    -> extended select ID, A from X where A in (select A from X group by A having count(*) > 1)\G
*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: X
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 100001
     filtered: 100.00
        Extra: Using where
*************************** 2. row ***************************
           id: 2
  select_type: SUBQUERY
        table: X
         type: index
possible_keys: NULL
          key: A
      key_len: 5
          ref: NULL
         rows: 100001
     filtered: 100.00
        Extra: Using index
2 rows in set, 1 warning (0.00 sec)

mysql> show variables like 'optimizer_sw%'\G
*************************** 1. row ***************************
Variable_name: optimizer_switch
        Value: firstmatch=on,index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,loosescan=on,materialization=on,semijoin=on
1 row in set (0.00 sec)

mysql> select ID, A from X where A in (select A from X group by A having count(*) > 1)\G
*************************** 1. row ***************************
ID: 99991
 A: 10
*************************** 2. row ***************************
ID: 100001
 A: 10
2 rows in set (0.74 sec)

It is unlikely that the fix will be backported to 5.1. So, please, wait for new releases of 5.4.
[9 Oct 2009 8:50] Roy Lyseng
There are possible workarounds that will let this query run more efficiently:

1. Add a WHERE clause in the subquery:

select ID, A
from X
where A in (select A
            from X as Y
            where X.A=Y.A
            group by A
            having count(*) > 1);

2. or rewrite as EXISTS:

select ID, A
from X
where exists (select *
              from X as Y
              where X.A=Y.A
              group by A
              having count(*) > 1);

In both cases, the where clause in the subquery will direct the optimizer to use the index in each subquery evaluation. It is almost as fast as the materialization technique.
[26 Mar 2012 19:30] Paul DuBois
Noted in 5.6.5 changelog.

Several subquery performance issues were resolved through the
implementation of semi-join subquery optimization strategies.