MySQL Bugs: #12077: SELECT ... WHERE field IN subquery - IN clause not optimized/using indexes

Bug #12077	SELECT ... WHERE field IN subquery - IN clause not optimized/using indexes
Submitted:	21 Jul 2005 10:46	Modified:	21 Mar 2009 17:27
Reporter:	Azza Azza69	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Optimizer	Severity:	S3 (Non-critical)
Version:	4.1.12	OS:	Windows (XP)
Assigned to:	Igor Babaev	CPU Architecture:	Any
Tags:	subquery, subquery benchmark

Description:
select * from epos_transaction_lines where
uid_header in
(
select uid FROM epos_transactions
where UID_zreset in
(
'7895022841e30929559c4851bf95eab3'
)

takes around 59 seconds despite having the appropriate indexes on both the epos_transaction_lines and epos_transactions ... but the following takes less than 1 second:

select * from epos_transaction_lines where
uid_header=
(
select uid FROM epos_transactions
where UID_zreset='17895022841e30929559c4851bf95eab3'
)

**** The only difference being the 'IN' has changed to an '=' ****

epos_transactions contains 332,380 records
epos_transaction_lines contains 1,536,633 records

How to repeat:
CREATE TABLE epos_transaction_lines (
  UID_Header varchar(32) NOT NULL default '',
  LineNumber smallint(5) unsigned NOT NULL default '0',
  LineType enum('Sale','Void','Discount','Promotion','Deposit','Bundle') default NULL,
  Code varchar(10) default NULL,
  Description varchar(35) default NULL,
  Price decimal(10,2) default NULL,
  ActualPrice decimal(10,2) default NULL,
  VATPaid decimal(10,2) default NULL,
  VATCode char(1) character set latin1 collate latin1_bin default NULL,
  VATRate decimal(6,2) default NULL,
  Input_Type enum('Manual','Scanned') NOT NULL default 'Manual',
  Input_Data varchar(20) default NULL,
  Input_DateTime datetime default NULL,
  Input_By varchar(25) default NULL,
  LinkType enum('Sale','Void','Discount','Promotion','Deposit','Bundle') default NULL,
  LinkLineNumber smallint(5) unsigned NOT NULL default '0',
  Group1 varchar(30) default NULL,
  PRIMARY KEY  (UID_Header,LineNumber),
  KEY LineType (LineType),
  KEY Code (Code),
  KEY Price (Price)
) ENGINE=MyISAM DEFAULT CHARSET=latin1

CREATE TABLE epos_transactions (
  UID_ZReset varchar(32) NOT NULL default '',
  UID varchar(32) NOT NULL default '',
  BranchCode varchar(10) default NULL,
  RegisterNumber tinyint(3) unsigned default NULL,
  TransactionNumber smallint(5) unsigned default NULL,
  TransactionDateTime datetime default NULL,
  FinishedBy varchar(20) default NULL,
  NumberOfLines smallint(6) unsigned default NULL,
  PriceTotal decimal(10,2) default NULL,
  ActualPriceTotal decimal(10,2) default NULL,
  VATTotal decimal(10,2) default NULL,
  NumberOfPayments smallint(6) unsigned default NULL,
  TenderedTotal decimal(10,2) default NULL,
  ChangeTotal decimal(10,2) default NULL,
  WasCashDrawerOpened enum('N','Y') default 'N',
  WasDuplicateReceiptNeeded enum('N','Y') default 'N',
  Receipt text,
  ZResetNumber smallint(5) unsigned default NULL,
  PRIMARY KEY  (UID),
  KEY BranchCode (BranchCode),
  KEY RegisterNumber (RegisterNumber),
  KEY ZResetNumber (ZResetNumber),
  KEY TransactionNumber (TransactionNumber),
  KEY TransactionDateTime (TransactionDateTime),
  KEY FinishedBy (FinishedBy),
  KEY TransactionTotal (PriceTotal),
  KEY UID_ZReset (UID_ZReset)
) ENGINE=MyISAM DEFAULT CHARSET=latin1

Trying to produce as small test case as possible.

Thank you for the EXPLAIN results. Does the 2nd query (without IN) give you any rows really? How many rows are returned from the first query?

It is a non-existant uid so both queries return 0 rows...
First query (using IN) gives 0 rows in 70 seconds.
Second query (using =) gives 0 rows in 0.03 seconds.

(<ahem>I only tried using the '=' when I found it was taking ages with the 'IN', so I didn't actually test the query and results properly)...sorry :( so I can't actually use the '=' in the subquery as it returns multiple rows, *BUT* if I use:

select * from epos_transaction_lines where
uid_header in
(
select uid FROM epos_transactions
where UID_zreset=
'7438db7418f21a3e29bebc9f9c750d70'
)

both files still have indexes and it still takes 46.50 seconds to return 254 rows...

and here's the EXPLAIN for it:
"id","select_type","table","type","possible_keys","key","key_len","ref","rows","Extra"
"1","PRIMARY","epos_transaction_lines","ALL","Null","Null","Null","Null","1,537,336","Using where"
"2","DEPENDENT SUBQUERY","epos_transactions","unique_subquery","PRIMARY,UID_ZReset","PRIMARY","32","func","1","Using index; Using where"

Your queries in the initial bugreport are not identical:

Query 1:
select * from epos_transaction_lines where
uid_header in
(
select uid FROM epos_transactions
where UID_zreset in
(
'7895022841e30929559c4851bf95eab3'
)

Query 2:
select * from epos_transaction_lines where
uid_header=
(
select uid FROM epos_transactions
where UID_zreset='17895022841e30929559c4851bf95eab3'
)

Please, note that '1' is uppended in the second query (UID_zreset='1...').
If it is not a misprint, then it can explain the difference in execution times.

The queries in your additional comment (with EXPLAIN results) use identical values, but it looks like the second query returned nothing to you.

I tried to produce a simple similar test case:

CREATE TABLE master (
  UID_ZReset varchar(32) NOT NULL default '',
  UID varchar(32) NOT NULL default '',
  Other varchar(100),
  PRIMARY KEY  (UID),
  KEY UID_ZReset (UID_ZReset)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

CREATE TABLE details (
  UID_Header varchar(32) NOT NULL default '',
  LineNumber smallint(5) unsigned NOT NULL default '0',
  Description varchar(35) default NULL,
  PRIMARY KEY  (UID_Header,LineNumber)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

I put 5 rows in master and 30 in details then:

insert into master values('7895022841e30929559c4851bf95eab3', '7895022841e30929559c4851bf95eab3', 'Something');
insert into master values('7895022841e30929559c4851bf95eab4', '2', 'Something');
insert into master values('7895022841e30929559c4851bf95eab5', '3', 'Something');
insert into master values('7895022841e30929559c4851bf95eab6', '4', 'Something');
insert into master values('7895022841e30929559c4851bf95eab7', '5', 'Something');

And 25 rows in details (similar ratio of master # of rows to details as in your initital description):

insert into details values('7895022841e30929559c4851bf95eab3', 1, 'line1');
insert into details values('7895022841e30929559c4851bf95eab3', 2, 'line2');
insert into details values('7895022841e30929559c4851bf95eab3', 3, 'line3');
insert into details values('7895022841e30929559c4851bf95eab3', 4, 'line4');
insert into details values('7895022841e30929559c4851bf95eab3', 5, 'line5');

insert into details values('7895022841e30929559c4851bf95eab4', 1, 'line1');
insert into details values('7895022841e30929559c4851bf95eab4', 2, 'line2');
insert into details values('7895022841e30929559c4851bf95eab4', 3, 'line3');
insert into details values('7895022841e30929559c4851bf95eab4', 4, 'line4');
insert into details values('7895022841e30929559c4851bf95eab4', 5, 'line5');
...

Then I analyzed both tables:

analyze table master;
analyze table details;

The queries similar tou yours gave me:

mysql> select * from details where uid_header = (select uid from master where uid_zreset = '7895022841e30929559c4851bf95eab3');
+----------------------------------+------------+-------------+
| UID_Header                       | LineNumber | Description |
+----------------------------------+------------+-------------+
| 7895022841e30929559c4851bf95eab3 |          1 | line1       |
| 7895022841e30929559c4851bf95eab3 |          2 | line2       |
| 7895022841e30929559c4851bf95eab3 |          3 | line3       |
| 7895022841e30929559c4851bf95eab3 |          4 | line4       |
| 7895022841e30929559c4851bf95eab3 |          5 | line5       |
+----------------------------------+------------+-------------+
5 rows in set (0.00 sec)

mysql> explain select * from details where uid_header in (select uid from master
 where uid_zreset in ('7895022841e30929559c4851bf95eab3'));
+----+--------------------+---------+-----------------+--------------------+----
-----+---------+------+------+--------------------------+
| id | select_type        | table   | type            | possible_keys      | key
     | key_len | ref  | rows | Extra                    |
+----+--------------------+---------+-----------------+--------------------+----
-----+---------+------+------+--------------------------+
|  1 | PRIMARY            | details | ALL             | NULL               | NUL
L    |    NULL | NULL |   25 | Using where              |
|  2 | DEPENDENT SUBQUERY | master  | unique_subquery | PRIMARY,UID_ZReset | PRI
MARY |      32 | func |    1 | Using index; Using where |
+----+--------------------+---------+-----------------+--------------------+----
-----+---------+------+------+--------------------------+
2 rows in set (0.00 sec)

The same plan as yours, but with much less rows. Then:

mysql> explain select * from details where uid_header = (select uid from master
where uid_zreset = '7895022841e30929559c4851bf95eab3');
+----+-------------+---------+------+---------------+------------+---------+----
---+------+-------------+
| id | select_type | table   | type | possible_keys | key        | key_len | ref
   | rows | Extra       |
+----+-------------+---------+------+---------------+------------+---------+----
---+------+-------------+
|  1 | PRIMARY     | details | ref  | PRIMARY       | PRIMARY    |      32 | con
st |    5 | Using where |
|  2 | SUBQUERY    | master  | ref  | UID_ZReset    | UID_ZReset |      32 |
   |    1 | Using where |
+----+-------------+---------+------+---------------+------------+---------+----
---+------+-------------+
2 rows in set (0.00 sec)

Which is different from your plan, although it is "perfect" (primary key used). So, i think it's really a bug (in the optimization of the first query).

The plan like yours can be obtained when the non-existing value passed:

mysql> explain select * from details where uid_header = (select uid from master
where uid_zreset = '0');
+----+-------------+--------+------+---------------+------------+---------+-----
-+------+-----------------------------------------------------+
| id | select_type | table  | type | possible_keys | key        | key_len | ref
 | rows | Extra                                               |
+----+-------------+--------+------+---------------+------------+---------+-----
-+------+-----------------------------------------------------+
|  1 | PRIMARY     | NULL   | NULL | NULL          | NULL       |    NULL | NULL
 | NULL | Impossible WHERE noticed after reading const tables |
|  2 | SUBQUERY    | master | ref  | UID_ZReset    | UID_ZReset |      32 |
 |    1 | Using where                                         |
+----+-------------+--------+------+---------------+------------+---------+-----
-+------+-----------------------------------------------------+
2 rows in set (0.00 sec)

In this particular case no rows (NULL) where processed from the detais table (because subquery returned 0 rows)

Thank you for the bug report.

Phew - what a reply :)
So there IS a bug in the optimizer then!
Do you think this will be fixed for next version?
Is there an ETA on the next version?
I only ask as you can imagine, these files will just keep getting bigger and if it takes ~ 70 seconds now, maybe next month it will take ~140 seconds!!

[Sorry I was slightly misleading from the original post but I got there in the end ;)]

Current implementation of subqueries cannot allow us to fix this problem. The fact is that when the optimizer looks for an execution plan for the main query it does have any execution plan for the subquery (the subquery is not optimized yet at this moment). So the optimizer cannot guess that the subquery always returns 1 row (actually it has no estimate on the number of rows returned by the subquery).
We can hope to resolve this problem when transforming the query into a join query.
Most probably we'll be able to implement this transformation in version 5.2.

Changed synopsis so searches for 'subquery' can find this

Hi!

In MySQL 6.0 we've significantly reworked subquery optimizations, and in particular, the subqueries in form 
 
  select ... where in (select primary_key ...)

are now internally converted to inner joins. This means that WHERE ... IN (SELECT ...) subquery will now run as fast as =(SELECT ...) subquery.

Preliminary documentation can be found at http://forge.mysql.com/wiki/6.0_Subquery_Optimization_Cheatsheet.

Marking as closed in 6.0