MySQL Bugs: #31188: Optimizer does not do index range scans using predicates in IN lists

Bug #31188	Optimizer does not do index range scans using predicates in IN lists
Submitted:	25 Sep 2007 16:52	Modified:	19 Nov 2013 13:53
Reporter:	Mark Callaghan	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Optimizer	Severity:	S5 (Performance)
Version:	5.0.37, 5.5	OS:	Any
Assigned to:	Martin Hansson	CPU Architecture:	Any
Tags:	bfsm_2007_10_18, bfsm_2007_11_01, bfsm_2007_12_06, INDEX, inlist, Optimizer, range

Description:
The optimizer is able to use an index range scan on a composite index when the predicates are in disjunctive normal form. It does a full index scan when the semantically equivalent predicates are represented using a multi-value in-list
* where (i=0 and j=0) or (i=1 and j=1) --> uses a range scan on an index on (i,j)
* where (i,j) in ((0,0), (1,1)) --> uses a full index scan on an index on (i,j)

I don't think this is related to http://bugs.mysql.com/bug.php?id=26232

How to repeat:
 create table c (i int, j int);
 create index x on c(i,j);
insert into c values (0,0), (1,1), (2,2), (3,3), (4,4), (5,5), (6,6), (7,7), (8,8), (9,9);
insert into c select * from c; insert into c select * from c;
insert into c select * from c; insert into c select * from c;
insert into c select * from c; insert into c select * from c;
insert into c select * from c; insert into c select * from c;
insert into c select * from c; insert into c select * from c;
insert into c select * from c; insert into c select * from c;
insert into c select * from c; insert into c select * from c;
insert into c select * from c; insert into c select * from c;
analyze table c;
explain select count(*) from c where (i=0 and j=0) or (i=1 and j=1);
explain select count(*) from c where (i,j) in ((0,0), (1,1));
mysql> explain select count(*) from c where (i=0 and j=0) or (i=1 and j=1);
+----+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
| id | select_type | table | type  | possible_keys | key  | key_len | ref  | rows  | Extra                    |
+----+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
|  1 | SIMPLE      | c     | range | x             | x    | 10      | NULL | 58217 | Using where; Using index |
+----+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
1 row in set (0.01 sec)

mysql> explain select count(*) from c where (i,j) in ((0,0), (1,1));
+----+-------------+-------+-------+---------------+------+---------+------+--------+--------------------------+
| id | select_type | table | type  | possible_keys | key  | key_len | ref  | rows   | Extra                    |
+----+-------------+-------+-------+---------------+------+---------+------+--------+--------------------------+
|  1 | SIMPLE      | c     | index | NULL          | x    | 10      | NULL | 327680 | Using where; Using index |
+----+-------------+-------+-------+---------------+------+---------+------+--------+--------------------------+
1 row in set (0.00 sec)

The performance difference is significant once there are enough rows in the table and especially when the selectivity is high. The worst I can do is a factor of 10X with a selectivity of 10%.

mysql> select count(*) from c where (i,j) in ((0,0));
+----------+
| count(*) |
+----------+
|   131072 |
+----------+
1 row in set (1.83 sec)

mysql> select count(*) from c where (i=0 and j=0);
+----------+
| count(*) |
+----------+
|   131072 |
+----------+
1 row in set (0.19 sec)

Suggested fix:
I assume MySQL represents IN lists differently than lists of OR and AND terms.

Thank you for the bug report. Verified as described.

This appears to be a known issue.  In the documentation at:

http://dev.mysql.com/doc/refman/5.0/en/subquery-restrictions.html

The following appears:

==
# Row constructors are not well optimized. The following two expressions are equivalent, but only the second can be optimized:

(col1, col2, ...) = (val1, val2, ...)
col1 = val1 AND col2 = val2 AND ...
==

This looks to be one of those cases where it can't be optimized well.

Analysis
========
The range optimizer is not able to handle row comparisons in form 
  
  (a1,a2, ...) = (b1, b2, ...) 

Single comparison handling is done by converting 

  (a1, a2, ...) = (b1, b2, ...) 

to 

  a1=b1 AND a2=b2 AND ...

The conversion has been introduced as fix for BUG#16081. It is performed by 
check_row_equality() function which invoked at the equality propagation
phase.  The obtained equalities are also taken into account by equality
propagation.

In general case, "a IN (b,c,d, ...)" cannot used by construct equality sets.
Consequently, it is ignored by equality propagation and is not converted to
AND/OR. Range optimizer ignores tuple comparisons, therefore predicates like

  (a1, a2, ...) IN ((b11,b12,...), (c11,c12, ...), ...) 

are non-sargable in MySQL.

The quote from the manual posted by Harrison Fisk is not true after BUG#16081. I'll notify the docs team.

More analysis
=============
At the moment range optimizer processes only IN predicates that have form

  t.keypart IN (const1, ... constN)

and that case is a special case for IN expression - it computes a sorted
list of {const_i} which is used by range analyzer to quickly construct
the corresponding SEL_TREE.

Predicates in other forms, e.g.  "2 IN (3,a)" are not processed.

Re the fix
==========
We consider this bug to be a feature request.

With tuple-based comparisons it is not possible to pre-sort the IN list as
we have no idea how it should be sorted. 

It seems the fix is to make the range optimizer process 
  
    (a1, a2, ...) IN ((b11,b12,...), (c11,c12, ...), ...) 

in the same way as it would have processed an equvalent AND/OR formula.

Possible hazards
~~~~~~~~~~~~~~~~
1. Type inference. There were some counter-intuitive rules re what should
MySQL do when the IN list contains constants of different types. We should
take care not to make any accidental modifications to semantics of
tuple-based IN comparison.

2. Combinatorial blowup. 
the IN-list might be big. We need to check if we might use combinatorial
amounts of time/space when processing it.

Need to ask Mark Callaghan if he actually needs to process very big
IN-lists.

3. If we implement conversion of tuple-based IN comparsion to AND/OR
formula, won't then somebody come and demand that we handle scalar-value
based predicates with non-constants in the right part (e.g.  "2 IN (3,a)")
too? We did have such requests for BETWEEN.
(maybe disable processing of the tuple-based INs if they have non-constants on 
the right side?)

Where/if we should fix it
~~~~~~~~~~~~~~~~~~~~~~~~~
I consider it a feature request. Do we have the feature trees already? If yes we might fix this bug there.

I agree that this is a feature request rather than a bug report. I have many queries with large in-lists, but most of them are single column inlists:
   col in (<thousands of values>)
Predicates with multi-column inlists tend to have few values.

Bug #40029 was marked as a duplicate of this one.

Bug#35819 was marked as a duplicate of this one.

The docs should be updated. They don't explain the feature request for which this bug is open:
>>>
Prior to MySQL 5.0.26, row constructors were not well optimized; of the following two equivalent expressions, only the second could be optimized:

(col1, col2, ...) = (val1, val2, ...)
col1 = val1 AND col2 = val2 AND ...

In MySQL 5.0.26 and later, all row equalities are converted into conjunctions of equalities between row elements, and handled by the optimizer in the same way. (Bug#16081)
>>>

This is the same as http://bugs.mysql.com/bug.php?id=16247, but the author of this feature request is much more convincing.

hehe.

Bug #50571 was marked as a duplicate of this one.

In my humble opinion, this is still not a bug, but feature request. This feature request, however, should be very high on TODO list, as it would make a feature that is sorely missing, particularly since tuple equality expressions have been already optimized.

i put some test files at bug 61541
if anyone want to test with it...

just an idea...
maybe some work done at bug 16081, could help here too (same part of changed code)

http://lists.mysql.com/commits/11247

Bug #61541 was marked as a duplicate of this one.

Bug #63167 was marked as a duplicate of this one.

http://bugs.mysql.com/bug.php?id=64706 marked as duplicate of this one.

Implemented in 5.7.3.

Noted in 5.7.3 changelog.

The optimizer now is able to apply the range scan access method to
queries of this form:

SELECT ... FROM t1 WHERE ( col_1, col_2 ) IN (( 'a', 'b' ), ( 'c', 'd' ));

Previously, for range scans to be used it was necessary for the query
to be written as:

SELECT ... FROM t1 WHERE ( col_1 = 'a' AND col_2 = 'b' )
OR ( col_1 = 'c' AND col_2 = 'd' );

For the optimizer to use a range scan, queries must satisfy these
conditions:

* Only IN predicates are used, not NOT IN.

* On the left side of the IN predicate, the row constructor contains
  only column references.

* On the right side of the IN predicate, row constructors contain only
  runtime constants, which are either literals or local column
  references that are bound to constants during execution.

* On the right side of the IN predicate, there is more than one row
  constructor.

EXPLAIN output for applicable queries changes from full table scan or
index scan to range scan. Changes are also visible by checking the
values of the Handler_read_first, Handler_read_key, and
Handler_read_next status variables.

This feature request was implemented as WL#7019 "Add support for row value constructors in in predicates to range optimizer".