Bug #10561 "NOT field IN(valuelist)" doesn't use indexes when it could
Submitted: 11 May 2005 16:21 Modified: 23 May 2005 1:51
Reporter: Dan Nelson Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Optimizer Severity:S4 (Feature request)
Version:4.1.11 OS:FreeBSD (FreeBSD)
Assigned to: Igor Babaev CPU Architecture:Any

[11 May 2005 16:21] Dan Nelson
Description:
a query of the form SELECT ... WHERE NOT field IN('value1','value2','value3') should optimize the same as SELECT ... WHERE field <> 'value1' AND field <> 'value2' AND field <> 'value3', but it doesn't always.  Queries selecting only the field itself will use an index on field if available, but selecting any other field causes it to not even consider the index.

How to repeat:
Run the following:

DROP TABLE IF EXISTS test;
CREATE TABLE test (
  id int(11) NOT NULL auto_increment,
  status varchar(20),
  PRIMARY KEY  (id),
  KEY (status)
);

INSERT INTO test VALUES (1,'B'), (2,'B'), (3,'B'), (4,'B'), (5,'B'), (6,'B'), (7,'B'), (8,'B'), (9,'B'), (10,'B'), (11,'B'), (12,'B'), (13,'B'), (14,'B'), (15,'B'), (16,'B'), (17,'B'), (18,'B'), (19,'B'), (20,'B'), (21,'B'), (22,'B'), (23,'B'), (24,'B'), (25,'A'), (26,'A'), (27,'A'), (28,'A'), (29,'A'), (30,'A'), (31,'A'), (32,'A'), (33,'A'), (34,'A'), (35,'A'), (36,'A'), (37,'A'), (38,'A'), (39,'A'), (40,'A'), (41,'A'), (42,'A'), (43,'A'), (44,'A'), (45,'A'), (46,'A'), (47,'A'), (48,'A'), (49,'A'), (50,'A'), (51,'A'), (52,'A'), (53,'C'), (54,'C'), (55,'C'), (56,'C'), (57,'C'), (58,'C'), (59,'C'), (60,'C');

EXPLAIN SELECT status FROM test WHERE status NOT IN ('A','B');
EXPLAIN SELECT * FROM test WHERE status NOT IN ('A','B');
EXPLAIN SELECT * FROM test WHERE status <> 'A' AND status <> 'B';

The results:

+----+-------------+-------+-------+---------------+--------+---------+------+------+--------------------------+
| id | select_type | table | type  | possible_keys | key    | key_len | ref  | rows | Extra                    |
+----+-------------+-------+-------+---------------+--------+---------+------+------+--------------------------+
|  1 | SIMPLE      | test  | index | NULL          | status |      21 | NULL |   60 | Using where; Using index |
+----+-------------+-------+-------+---------------+--------+---------+------+------+--------------------------+

+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows | Extra       |
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
|  1 | SIMPLE      | test  | ALL  | NULL          | NULL |    NULL | NULL |   60 | Using where |
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+

+----+-------------+-------+-------+---------------+--------+---------+------+------+-------------+
| id | select_type | table | type  | possible_keys | key    | key_len | ref  | rows | Extra       |
+----+-------------+-------+-------+---------------+--------+---------+------+------+-------------+
|  1 | SIMPLE      | test  | range | status        | status |      21 | NULL |   11 | Using where |
+----+-------------+-------+-------+---------------+--------+---------+------+------+-------------+

The second query wasn't able to use the index on status, even though the equivalent third query did. 

Suggested fix:
Not sure.  I thought that "field IN(valuelist)" was converted internally to "field = 'value1' OR field = 'value2'...", but apparently not.

Rearranging the operators ( changing "status NOT IN()" to "NOT status IN()" in Q2, or converting the <> clause to "NOT (status='A' or status='B')" in Q3 ) doesn't affect the results.
[11 May 2005 18:20] Hartmut Holzgraefe
At first i thought that this was because first the rather big result of IN() would be 
processed and only after that it would be negated, leading to a full table scan as
most rows match the IN() part.

But then i tried the following:

EXPLAIN SELECT * FROM test WHERE NOT ( status ='A' OR status='B');

+----+-------------+-------+-------+---------------+--------+---------+------+------+-------------+
| id | select_type | table | type  | possible_keys | key    | key_len | ref  | rows | Extra       |
+----+-------------+-------+-------+---------------+--------+---------+------+------+-------------+
|  1 | SIMPLE      | test  | range | status        | status | 23      | NULL |   11 | Using where |
+----+-------------+-------+-------+---------------+--------+---------+------+------+-------------+

so this might not necessarily be a bug but at least someone with more detail knowledge than me should look over ...
[21 May 2005 8:39] Igor Babaev
This is actually an optimization request.
[21 May 2005 13:12] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/internals/25156
[21 May 2005 14:09] Igor Babaev
The optimization feature will appear in release 5.0.7.
It could be applied to 4.1 as well, but our policy prohibits us to add new features in production releases.

ChangeSet
  1.1894 05/05/21 06:11:44 igor@rurik.mysql.com +4 -0
  range.result, range.test:
    Added test cases for optimization request #10561.
  opt_range.cc, sql_select.cc:
    Fixed bug #10561: an optimization request to allow
    range analysis for NOT IN and NOT BETWEEN.
[23 May 2005 1:51] Paul DuBois
Noted in 5.0.7 changelog.