Bug #21 DISTINCT is not taken in account for index Selection (RightNow)
Submitted: 5 Jan 2003 12:18 Modified: 15 Dec 2003 16:05
Reporter: Peter Zaitsev (Basic Quality Contributor)
Status: Closed
Category:Server Severity:S3 (Non-critical)
Version:4.0 OS:Linux (Linux)
Assigned to: Bugs System Target Version:

[5 Jan 2003 12:18] Peter Zaitsev
Description:
I belive I found a bug in the optimizer. Let me illustrate with the following example: 

The following query is run nightly to clean up each of our databases: 

SELECT DISTINCT c.sessionid FROM clicktrack c LEFT OUTER JOIN incidents i ON i.sessionid =
c.sessionid WHERE c.timestamp < '2002-12-05 01:05:08' AND c.interface_id = 1 AND (i.status
NOT IN (1,9,8) OR i.status IS NULL);

It ran for hours when it usally only takes a couple mintes. 

Here is the explain, notice it isn't using any indexes, I tried forcing it to use an index
with the use index syntax with no change. I also tried optimizing the tables with no
change:

explain SELECT DISTINCT c.sessionid FROM clicktrack c LEFT OUTER JOIN incidents i ON
i.sessionid = c.sessionid WHERE c.timestamp < '2002-12-05 01:05:08' AND c.interface_id = 1
AND (i.status NOT IN (1,9,8) OR i.status IS NULL);

+-------+------+----------------------+------+---------+------+---------+-----------------
------------+ 
| table | type | possible_keys        | key  | key_len | ref  | rows    | Extra           
           | 
+-------+------+----------------------+------+---------+------+---------+-----------------
------------+ 
| c     | ALL  | clicktrack$timestamp | NULL |    NULL | NULL | 1491872 | where used;
Using temporary | 
| i     | ALL  | incidents$sessionid  | NULL |    NULL | NULL |    1461 | where used;
Distinct        | 
+-------+------+----------------------+------+---------+------+---------+-----------------
------------+ 
2 rows in set (0.00 sec) 

Here are the indexes on the incidents table, notice the cardinality value for
incidents$sessionid, it currently 1. There are only two values in this column NULLs and a
1 unique session_id. It's not typical for most of our customers to have only one sessionid
so I decided to add one more (so a total of two rows now have a session_id)

mysql> show index from incidents; 
+-----------+------------+-----------------------+--------------+-------------+-----------
+-------------+----------+--------+---------+

| Table     | Non_unique | Key_name              | Seq_in_index | Column_name | Collation
| Cardinality | Sub_part | Packed | Comment |

+-----------+------------+-----------------------+--------------+-------------+-----------
+-------------+----------+--------+---------+

| incidents |          0 | incidents$ref_no      |            1 | ref_no      | A        
|        1461 |     NULL | NULL   |         |

| incidents |          1 | incidents$status      |            1 | status      | A        
|           4 |     NULL | NULL   |         |

| incidents |          1 | incidents$assigned    |            1 | assigned    | A        
|          54 |     NULL | NULL   |         |

| incidents |          1 | incidents$customer_id |            1 | customer_id | A        
|         132 |     NULL | NULL   |         |

| incidents |          1 | incidents$escldate    |            1 | escldate    | A        
|           1 |     NULL | NULL   |         |

| incidents |          1 | incidents$sessionid   |            1 | sessionid   | A        
|           1 |     NULL | NULL   |         |

+-----------+------------+-----------------------+--------------+-------------+-----------
+-------------+----------+--------+---------+

6 rows in set (0.00 sec) 

update incidents set sessionid='zxyyHFe' where ref_no='001110-000006'; 

optimize table incidents; 

Now Notice the cardinality value is 3. 

mysql> show index from incidents; 
+-----------+------------+-----------------------+--------------+-------------+-----------
+-------------+----------+--------+---------+

| Table     | Non_unique | Key_name              | Seq_in_index | Column_name | Collation
| Cardinality | Sub_part | Packed | Comment |

+-----------+------------+-----------------------+--------------+-------------+-----------
+-------------+----------+--------+---------+

| incidents |          0 | incidents$ref_no      |            1 | ref_no      | A        
|        1461 |     NULL | NULL   |         |

| incidents |          1 | incidents$status      |            1 | status      | A        
|           4 |     NULL | NULL   |         |

| incidents |          1 | incidents$assigned    |            1 | assigned    | A        
|          54 |     NULL | NULL   |         |

| incidents |          1 | incidents$customer_id |            1 | customer_id | A        
|         132 |     NULL | NULL   |         |

| incidents |          1 | incidents$escldate    |            1 | escldate    | A        
|           1 |     NULL | NULL   |         |

| incidents |          1 | incidents$sessionid   |            1 | sessionid   | A        
|           3 |     NULL | NULL   |         |

+-----------+------------+-----------------------+--------------+-------------+-----------
+-------------+----------+--------+---------+

Also notice that in now uses an index. When I run this query now it takes only 1 minute to
run. 

mysql> desc SELECT DISTINCT c.sessionid FROM clicktrack c LEFT OUTER JOIN incidents i ON
i.sessionid = c.sessionid WHERE c.timestamp < '2002-12-05 01:05:08' AND c.interface_id = 1
AND (i.status NOT IN (1,9,8) OR i.status IS NULL);

+-------+------+----------------------+---------------------+---------+-------------+-----
----+-----------------------------+

| table | type | possible_keys        | key                 | key_len | ref         | rows
   | Extra                       |

+-------+------+----------------------+---------------------+---------+-------------+-----
----+-----------------------------+

| c     | ALL  | clicktrack$timestamp | NULL                |    NULL | NULL        |
1491864 | where used; Using temporary |

| i     | ref  | incidents$sessionid  | incidents$sessionid |      12 | c.sessionid |    
487 | where used; Distinct        |

+-------+------+----------------------+---------------------+---------+-------------+-----
----+-----------------------------+

2 rows in set (0.00 sec) 

I can easily get you the tables involved if you would like to do more testing. It appears
to me that the optimizer  incorrectly decides not to use an index if the cardinality is
too low.

On an unrelated note I would like to no why indexes don't ever seem to be used with
columns of type datetime when using '<' '>' and between oporators are used.

How to repeat:
CREATE TABLE sessions (
  id int(10) unsigned NOT NULL auto_increment,
  last_update timestamp(14) NOT NULL,
  PRIMARY KEY  (id)
) TYPE=MyISAM;

--
-- Dumping data for table 'sessions'
--

INSERT INTO sessions VALUES 
(1,20030105122040),(2,20030105122040),(3,20030105122040),(4,20030105122040),(5,20030105122
040),(6,20030105122040),(7,20030105122040),(8,20030105122040),(9,20030105122040),(10,20030
105122040),(11,20030105122040),(12,20030105122040),(13,20030105122040),(14,20030105122040)
,(15,20030105122040),(16,20030105122040),(17,20030105122040),(18,20030105122040),(19,20030
105122040),(20,20030105122040),(21,20030105122040),(22,20030105122040),(23,20030105122040)
,(24,20030105122040),(25,20030105122040),(26,20030105122040),(27,20030105122040),(28,20030
105122040),(29,20030105122040),(30,20030105122040),(31,20030105122040),(32,20030105122040)
,(33,20030105122040),(34,20030105122040),(35,20030105122040),(36,20030105122040),(37,20030
105122040),(38,20030105122040);

CREATE TABLE session_data (
  session_id int(10) unsigned NOT NULL default '0',
  var char(20) NOT NULL default '',
  KEY session_id (session_id)
) TYPE=MyISAM;

--
-- Dumping data for table 'session_data'
--

INSERT INTO session_data VALUES 
(1,'a'),(1,'a'),(1,'a'),(1,'a'),(1,'a'),(1,'a'),(1,'a'),(1,'a'),(1,'a'),(1,'a'),(1,'a'),(1
,'a'),(1,'a'),(1,'a'),(1,'a'),(1,'a'),(1,'a'),(1,'a'),(1,'a'),(1,'a'),(1,'a'),(1,'a'),(1,'
a'),(1,'a'),(1,'a'),(1,'a'),(1,'a'),(1,'a'),(1,'a'),(1,'a'),(1,'a'),(1,'a'),(1,'a'),(1,'a'
),(1,'a'),(1,'a'),(1,'a'),(1,'a'),(1,'a'),(1,'a');

explain select distinct id from sessions left outer join 
session_data on id=session_id where last_update<now() and 
(var="a" or var is null);

Suggested fix:
Workaround:

add index (session_id,var) to session_data table to change access type to the table to
index covered.
[10 Jan 2003 0:23] MySQL Developer
Thank you for your bug report. This issue has already been fixed
in the latest released version of that product, which you can download at 
http://www.mysql.com/downloads/

This particular case (search on NULL) is fixed in 4.0.9