MySQL Bugs: #36817: Non optimal index choice, depending on index creation order

Bug #36817	Non optimal index choice, depending on index creation order
Submitted:	20 May 2008 13:19	Modified:	5 Jul 2009 12:24
Reporter:	jocelyn fournier (Silver Quality Contributor)	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: Optimizer	Severity:	S5 (Performance)
Version:	5.0.54a,5.1.19, 5.1.25, 5.0.60, 5.5	OS:	Any
Assigned to:	Assigned Account	CPU Architecture:	Any
Tags:	qc

Description:
Hi,

If MySQL has the choice between two indexes, one which cover the ORDER BY clause, and another one which doesn't cover it, MySQL will choose the first created index, even if it has to use File Sorting.

Regards,
  Jocelyn Fournier

How to repeat:
DROP TABLE IF EXISTS a;

CREATE TABLE `a` (
  `id1` int(10) unsigned NOT NULL auto_increment,
  `id2` tinyint(3) unsigned NOT NULL default '0',
  `id3` tinyint(3) unsigned NOT NULL default '0',
  `id4` int(10) unsigned NOT NULL default '0',
  `date` timestamp NOT NULL default CURRENT_TIMESTAMP,
  PRIMARY KEY  (`id1`),
  KEY `id2` (`id2`,`id3`,`id4`,`date`),
  KEY `id2_2` (`id2`,`id3`,`date`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

INSERT INTO a (id2,id3,id4) VALUES (1,1,1),(1,1,1),(1,1,1),(1,1,1),(1,0,1),(1,2,1),(1,3,1);

EXPLAIN SELECT id1 FROM a WHERE id2=1 AND id3=1 ORDER BY date DESC LIMIT 0,4;
+----+-------------+-------+------+---------------+------+---------+-------------+------+-----------------------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref         | rows | Extra                       |
+----+-------------+-------+------+---------------+------+---------+-------------+------+-----------------------------+
|  1 | SIMPLE      | a     | ref  | id2,id2_2     | id2  | 2       | const,const |    3 | Using where; Using filesort |
+----+-------------+-------+------+---------------+------+---------+-------------+------+-----------------------------+
1 row in set (0.00 sec)

=> Here MySQL chooses the first index which is non optimal and hence has to do filesorting.

Now let's swap the index declaration order :

DROP TABLE IF EXISTS a;
CREATE TABLE `a` (
  `id1` int(10) unsigned NOT NULL auto_increment,
  `id2` tinyint(3) unsigned NOT NULL default '0',
  `id3` tinyint(3) unsigned NOT NULL default '0',
  `id4` int(10) unsigned NOT NULL default '0',
  `date` timestamp NOT NULL default CURRENT_TIMESTAMP,
  PRIMARY KEY  (`id1`),
  KEY `id2` (`id2`,`id3`,`date`),
  KEY `id2_2` (`id2`,`id3`,`id4`,`date`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

INSERT INTO a (id2,id3,id4) VALUES (1,1,1),(1,1,1),(1,1,1),(1,1,1),(1,0,1),(1,2,1),(1,3,1);

EXPLAIN SELECT id1 FROM a WHERE id2=1 AND id3=1 ORDER BY date DESC LIMIT 0,4;

+----+-------------+-------+------+---------------+------+---------+-------------+------+-------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref         | rows | Extra       |
+----+-------------+-------+------+---------------+------+---------+-------------+------+-------------+
|  1 | SIMPLE      | a     | ref  | id2,id2_2     | id2  | 2       | const,const |    3 | Using where |
+----+-------------+-------+------+---------------+------+---------+-------------+------+-------------+

MySQL still uses the first index, which is now optimal (no filesorting).

Suggested fix:
Always choose the best index, with no filesorting.

Thank you for a bug report. Verified just as described with 5.1.25:

C:\Program Files\MySQL\MySQL Server 5.0\bin>mysql -uroot -proot -P3310 test
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 2
Server version: 5.1.25-rc-community MySQL Community Server (GPL)

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

mysql> DROP TABLE IF EXISTS a;
Query OK, 0 rows affected, 1 warning (0.00 sec)

mysql>
mysql> CREATE TABLE `a` (
    ->   `id1` int(10) unsigned NOT NULL auto_increment,
    ->   `id2` tinyint(3) unsigned NOT NULL default '0',
    ->   `id3` tinyint(3) unsigned NOT NULL default '0',
    ->   `id4` int(10) unsigned NOT NULL default '0',
    ->   `date` timestamp NOT NULL default CURRENT_TIMESTAMP,
    ->   PRIMARY KEY  (`id1`),
    ->   KEY `id2` (`id2`,`id3`,`id4`,`date`),
    ->   KEY `id2_2` (`id2`,`id3`,`date`)
    -> ) ENGINE=MyISAM DEFAULT CHARSET=latin1;
Query OK, 0 rows affected (0.06 sec)

mysql>
mysql> INSERT INTO a (id2,id3,id4) VALUES
    -> (1,1,1),(1,1,1),(1,1,1),(1,1,1),(1,0,1),(1,2,1),(1,3,1);
Query OK, 7 rows affected (0.06 sec)
Records: 7  Duplicates: 0  Warnings: 0

mysql> EXPLAIN SELECT id1 FROM a WHERE id2=1 AND id3=1 ORDER BY date DESC LIMIT
0,4;
+----+-------------+-------+------+---------------+------+---------+------------
-+------+-----------------------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref
 | rows | Extra                       |
+----+-------------+-------+------+---------------+------+---------+------------
-+------+-----------------------------+
|  1 | SIMPLE      | a     | ref  | id2,id2_2     | id2  | 2       | const,const
 |    3 | Using where; Using filesort |
+----+-------------+-------+------+---------------+------+---------+------------
-+------+-----------------------------+
1 row in set (0.05 sec)

mysql> DROP TABLE IF EXISTS a;
Query OK, 0 rows affected (0.00 sec)

mysql> CREATE TABLE `a` (
    ->   `id1` int(10) unsigned NOT NULL auto_increment,
    ->   `id2` tinyint(3) unsigned NOT NULL default '0',
    ->   `id3` tinyint(3) unsigned NOT NULL default '0',
    ->   `id4` int(10) unsigned NOT NULL default '0',
    ->   `date` timestamp NOT NULL default CURRENT_TIMESTAMP,
    ->   PRIMARY KEY  (`id1`),
    ->   KEY `id2` (`id2`,`id3`,`date`),
    ->   KEY `id2_2` (`id2`,`id3`,`id4`,`date`)
    -> ) ENGINE=MyISAM DEFAULT CHARSET=latin1;
Query OK, 0 rows affected (0.06 sec)

mysql>
mysql> INSERT INTO a (id2,id3,id4) VALUES
    -> (1,1,1),(1,1,1),(1,1,1),(1,1,1),(1,0,1),(1,2,1),(1,3,1);
Query OK, 7 rows affected (0.00 sec)
Records: 7  Duplicates: 0  Warnings: 0

mysql> EXPLAIN SELECT id1 FROM a WHERE id2=1 AND id3=1 ORDER BY date DESC LIMIT
0,4;
+----+-------------+-------+------+---------------+------+---------+------------
-+------+-------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref
 | rows | Extra       |
+----+-------------+-------+------+---------------+------+---------+------------
-+------+-------------+
|  1 | SIMPLE      | a     | ref  | id2,id2_2     | id2  | 2       | const,const
 |    3 | Using where |
+----+-------------+-------+------+---------------+------+---------+------------
-+------+-------------+
1 row in set (0.00 sec)

Same problem verified with 5.0.60.

triage: need e/r values (before setting P3 to determine target

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/61512

2744 Georgi Kodinov	2008-12-12
      Bug #36817: Non optimal index choice, depending on index creation order
      
      When selecting among the available indexes for REF access the optimizer can take 
      into account that certain indexes may require sorting for ORDER BY and some may not.
      This can be done by putting extra weight to the indexes that would require filesort.
      Note that this extra weight should not be added when comparing with the other access
      methods (e.g. quick access) because it's considered at a later stage anyway.

Hi,

Actually that doesn't always concern file sorting / ORDER BY clause (which is quite annoying) :

With 5.1.36 :

Case 1:

DROP TABLE IF EXISTS a;

CREATE TABLE `a` (
  `id1` int(10) unsigned NOT NULL auto_increment,
  `id2` tinyint(3) unsigned NOT NULL default '0',
  `id3` tinyint(3) unsigned NOT NULL default '0',
  `id4` int(10) unsigned NOT NULL default '0',
  `date` timestamp NOT NULL default CURRENT_TIMESTAMP,
  PRIMARY KEY  (`id1`),
  KEY `id2` (`id2`,`id3`,`id4`,`date`),
  KEY `id2_2` (`id2`,`id3`,`date`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

INSERT INTO a (id2,id3,id4) VALUES
(1,1,1),(1,1,1),(1,1,1),(1,1,1),(1,0,1),(1,2,1),(1,3,1);
EXPLAIN SELECT id1 FROM a WHERE id2=1 AND id3=1 AND id4=1;
+----+-------------+-------+------+---------------+------+---------+-------------------+------+-------+
| id | select_type | table | type | possible_keys | key  | key_len | ref               | rows | Extra |
+----+-------------+-------+------+---------------+------+---------+-------------------+------+-------+
|  1 | SIMPLE      | a     | ref  | id2,id2_2     | id2  | 6       | const,const,const |    3 |       |
+----+-------------+-------+------+---------------+------+---------+-------------------+------+-------+

=> id2 is chosen here, which is ok
Now let's swap the index declaration order :

Case 2:

DROP TABLE IF EXISTS a;
CREATE TABLE `a` (
  `id1` int(10) unsigned NOT NULL auto_increment,
  `id2` tinyint(3) unsigned NOT NULL default '0',
  `id3` tinyint(3) unsigned NOT NULL default '0',
  `id4` int(10) unsigned NOT NULL default '0',
  `date` timestamp NOT NULL default CURRENT_TIMESTAMP,
  PRIMARY KEY  (`id1`),
  KEY `id2` (`id2`,`id3`,`date`),
  KEY `id2_2` (`id2`,`id3`,`id4`,`date`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

INSERT INTO a (id2,id3,id4) VALUES
(1,1,1),(1,1,1),(1,1,1),(1,1,1),(1,0,1),(1,2,1),(1,3,1);

 EXPLAIN SELECT id1 FROM a WHERE id2=1 AND id3=1 AND id4=1;
+----+-------------+-------+------+---------------+------+---------+-------------+------+-------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref         | rows | Extra       |
+----+-------------+-------+------+---------------+------+---------+-------------+------+-------------+
|  1 | SIMPLE      | a     | ref  | id2,id2_2     | id2  | 2       | const,const |    3 | Using where |
+----+-------------+-------+------+---------------+------+---------+-------------+------+-------------+

Here id2 is selected, which is not the optimal index.

Result with 5.0.75, the behaviour is swapped : 

Case 1 :

EXPLAIN SELECT id1 FROM a WHERE id2=1 AND id3=1 AND id4=1;
+----+-------------+-------+------+---------------+-------+---------+-------------+------+-------------+
| id | select_type | table | type | possible_keys | key   | key_len | ref         | rows | Extra       |
+----+-------------+-------+------+---------------+-------+---------+-------------+------+-------------+
|  1 | SIMPLE      | a     | ref  | id2,id2_2     | id2_2 | 2       | const,const |    3 | Using where |
+----+-------------+-------+------+---------------+-------+---------+-------------+------+-------------+

=> here id2_2 is used (the wrong one)

Case 2 : 

EXPLAIN SELECT id1 FROM a WHERE id2=1 AND id3=1 AND id4=1;
+----+-------------+-------+------+---------------+-------+---------+-------------------+------+-------+
| id | select_type | table | type | possible_keys | key   | key_len | ref               | rows | Extra |
+----+-------------+-------+------+---------------+-------+---------+-------------------+------+-------+
|  1 | SIMPLE      | a     | ref  | id2,id2_2     | id2_2 | 6       | const,const,const |    3 |       |
+----+-------------+-------+------+---------------+-------+---------+-------------------+------+-------+

=> Ok

Any plan to fix this a little bit earlier than 6.0 ? To fix this issue, I have to heavily hack my application adding USE INDEX, which is not really acceptable for such "standard" queries.

Thanks and regards,

Bug #59131 was marked as a duplicate of this one.

I did some research on this issue and (wrongly) submitted bug#59131. Here are copy from it.

create table a (id int auto_increment primary key,
k1 int, k2 int, k3 int, value int,
index i1 (k2, k1, k3), index i2 (k2, k3)) engine=innodb;
Then insert lots of rows with low enough cardinality on k2 (I tested
inserting 2.6 mil rows, cardinality on k2 was 1000):

insert into a (k1, k2, k3, value)values (1,1,1,1),
(2,2,2,2),(3,3,3,3),(4,4,4,4),(5,5,5,5),(6,6,6,6),(7,7,7,7),(8,8,8,8),(9,9,9,9),(10,10,10,10);
insert into a (k1, k2, k3, value) select rand()*1000000, rand()*1000,
rand()*1000000, value from a;
insert into a (k1, k2, k3, value) select rand()*1000000, rand()*1000,
rand()*1000000, value from a;
insert into a (k1, k2, k3, value) select rand()*1000000, rand()*1000,
rand()*1000000, value from a;
....

insert into a (k1, k2, k3, value) select rand()*1000000, rand()*1000,
rand()*1000000, value from a;
Query OK, 1310720 rows affected (51.06 sec)
Records: 1310720  Duplicates: 0  Warnings: 0

mysql> explain select * from a where k2=5 order by k3 limit 2;
+----+-------------+-------+-------+---------------+------+---------+------+------+-------------+
| id | select_type | table | type  | possible_keys | key  | key_len |
ref  | rows | Extra       |
+----+-------------+-------+-------+---------------+------+---------+------+------+-------------+
|  1 | SIMPLE      | a     | range | i1,i2         | i2   | 5       |
NULL | 2647 | Using where |
+----+-------------+-------+-------+---------------+------+---------+------+------+-------------+
1 row in set (0.00 sec)

When I checked with a debugger, the following lines where executed to
check whether i2 should be used or not.

 sql/sql_select.cc test_if_cheaper_ordering():
(test_if_skip_sort_order() in 5.1)
   if (keys.is_set(nr) &&
       (direction= test_if_order_by_key(order, table, nr,
&used_key_parts)))
...
       /*
         We assume that each of the tested indexes is not correlated
         with ref_key. Thus, to select first N records we have to
scan
         N/selectivity(ref_key) index entries.
         selectivity(ref_key) = #scanned_records/#table_records =
         table->quick_condition_rows/table_records.
         In any case we can't select more than #table_records.
         N/(table->quick_condition_rows/table_records) >
table_records
         <=> N > table->quick_condition_rows.
       */
       if (select_limit > table->quick_condition_rows)
         select_limit= table_records;
       else
         select_limit= (ha_rows) (select_limit *
                                  (double) table_records /
                                   table->quick_condition_rows);
....

       index_scan_time= select_limit/rec_per_key *
                        min(rec_per_key, table->file->scan_time());
       if ((ref_key < 0 && is_covering) ||
           (ref_key < 0 && (group || table->force_index)) ||
           index_scan_time < read_time)
----

 table_records: close to the number of rows (2.6M)
 table->quick_condition_rows: estimated matched rows (2.6M / 1000 =
2.6K)
 select_limit: originally the same as LIMIT N (20), finally
select_limit * table_records / table->quick_condition_rows (20 * 2.6M /
2.6K = 20K)
 index_scan_time: the same as the final select_limit value
 read_time: (probably) the same as table->quick_condition_rows (2.6K)

 read_time was the same as the estimated cost when using index i1. The
MySQL optimizer decided that the cost of using i2 (index_scan_time:
20k) was much more expensive than the cost of using i1 (2.6k) so it
chose i1.
 But apparently this is wrong decision. I am not sure why
select_limit(index_scan_time) was computed by multiplying
table_records. Since it was index scan and sorting could be skipped,
select_limit should be at least less than table->quick_condition_rows,
shouldn't it?