MySQL Bugs: #26106: Wrong plan may be chosen when there are several possible range and ref accesses

Bug #26106	Wrong plan may be chosen when there are several possible range and ref accesses
Submitted:	6 Feb 2007 8:15	Modified:	13 Aug 2010 2:31
Reporter:	Sergey Petrunya	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Optimizer	Severity:	S3 (Non-critical)
Version:	5.0.33	OS:	Any
Assigned to:	Martin Hansson	CPU Architecture:	Any

Description:
Query optimizer may choose an apparently sub-optimal QEP when there are 
several possible range and ref accesses.

How to repeat:
The case is similar to one described here
http://www.mysqlperformanceblog.com/2007/01/31/getting-mysql-to-use-full-key-length/

but it is rather complicated so I'll post more details in separate comments.

How to repeat: 

1. Create and fill the table:
CREATE TABLE `link_in85` (
  `domain_id` int(10) unsigned NOT NULL,
  `link_id` int(10) unsigned NOT NULL auto_increment,
  `url_from` varchar(255) NOT NULL,
  `url_to` varchar(255) NOT NULL,
  `anchor` varchar(255) NOT NULL,
  `from_site_id` int(10) unsigned NOT NULL,
  `from_forum_id` int(10) unsigned NOT NULL,
  `from_message_id` int(10) unsigned NOT NULL,
  `message_published` timestamp NOT NULL default CURRENT_TIMESTAMP,
  `kind` enum('link','img') NOT NULL,
  `url_title` varchar(255) NOT NULL,
  `isexternal` tinyint(3) unsigned NOT NULL,
  `revert_domain` varchar(255) NOT NULL,
  `url_prefix` varchar(255) NOT NULL,
  `from_domain_id` int(10) unsigned NOT NULL,
  `ext` varchar(25) NOT NULL,
  `linktype` enum('html','video','mp3','image','pdf','other') NOT NULL,
  `message_day` date NOT NULL,
  PRIMARY KEY  (`link_id`),
  UNIQUE KEY `domain_id_2` (`domain_id`,`link_id`),
  KEY `url_to` (`url_to`),
  KEY `domain_id` (`domain_id`,`from_site_id`,`message_published`),
  KEY `revert_domain` (`revert_domain`,`url_prefix`(80)),
  KEY `domain_id_3`    (`domain_id`,`message_day`,`revert_domain`,`url_prefix`,`from_site_id`,`isexternal`),
  KEY `domain_message` (`domain_id`,`message_day`,`isexternal`,`revert_domain`,`url_prefix`,`from_site_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8

fill it with certain data (you need precise dataset from PeterZ, I have it)

Problem #1:
Run the query:
explain SELECT
  url_to,    
  count(*) cnt,
  revert_domain,
  url_prefix,
  count(distinct from_site_id) uniq  
FROM    
  link_in85 IGNORE INDEX (domain_message) 
WHERE    
  domain_id=434876 AND
  kind='link' AND
  isexternal=1 AND    
  message_day IN ('2007-01-01','2007-01-02','2007-01-03','2007-01-04',
                  '2007-01-05','2007-01-06','2007-01-07','2007-01-08',
                  '2007-01-09','2007-01-10','2007-01-11','2007-01-12',
                  '2007-01-13','2007-01-14','2007-01-15','2007-01-16',
                  '2007-01-17','2007-01-18','2007-01-19','2007-01-20',
                  '2007-01-21','2007-01-22','2007-01-23','2007-01-24',
                  '2007-01-25','2007-01-26','2007-01-27','2007-01-28',
                  '2007-01-29','2007-01-30','2007-01-31') 
GROUP BY 
  concat(revert_domain,url_prefix)  
ORDER BY 
 uniq desc, cnt desc
LIMIT 10\G

domain_id_2: rows 933660  cost 1.12039e+06 
domain_id:   rows 1931598 cost 2.31792e+06
domain_id_3: rows 2599902 cost 3.11991e+06

Range analyzer result is:
range plan for key domain_id_2, cost 1.12039e+06, records 933660

quick range select, key domain_id_2, length: 4
  434876 <= X <= 434876

best_access_path():
  key=1: (domain_id_2): ReuseRangeEstimateForRef hits, rows=933660.
  key=3: domain_id :  ReuseRangeEstimateForRef hits, rows=1931598.
  key=5L domain_id_3: 
    quick_key_parts=2, nranges=31
    ref's key_parts=1 
      ReuseRangeEstimateForRef does not hit, 
      And here we use rec-per-key statistics of #rows=302992

The problem is: We shouldn't even consider ref(const) when it is possible to use range access over more key parts.

Another problem: consider this EXPLAIN:

explain SELECT
  url_to,    
  count(*) cnt,
  revert_domain,
  url_prefix,
  count(distinct from_site_id) uniq  
FROM 
  link_in85 IGNORE INDEX (domain_message, domain_id_2) 
WHERE 
  domain_id=434876 AND
  kind='link' AND    
  isexternal=1 AND    
  message_day IN ('2007-01-01','2007-01-02','2007-01-03','2007-01-04',
                  '2007-01-05','2007-01-06','2007-01-07','2007-01-08',
                  '2007-01-09','2007-01-10','2007-01-11','2007-01-12',
                  '2007-01-13','2007-01-14','2007-01-15','2007-01-16',
                  '2007-01-17','2007-01-18','2007-01-19','2007-01-20',
                  '2007-01-21','2007-01-22','2007-01-23','2007-01-24',
                  '2007-01-25','2007-01-26','2007-01-27','2007-01-28',
                  '2007-01-29','2007-01-30','2007-01-31') 
GROUP BY 
  concat(revert_domain,url_prefix)  
ORDER BY 
  uniq desc, cnt desc
LIMIT 10\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: link_in85
         type: ref
possible_keys: domain_id,domain_id_3
          key: domain_id_3
      key_len: 4
          ref: const
         rows: 302992
        Extra: Using where; Using temporary; Using filesort
1 row in set (6 min 5.37 sec)

Here we have this scenario:

Range analyzer:
  find those two candidate accesses:
  key domain_id: found_read_time: 2.31792e+06 (cur. read_time: 1.18537e+06)
  rows= 1931598

  key domain_id_3: found_read_time: 3.11991e+06 (cur. read_time: 1.18537e+06)
  rows= 2599902
  
Both of them are considered too expensive. So we destroy all quick selects.

Best_access_path: 

ReuseRangeEstimateForRef doesn't fire for key domain_id_3,
we use the rec_per_key (index statistics) estimate, and decide that
using the index is better than not.
 ^ Here we face the same problem as described in previous case ^

Since quick select already has been destroyed, we can't use it and use ref access over a broader range.
 ^ Here is another possible issue: are there any cases where we'll need to
   re-create the quick select we've previously destroyed? It seems that once 
   we have full ReuseRangeEstimateForRef + "use range if it scans a narrower
   set" we'll never need to re-create a quick select but this needs to be
   checked.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/22131

ChangeSet@1.2466, 2007-03-16 15:00:50+01:00, mhansson@linux-st28.site +3 -0
  bug #26106: Wrong plan may be chosen when there are several possible range and ref 
  accesses
  
  Credits go to Sergey Petrunia for this description of the bug.
  
  For ref(const) access methods there are two kinds of statistics when
  the WHERE has an AND-part of  "t.keyXpart1 = c1 AND ... t.keyXpartN = cN", 
  N < key size.
  
  1. Table statistics, i.e. index cardinality values
  2. Range access estimate
  
  those estimates are not necessarily consistent with each other
  for example we can get
  
    E(#rows(t.key1part1 = c1)) < E( #rows( t.key1part1 = c1 
                                           AND     
                                           t.key1part2 IN (1, 2) ) 
                                  )
  or 
     
    E(#rows(t.key1part1 = c1)) < E( #rows( t.key1part1 = c1
                                         AND
                                         t.key1part2 = othertbl.field )
                                  )
  
  which is obviously not true. The source of discrepancy is that index
  cardinality values can be very inaccurate for skewed datasets. The code
  marked as ReuseRangeEstimateForRef-[1-4] tries to do a smart job of using
  the best available estimate. e.g. if we have candidate accesses:
  
    * B = ref-access( t.key1part1 = c1 AND t.key1part2 = othertbl.field )
    * A = range-access( t.key1part1 = c1 )
  
  and we have the E( #rows(A) ) <  E( #rows(A) ) as above, we'll pick
  the #rows estimate from range access and use it with ref access. (we
  will use ref access as it uses more keyparts, that's a heuristic)
  
  Ok, when we have range and ref over the same index we have these rules:
  
    * When access method X scans a subset of rows that access method Y will
      scan, use X, no matter what the cost numbers say.
  
    * Look if the obtained cost estimates are consistent. If there is an
      inconsistency, eliminate it by using range access estimates instead of
      ref.
  
  The bug is that the above rules weren't followed. best_access_path always chose
  the index with the lowest number of records, even if the figure wasn't reliable.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/22323

ChangeSet@1.2466, 2007-03-20 08:24:17+01:00, mhansson@linux-st28.site +3 -0
  bug #26106: Wrong plan may be chosen when there are several possible range and ref 
  accesses
  
  Credits go to Sergey Petrunia for this description of the bug.
  
  For ref(const) access methods there are two kinds of statistics when
  the WHERE has an AND-part of  "t.keyXpart1 = c1 AND ... t.keyXpartN = cN", 
  N < key size.
  
  1. Table statistics, i.e. index cardinality values
  2. Range access estimate
  
  those estimates are not necessarily consistent with each other
  for example we can get
  
    E(#rows(t.key1part1 = c1)) < E( #rows( t.key1part1 = c1 
                                           AND     
                                           t.key1part2 IN (1, 2) ) 
                                  )
  or 
     
    E(#rows(t.key1part1 = c1)) < E( #rows( t.key1part1 = c1
                                         AND
                                         t.key1part2 = othertbl.field )
                                  )
  
  which is obviously not true. The source of discrepancy is that index
  cardinality values can be very inaccurate for skewed datasets. The code
  marked as ReuseRangeEstimateForRef-[1-4] tries to do a smart job of using
  the best available estimate. e.g. if we have candidate accesses:
  
    * B = ref-access( t.key1part1 = c1 AND t.key1part2 = othertbl.field )
    * A = range-access( t.key1part1 = c1 )
  
  and we have the E( #rows(A) ) <  E( #rows(A) ) as above, we'll pick
  the #rows estimate from range access and use it with ref access. (we
  will use ref access as it uses more keyparts, that's a heuristic)
  
  When we have range and ref over the same index we have these rules:
  
    * When access method X scans a subset of rows that access method Y will
      scan, use X, no matter what the cost numbers say.
  
    * Look if the obtained cost estimates are consistent. If there is an
      inconsistency, eliminate it by using range access estimates instead of
      ref.
  
  The bug is that the above rules weren't followed. best_access_path always chose
  the index with the lowest number of records, even if the figure wasn't reliable.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/22629

ChangeSet@1.2466, 2007-03-22 13:26:42+01:00, mhansson@linux-st28.site +3 -0
  bug #26106: Wrong plan may be chosen when there are several possible range and ref 
  accesses
  
  Credits go to Sergey Petrunia for this description of the bug.
  
  For ref(const) access methods there are two kinds of statistics when
  the WHERE has an AND-part of  "t.keyXpart1 = c1 AND ... t.keyXpartN = cN", 
  N < key size.
  
  1. Table statistics, i.e. index cardinality values
  2. Range access estimate
  
  those estimates are not necessarily consistent with each other
  for example we can get
  
    E(#rows(t.key1part1 = c1)) < E( #rows( t.key1part1 = c1 
                                           AND     
                                           t.key1part2 IN (1, 2) ) 
                                  )
  or 
     
    E(#rows(t.key1part1 = c1)) < E( #rows( t.key1part1 = c1
                                         AND
                                         t.key1part2 = othertbl.field )
                                  )
  
  which is obviously not true. The source of discrepancy is that index
  cardinality values can be very inaccurate for skewed datasets. The code
  marked as ReuseRangeEstimateForRef-[1-4] tries to do a smart job of using
  the best available estimate. e.g. if we have candidate accesses:
  
    * B = ref-access( t.key1part1 = c1 AND t.key1part2 = othertbl.field )
    * A = range-access( t.key1part1 = c1 )
  
  and we have the E( #rows(A) ) <  E( #rows(B) ) as above, we'll pick
  the #rows estimate from range access and use it with ref access. (we
  will use ref access as it uses more keyparts, that's a heuristic)
  
  When we have range and ref over the same index we have these rules:
  
    * When access method X scans a subset of rows that access method Y will
      scan, use X, no matter what the cost numbers say.
  
    * Look if the obtained cost estimates are consistent. If there is an
      inconsistency, eliminate it by using range access estimates instead of
      ref.
  
  The bug is that the above rules weren't followed. best_access_path always chose
  the index with the lowest number of records, even if the figure wasn't reliable.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/22633

ChangeSet@1.2466, 2007-03-22 13:48:41+01:00, mhansson@linux-st28.site +3 -0
  bug #26106: Wrong plan may be chosen when there are several possible range and ref 
  accesses
  
  Credits go to Sergey Petrunia for this description of the bug.
  
  For ref(const) access methods there are two kinds of statistics when
  the WHERE has an AND-part of  "t.keyXpart1 = c1 AND ... t.keyXpartN = cN", 
  N < key size.
  
  1. Table statistics, i.e. index cardinality values
  2. Range access estimate
  
  those estimates are not necessarily consistent with each other
  for example we can get
  
    E(#rows(t.key1part1 = c1)) < E( #rows( t.key1part1 = c1 
                                           AND     
                                           t.key1part2 IN (1, 2) ) 
                                  )
  or 
     
    E(#rows(t.key1part1 = c1)) < E( #rows( t.key1part1 = c1
                                         AND
                                         t.key1part2 = othertbl.field )
                                  )
  
  which is obviously not true. The source of discrepancy is that index
  cardinality values can be very inaccurate for skewed datasets. The code
  marked as ReuseRangeEstimateForRef-[1-4] tries to do a smart job of using
  the best available estimate. e.g. if we have candidate accesses:
  
    * B = ref-access( t.key1part1 = c1 AND t.key1part2 = othertbl.field )
    * A = range-access( t.key1part1 = c1 )
  
  and we have the E( #rows(A) ) <  E( #rows(B) ) as above, we'll pick
  the #rows estimate from range access and use it with ref access. (we
  will use ref access as it uses more keyparts, that's a heuristic)
  
  When we have range and ref over the same index we have these rules:
  
    * When access method X scans a subset of rows that access method Y will
      scan, use X, no matter what the cost numbers say.
  
    * Look if the obtained cost estimates are consistent. If there is an
      inconsistency, eliminate it by using range access estimates instead of
      ref.
  
  The bug is that the above rules weren't followed. best_access_path always chose
  the index with the lowest number of records, even if the figure wasn't reliable.

Another possible problem (maybe this needs to be separated into another
bug):

Indexes domain_id_3 and domain_message are defined over the same sets
of colums:

  domain_id_3(`domain_id`,`message_day`,`revert_domain`,`url_prefix`,`from_site_id`,`isexternal`)
  domain_message(`domain_id`,`message_day`,`isexternal`,`revert_domain`,`url_prefix`,`from_site_id`)

The query 

explain SELECT *
FROM    
  link_in85
WHERE    
  domain_id=434876 AND
  isexternal=1 AND    
  message_day IN ('2007-01-01','2007-01-02','2007-01-03','2007-01-04',
                  '2007-01-05','2007-01-06','2007-01-07','2007-01-08',
                  '2007-01-09','2007-01-10','2007-01-11','2007-01-12',
                  '2007-01-13','2007-01-14','2007-01-15','2007-01-16',
                  '2007-01-17','2007-01-18','2007-01-19','2007-01-20',
                  '2007-01-21','2007-01-22','2007-01-23','2007-01-24',
                  '2007-01-25','2007-01-26','2007-01-27','2007-01-28',
                  '2007-01-29','2007-01-30','2007-01-31') 

uses 2 key parts for domain_id_3, and 3 key parts for domain_message. 
The #records estimate for range is the same for both indexes and equal to 31.
(Perhaps the cause of this is that records_in_range() may not return 0, as
that signifies an empty range)

The default read_cost() functions produces equal costs, and domain_id_3 is
chosen over domain_message. 

Actually, the ranges over domain_id_3 contain 30 rows, while ranges over
domain_message contain 0 rows, so domain_message is better.

MySQL could analyze situations like this and chose index for which more
predicates/keyparts are used.

Investigation log
=================

table data
----------

select count(*) from link_in85 where  domain_id=434876 AND 
  message_day IN ('2007-01-01','2007-01-02','2007-01-03','2007-01-04',
                  '2007-01-05','2007-01-06','2007-01-07','2007-01-08',
                  '2007-01-09','2007-01-10','2007-01-11','2007-01-12',
                  '2007-01-13','2007-01-14','2007-01-15','2007-01-16',
                  '2007-01-17','2007-01-18','2007-01-19','2007-01-20',
                  '2007-01-21','2007-01-22','2007-01-23','2007-01-24',
                  '2007-01-25','2007-01-26','2007-01-27','2007-01-28',
                  '2007-01-29','2007-01-30','2007-01-31') ;
+----------+
| count(*) |
+----------+
|       31 | 
+----------+
1 row in set (0.06 sec)

select count(*) from link_in85 where  domain_id=434876 AND 
  message_day IN ('2007-01-01','2007-01-02','2007-01-03','2007-01-04',
                  '2007-01-05','2007-01-06','2007-01-07','2007-01-08',
                  '2007-01-09','2007-01-10','2007-01-11','2007-01-12',
                  '2007-01-13','2007-01-14','2007-01-15','2007-01-16',
                  '2007-01-17','2007-01-18','2007-01-19','2007-01-20',
                  '2007-01-21','2007-01-22','2007-01-23','2007-01-24',
                  '2007-01-25','2007-01-26','2007-01-27','2007-01-28',
                  '2007-01-29','2007-01-30','2007-01-31') and isexternal=1;
+----------+
| count(*) |
+----------+
|        0 | 
+----------+
1 row in set (0.01 sec)

Optimizer runs
--------------
explain select * from link_in85 force index(domain_id_3) 
where  domain_id=434876 AND isexternal=1 
  message_day IN ('2007-01-01','2007-01-02','2007-01-03','2007-01-04',
                  '2007-01-05','2007-01-06','2007-01-07','2007-01-08',
                  '2007-01-09','2007-01-10','2007-01-11','2007-01-12',
                  '2007-01-13','2007-01-14','2007-01-15','2007-01-16',
                  '2007-01-17','2007-01-18','2007-01-19','2007-01-20',
                  '2007-01-21','2007-01-22','2007-01-23','2007-01-24',
                  '2007-01-25','2007-01-26','2007-01-27','2007-01-28',
                  '2007-01-29','2007-01-30','2007-01-31') ;
+----+-------------+-----------+-------+---------------+-------------+---------+------+------+-------------+
| id | select_type | table     | type  | possible_keys | key         | key_len | ref  | rows | Extra       |
+----+-------------+-----------+-------+---------------+-------------+---------+------+------+-------------+
|  1 | SIMPLE      | link_in85 | range | domain_id_3   | domain_id_3 | 7       | NULL |   31 | Using where | 
+----+-------------+-----------+-------+---------------+-------------+---------+------+------+-------------+

In the .trace:
T@3    : | | | | | | | | | | info: Returning range plan for key domain_id_3, cost 68.21, records 31
T@3    : | | | | | | | | | <get_key_scans_params
...
quick range select, key domain_id_3, length: 7
  434876/2007-01-01 <= X <= 434876/2007-01-01
  434876/2007-01-02 <= X <= 434876/2007-01-02
  434876/2007-01-03 <= X <= 434876/2007-01-03
  434876/2007-01-04 <= X <= 434876/2007-01-04
  434876/2007-01-05 <= X <= 434876/2007-01-05
  434876/2007-01-06 <= X <= 434876/2007-01-06
  434876/2007-01-07 <= X <= 434876/2007-01-07
  434876/2007-01-08 <= X <= 434876/2007-01-08
  434876/2007-01-09 <= X <= 434876/2007-01-09
  434876/2007-01-10 <= X <= 434876/2007-01-10
  434876/2007-01-11 <= X <= 434876/2007-01-11
  434876/2007-01-12 <= X <= 434876/2007-01-12
  434876/2007-01-13 <= X <= 434876/2007-01-13
  434876/2007-01-14 <= X <= 434876/2007-01-14
  434876/2007-01-15 <= X <= 434876/2007-01-15
  434876/2007-01-16 <= X <= 434876/2007-01-16
  434876/2007-01-17 <= X <= 434876/2007-01-17
  434876/2007-01-18 <= X <= 434876/2007-01-18
  434876/2007-01-19 <= X <= 434876/2007-01-19
  434876/2007-01-20 <= X <= 434876/2007-01-20
  434876/2007-01-21 <= X <= 434876/2007-01-21
  434876/2007-01-22 <= X <= 434876/2007-01-22
  434876/2007-01-23 <= X <= 434876/2007-01-23
  434876/2007-01-24 <= X <= 434876/2007-01-24
  434876/2007-01-25 <= X <= 434876/2007-01-25
  434876/2007-01-26 <= X <= 434876/2007-01-26
  434876/2007-01-27 <= X <= 434876/2007-01-27
  434876/2007-01-28 <= X <= 434876/2007-01-28
  434876/2007-01-29 <= X <= 434876/2007-01-29
  434876/2007-01-30 <= X <= 434876/2007-01-30

=== Now for domain_message: ===

explain select * from link_in85 force index(domain_message) 
where  domain_id=434876 AND isexternal=1 
  message_day IN ('2007-01-01','2007-01-02','2007-01-03','2007-01-04',
                  '2007-01-05','2007-01-06','2007-01-07','2007-01-08',
                  '2007-01-09','2007-01-10','2007-01-11','2007-01-12',
                  '2007-01-13','2007-01-14','2007-01-15','2007-01-16',
                  '2007-01-17','2007-01-18','2007-01-19','2007-01-20',
                  '2007-01-21','2007-01-22','2007-01-23','2007-01-24',
                  '2007-01-25','2007-01-26','2007-01-27','2007-01-28',
                  '2007-01-29','2007-01-30','2007-01-31') ;
+----+-------------+-----------+-------+----------------+----------------+---------+------+------+-------------+
| id | select_type | table     | type  | possible_keys  | key            | key_len | ref  | rows | Extra       |
+----+-------------+-----------+-------+----------------+----------------+---------+------+------+-------------+
|  1 | SIMPLE      | link_in85 | range | domain_message | domain_message | 8       | NULL |   31 | Using where | 
+----+-------------+-----------+-------+----------------+----------------+---------+------+------+-------------+

In the .trace:

T@3    : | | | | | | | | | | info: Returning range plan for key domain_message, cost 68.21, records 31
T@3    : | | | | | | | | | <get_key_scans_params
quick range select, key domain_message, length: 8
  434876/2007-01-01/1 <= X <= 434876/2007-01-01/1
  434876/2007-01-02/1 <= X <= 434876/2007-01-02/1
  434876/2007-01-03/1 <= X <= 434876/2007-01-03/1
  434876/2007-01-04/1 <= X <= 434876/2007-01-04/1
  434876/2007-01-05/1 <= X <= 434876/2007-01-05/1
  434876/2007-01-06/1 <= X <= 434876/2007-01-06/1
  434876/2007-01-07/1 <= X <= 434876/2007-01-07/1
  434876/2007-01-08/1 <= X <= 434876/2007-01-08/1
  434876/2007-01-09/1 <= X <= 434876/2007-01-09/1
  434876/2007-01-10/1 <= X <= 434876/2007-01-10/1
  434876/2007-01-11/1 <= X <= 434876/2007-01-11/1
  434876/2007-01-12/1 <= X <= 434876/2007-01-12/1
  434876/2007-01-13/1 <= X <= 434876/2007-01-13/1
  434876/2007-01-14/1 <= X <= 434876/2007-01-14/1
  434876/2007-01-15/1 <= X <= 434876/2007-01-15/1
  434876/2007-01-16/1 <= X <= 434876/2007-01-16/1
  434876/2007-01-17/1 <= X <= 434876/2007-01-17/1
  434876/2007-01-18/1 <= X <= 434876/2007-01-18/1
  434876/2007-01-19/1 <= X <= 434876/2007-01-19/1
  434876/2007-01-20/1 <= X <= 434876/2007-01-20/1
  434876/2007-01-21/1 <= X <= 434876/2007-01-21/1
  434876/2007-01-22/1 <= X <= 434876/2007-01-22/1
  434876/2007-01-23/1 <= X <= 434876/2007-01-23/1
  434876/2007-01-24/1 <= X <= 434876/2007-01-24/1
  434876/2007-01-25/1 <= X <= 434876/2007-01-25/1
  434876/2007-01-26/1 <= X <= 434876/2007-01-26/1
  434876/2007-01-27/1 <= X <= 434876/2007-01-27/1
  434876/2007-01-28/1 <= X <= 434876/2007-01-28/1
  434876/2007-01-29/1 <= X <= 434876/2007-01-29/1
  434876/2007-01-30/1 <= X <= 434876/2007-01-30/1
  434876/2007-01-31/1 <= X <= 434876/2007-01-31/1
other_keys: 0x0:

Hi Sergey, this is interesting. Can you please post the dataset you were using for these last runs? I'll look into this and see if you can fix both these problems at the same time.

Mutually-inconsistent range problem aside, the patch does not solve the problem
that ref(idx_X) can be used where ref(idx_X) is also applicable and it would use more key parts (i.e. scan fewer records).

Using the patched tree, the supplied dataset:

mysql> explain SELECT
    ->   url_to,    
    ->   count(*) cnt,
    ->   revert_domain,
    ->   url_prefix,
    ->   count(distinct from_site_id) uniq  
    -> FROM 
    ->   link_in85 IGNORE INDEX (domain_message, domain_id_2) 
    -> WHERE 
    ->   domain_id=434876 AND
    ->   kind='link' AND    
    ->   isexternal=1 AND    
    ->   message_day IN ('2007-01-01','2007-01-02','2007-01-03','2007-01-04',
    ->                   '2007-01-05','2007-01-06','2007-01-07','2007-01-08',
    ->                   '2007-01-09','2007-01-10','2007-01-11','2007-01-12',
    ->                   '2007-01-13','2007-01-14','2007-01-15','2007-01-16',
    ->                   '2007-01-17','2007-01-18','2007-01-19','2007-01-20',
    ->                   '2007-01-21','2007-01-22','2007-01-23','2007-01-24',
    ->                   '2007-01-25','2007-01-26','2007-01-27','2007-01-28',
    ->                   '2007-01-29','2007-01-30','2007-01-31') 
    -> GROUP BY 
    ->   concat(revert_domain,url_prefix)  
    -> ORDER BY 
    ->   uniq desc, cnt desc
    -> LIMIT 10\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: link_in85
         type: ref
possible_keys: domain_id,domain_id_3
          key: domain_id_3
      key_len: 4
          ref: const
         rows: 2599902
        Extra: Using where; Using temporary; Using filesort

Index domain_id_3 is defined as (`domain_id`,`message_day`,`isexternal`,`revert_domain`,`url_prefix`,`from_site_id`)

The QEP uses ref(domain_id_3) over 1 key part (condition is domain_id=434876),
while range(domain_id_3) would use 3 key parts.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/28861

ChangeSet@1.2466, 2007-06-15 13:55:18+03:00, mhansson@linux-st28.site +3 -0
  bug #26106: Wrong plan may be chosen when there are several possible range and ref 
  accesses
  
  For ref(const) access methods there are two kinds of statistics when a prefix of a 
  key is matched: range access estimate and table statistics, which are not necessarily
  consistent with each other. This may make the optimizer choose a key which matches 
  fewer key parts than some other key, which is obviously not correct.
  
  Fixed by giving preferrence to the keys that match the most parts, and choosing the
  best one among these.

Requesting re-triage, the latest action was almost a year ago.

Knew about the bug for a while, but didn't know that there are patches exist. Sad that this is flagged as minor issue as it's easy to reproduce. Here is simple way:

mysql> select version();
| 5.0.77-log | 

mysql> create table b26106(type varchar(64), content varchar(64));
mysql> insert into b26106 values('ticket', 'asd');
mysql> insert into b26106 values('ticket', 'asd');
mysql> insert into b26106 values('ticket', 'zczx');
mysql> insert into b26106 values('ticket', 'sdfgs');
mysql> insert into b26106 values('ticket', 'qwer');
mysql> insert into b26106 values('ticket', 'ldgjh');

mysql> create index content on b26106(content);
Query OK, 6 rows affected (0.08 sec)
Records: 6  Duplicates: 0  Warnings: 0

mysql> select * from b26106 where content >= 'd' and content <= 'z';
+--------+---------+
| type   | content |
+--------+---------+
| ticket | ldgjh   | 
| ticket | qwer    | 
| ticket | sdfgs   | 
+--------+---------+
3 rows in set (0.00 sec)

mysql> EXPLAIN select * from b26106 where content >= 'd' and content <= 'z';
+----+-------------+--------+-------+---------------+---------+---------+------+------+-------------+
| id | select_type | table  | type  | possible_keys | key     | key_len | ref  | rows | Extra       |
+----+-------------+--------+-------+---------------+---------+---------+------+------+-------------+
|  1 | SIMPLE      | b26106 | range | content       | content | 67      | NULL |    2 | Using where | 
+----+-------------+--------+-------+---------------+---------+---------+------+------+-------------+
1 row in set (0.00 sec)

mysql> create index type_content on b26106(type, content);
Query OK, 6 rows affected (0.01 sec)
Records: 6  Duplicates: 0  Warnings: 0

mysql> EXPLAIN select * from b26106 where type = 'ticket' AND content >= 'd' and content <= 'z';
+----+-------------+--------+------+----------------------+--------------+---------+-------+------+--------------------------+
| id | select_type | table  | type | possible_keys        | key          | key_len | ref   | rows | Extra                    |
+----+-------------+--------+------+----------------------+--------------+---------+-------+------+--------------------------+
|  1 | SIMPLE      | b26106 | ref  | content,type_content | type_content | 67      | const |    1 | Using where; Using index | 
+----+-------------+--------+------+----------------------+--------------+---------+-------+------+--------------------------+
1 row in set (0.00 sec)

So there is no need in very special test case for this bug to manifest itself.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/112106

3270 Martin Hansson	2010-06-24
      Bug#26106: Wrong plan may be chosen when there are several possible range and
      ref accesses
        
      For ref(const) access methods there are two kinds of statistics when a prefix
      of a key is matched: range access estimate and table statistics, which are not
      necessarily consistent with each other. This may make the optimizer choose a
      key which matches fewer key parts than some other key, which is obviously not
      correct.
        
      Fixed by giving preferrence to the keys that match the most parts, and
      choosing the best one among these.

Pushed into mysql-next-mr (revid:alik@sun.com-20100723121929-90e9zemk3jkr2ocy) (version source revid:vasil.dimov@oracle.com-20100531152341-x2d4hma644icamh1) (pib:18)

Pushed into mysql-trunk 5.6.1-m4 (revid:alik@ibmvm-20100804080001-bny5271e65xo34ig) (version source revid:vasil.dimov@oracle.com-20100531152341-x2d4hma644icamh1) (merge vers: 5.5.5-m3) (pib:18)

Pushed into mysql-trunk 5.6.1-m4 (revid:alik@ibmvm-20100804081533-c1d3rbipo9e8rt1s) (version source revid:vasil.dimov@oracle.com-20100531152341-x2d4hma644icamh1) (merge vers: 5.5.5-m3) (pib:18)

Noted in 5.6.0 changelog.

A suboptimal query execution plan could chosen when there were
several possible range and ref accesses. Now preference is given to
the keys that match the most parts and choosing the best one among
them.