Bug #54599 bug in query planner on queries with "ORDER BY" and "LIMIT BY" clause
Submitted: 17 Jun 2010 19:42 Modified: 24 Jul 2012 14:54
Reporter: Zardosht Kasheff (OCA) Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Optimizer Severity:S3 (Non-critical)
Version:5.1.46, 5.1, 5.6.99 bzr OS:Any
Assigned to: CPU Architecture:Any
Tags: limit by, order by

[17 Jun 2010 19:42] Zardosht Kasheff
Description:
I have a MyISAM table foo with the following schema:

(a int, b int, c int, d int, ....)

and with keys:
(a,b,c)
(c,b,d,a)

The query is :
select c, count(*) num_cnt, from foo where a > 1000 and b is not NULL,
group by c, order by num_cnt;

Note that both keys are covering indexes.

When I run the query, MySQL properly selects the key (a,b,c). However,
when I add a "limit 10" to the end, the key (c,b,d,a) is used. Using
(c,b,d,a) is not good because the entire index will need to be read to
find the 10 values of 'c' that show up the most, whereas if (a,b,c) is
used, only a range is done (where a > 1000).

When I run this with MyISAM, the proper query plan returns in 1
second, whereas the bad one returns in 6 seconds.

How to repeat:
read above

Suggested fix:
because the order by clause prevents the limit by from reading a small subset of the table, the key (c,b,d,a) should not be used.
[17 Jun 2010 20:52] Sveta Smirnova
Thank you for the report.

Please provide real queries and create table info: with test data in my environment key(a,b,c) is always used.
[18 Jun 2010 5:47] Sveta Smirnova
Thank you for the feedback.

I understand optimizer does optimization wrongly for customer data. But we can not verify and fix the bug until we are able to repeat the problem. But is not repeatable if I use following code:

create table t1 (a int, b int, c int, d int, key(a,b,c), key(c,b,d,a));

--disable_query_log
let $i = 1000;
while ($i)
{
--eval insert into t1 values($i,$i,$i,$i)
dec $i;
}
--enable_query_log

--vertical_results
explain select c, count(*) num_cnt from t1 where a > 1000 and b is not NULL
group by c order by num_cnt;

explain select c, count(*) num_cnt from t1 where a > 1000 and b is not NULL
group by c order by num_cnt limit 10;

Please change real column names to a,b,c, etc., table name to t1 and send us table structure and real query with such replacement which hides private data, but does not hide structure.
[18 Jun 2010 14:43] Zardosht Kasheff
How to reproduce:

CREATE TABLE `t` (
  `a` int(11) DEFAULT NULL,
  `b` int(11) DEFAULT NULL,
  `c` int(11) DEFAULT NULL,
  KEY `a` (`a`,`b`,`c`),
  KEY `c` (`c`,`b`,`a`)
)

insert elements (1,1,1),(1,1,2),(1,1,3)...(1,1,300),(1,2,1),(1,2,2)...(300,300,300)

So there should be 300*300*300 elements.

Then do the following queries:
mysql> explain select c,count(*) numimp from t where a > 250 group by c order by
 numimp limit 10;
+----+-------------+-------+-------+---------------+------+---------+------+----
--+-----------------------------------------------------------+
| id | select_type | table | type  | possible_keys | key  | key_len | ref  | row
s | Extra                                                     |
+----+-------------+-------+-------+---------------+------+---------+------+----
--+-----------------------------------------------------------+
|  1 | SIMPLE      | t     | index | a             | c    | 15      | NULL |   7
2 | Using where; Using index; Using temporary; Using filesort |
+----+-------------+-------+-------+---------------+------+---------+------+----
--+-----------------------------------------------------------+
1 row in set (0.00 sec)

mysql> explain select c,count(*) numimp from t where a > 250 group by c order by
 numimp;
+----+-------------+-------+-------+---------------+------+---------+------+----
-----+-----------------------------------------------------------+
| id | select_type | table | type  | possible_keys | key  | key_len | ref  | row
s    | Extra                                                     |
+----+-------------+-------+-------+---------------+------+---------+------+----
-----+-----------------------------------------------------------+
|  1 | SIMPLE      | t     | range | a             | a    | 5       | NULL | 336
4000 | Using where; Using index; Using temporary; Using filesort |
+----+-------------+-------+-------+---------------+------+---------+------+----
-----+-----------------------------------------------------------+
1 row in set (0.00 sec)

Both queries should use 'a'. Only the one without the "LIMIT BY" does.
[18 Jun 2010 19:17] Sveta Smirnova
Thank you for the feedback.

In the initial description you wrote:

"When I run the query, MySQL properly selects the key (a,b,c). However,
when I add a "limit 10" to the end, the key (c,b,d,a) is used. Using
(c,b,d,a) is not good because the entire index will need to be read to
find the 10 values of 'c' that show up the most, whereas if (a,b,c) is
used, only a range is done (where a > 1000)."

While output from last comment shows only 72 rows examined if limit used and 3364000 if no limit used.

Additionally I see no performance regression with test data again:

select now();
now()   2010-06-18 22:10:28
select c, count(*) num_cnt from t1 where a > 250 group by c order by num_cnt;
<results>
select now();
now()   2010-06-18 22:10:40
select c, count(*) num_cnt from t1 where a > 250 group by c order by num_cnt limit 10;
<results>
select now();
now()   2010-06-18 22:10:46

Why do you consider this behavior buggy?
[18 Jun 2010 19:41] Zardosht Kasheff
Don't trust the rows value in the explain statement. It is wrong. If it really did only examine 72 rows in the "limit by" case, it would not have finished in 6 seconds. It is doing a full index scan.

The problem with the query plan is that using the key 'c' requires a full index scan, whereas using 'a' requires a partial index scan. Using 'c' requires a full index scan because of the "order by" clause. If the "order by" clause did not exist, then the query plan would be fine.

Instead of having "where a > 250", try "where a > 285". The problem still exists.
[18 Jun 2010 22:56] Sveta Smirnova
Thank you for the feedback.

Verified as described: after limiting area for search query with LIMIT works slower than query without LIMIT.

Test case for MTR:

create table t1 (a int, b int, c int, key(a,b,c), key(c,b,a));

--disable_query_log
let $i = 1000;
while ($i)
{
let $j = 1000;
while ($j)
{
--eval insert into t1 values($i,$j,$i)
dec $j;
}
dec $i;
}
--enable_query_log

--vertical_results
explain select c, count(*) num_cnt from t1 where a > 885 group by c order by num_cnt;

explain select c, count(*) num_cnt from t1 where a > 885 group by c order by num_cnt limit 10;

select now();
--disable_result_log
select c, count(*) num_cnt from t1 where a > 885 group by c order by num_cnt;
--enable_result_log
select now();
--disable_result_log
select c, count(*) num_cnt from t1 where a > 885 group by c order by num_cnt limit 10;
--enable_result_log
select now();
[22 Jun 2010 17:48] Omer Barnir
triage: setting to CHECKED (w3 use ignore index)
[24 Jul 2012 14:54] Paul DuBois
Noted in 5.1.66, 5.5.28, 5.6.7, 5.7.0 changelogs.

Adding a LIMIT clause to a query containing GROUP BY and ORDER BY
could cause the optimizer to choose an incorrect index for processing
the query, and return more rows than required.