MySQL Bugs: #54599: bug in query planner on queries with "ORDER BY" and "LIMIT BY" clause

Bug #54599	bug in query planner on queries with "ORDER BY" and "LIMIT BY" clause
Submitted:	17 Jun 2010 19:42	Modified:	24 Jul 2012 14:54
Reporter:	Zardosht Kasheff (OCA)	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Optimizer	Severity:	S3 (Non-critical)
Version:	5.1.46, 5.1, 5.6.99 bzr	OS:	Any
Assigned to:		CPU Architecture:	Any
Tags:	limit by, order by

Description:
I have a MyISAM table foo with the following schema:

(a int, b int, c int, d int, ....)

and with keys:
(a,b,c)
(c,b,d,a)

The query is :
select c, count(*) num_cnt, from foo where a > 1000 and b is not NULL,
group by c, order by num_cnt;

Note that both keys are covering indexes.

When I run the query, MySQL properly selects the key (a,b,c). However,
when I add a "limit 10" to the end, the key (c,b,d,a) is used. Using
(c,b,d,a) is not good because the entire index will need to be read to
find the 10 values of 'c' that show up the most, whereas if (a,b,c) is
used, only a range is done (where a > 1000).

When I run this with MyISAM, the proper query plan returns in 1
second, whereas the bad one returns in 6 seconds.

How to repeat:
read above

Suggested fix:
because the order by clause prevents the limit by from reading a small subset of the table, the key (c,b,d,a) should not be used.

Thank you for the report.

Please provide real queries and create table info: with test data in my environment key(a,b,c) is always used.

Thank you for the feedback.

I understand optimizer does optimization wrongly for customer data. But we can not verify and fix the bug until we are able to repeat the problem. But is not repeatable if I use following code:

create table t1 (a int, b int, c int, d int, key(a,b,c), key(c,b,d,a));

--disable_query_log
let $i = 1000;
while ($i)
{
--eval insert into t1 values($i,$i,$i,$i)
dec $i;
}
--enable_query_log

--vertical_results
explain select c, count(*) num_cnt from t1 where a > 1000 and b is not NULL
group by c order by num_cnt;

explain select c, count(*) num_cnt from t1 where a > 1000 and b is not NULL
group by c order by num_cnt limit 10;

Please change real column names to a,b,c, etc., table name to t1 and send us table structure and real query with such replacement which hides private data, but does not hide structure.

How to reproduce:

CREATE TABLE `t` (
  `a` int(11) DEFAULT NULL,
  `b` int(11) DEFAULT NULL,
  `c` int(11) DEFAULT NULL,
  KEY `a` (`a`,`b`,`c`),
  KEY `c` (`c`,`b`,`a`)
)

insert elements (1,1,1),(1,1,2),(1,1,3)...(1,1,300),(1,2,1),(1,2,2)...(300,300,300)

So there should be 300*300*300 elements.

Then do the following queries:
mysql> explain select c,count(*) numimp from t where a > 250 group by c order by
 numimp limit 10;
+----+-------------+-------+-------+---------------+------+---------+------+----
--+-----------------------------------------------------------+
| id | select_type | table | type  | possible_keys | key  | key_len | ref  | row
s | Extra                                                     |
+----+-------------+-------+-------+---------------+------+---------+------+----
--+-----------------------------------------------------------+
|  1 | SIMPLE      | t     | index | a             | c    | 15      | NULL |   7
2 | Using where; Using index; Using temporary; Using filesort |
+----+-------------+-------+-------+---------------+------+---------+------+----
--+-----------------------------------------------------------+
1 row in set (0.00 sec)

mysql> explain select c,count(*) numimp from t where a > 250 group by c order by
 numimp;
+----+-------------+-------+-------+---------------+------+---------+------+----
-----+-----------------------------------------------------------+
| id | select_type | table | type  | possible_keys | key  | key_len | ref  | row
s    | Extra                                                     |
+----+-------------+-------+-------+---------------+------+---------+------+----
-----+-----------------------------------------------------------+
|  1 | SIMPLE      | t     | range | a             | a    | 5       | NULL | 336
4000 | Using where; Using index; Using temporary; Using filesort |
+----+-------------+-------+-------+---------------+------+---------+------+----
-----+-----------------------------------------------------------+
1 row in set (0.00 sec)

Both queries should use 'a'. Only the one without the "LIMIT BY" does.

Thank you for the feedback.

In the initial description you wrote:

"When I run the query, MySQL properly selects the key (a,b,c). However,
when I add a "limit 10" to the end, the key (c,b,d,a) is used. Using
(c,b,d,a) is not good because the entire index will need to be read to
find the 10 values of 'c' that show up the most, whereas if (a,b,c) is
used, only a range is done (where a > 1000)."

While output from last comment shows only 72 rows examined if limit used and 3364000 if no limit used.

Additionally I see no performance regression with test data again:

select now();
now()   2010-06-18 22:10:28
select c, count(*) num_cnt from t1 where a > 250 group by c order by num_cnt;
<results>
select now();
now()   2010-06-18 22:10:40
select c, count(*) num_cnt from t1 where a > 250 group by c order by num_cnt limit 10;
<results>
select now();
now()   2010-06-18 22:10:46

Why do you consider this behavior buggy?

Don't trust the rows value in the explain statement. It is wrong. If it really did only examine 72 rows in the "limit by" case, it would not have finished in 6 seconds. It is doing a full index scan.

The problem with the query plan is that using the key 'c' requires a full index scan, whereas using 'a' requires a partial index scan. Using 'c' requires a full index scan because of the "order by" clause. If the "order by" clause did not exist, then the query plan would be fine.

Instead of having "where a > 250", try "where a > 285". The problem still exists.

Thank you for the feedback.

Verified as described: after limiting area for search query with LIMIT works slower than query without LIMIT.

Test case for MTR:

create table t1 (a int, b int, c int, key(a,b,c), key(c,b,a));

--disable_query_log
let $i = 1000;
while ($i)
{
let $j = 1000;
while ($j)
{
--eval insert into t1 values($i,$j,$i)
dec $j;
}
dec $i;
}
--enable_query_log

--vertical_results
explain select c, count(*) num_cnt from t1 where a > 885 group by c order by num_cnt;

explain select c, count(*) num_cnt from t1 where a > 885 group by c order by num_cnt limit 10;

select now();
--disable_result_log
select c, count(*) num_cnt from t1 where a > 885 group by c order by num_cnt;
--enable_result_log
select now();
--disable_result_log
select c, count(*) num_cnt from t1 where a > 885 group by c order by num_cnt limit 10;
--enable_result_log
select now();

triage: setting to CHECKED (w3 use ignore index)

Noted in 5.1.66, 5.5.28, 5.6.7, 5.7.0 changelogs.

Adding a LIMIT clause to a query containing GROUP BY and ORDER BY
could cause the optimizer to choose an incorrect index for processing
the query, and return more rows than required.