Bug #94903 optimazer chooses inefficient plan for order by + limit in subquery
Submitted: 4 Apr 8:39 Modified: 8 Apr 20:27
Reporter: Василий Лукьянчиков Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: Optimizer Severity:S3 (Non-critical)
Version:8.0.15 OS:Any
Assigned to: CPU Architecture:Any

[4 Apr 8:39] Василий Лукьянчиков
Description:
Optimazer chooses inefficient plan for statement like
..from table_1 LATERAL (select * from table_2 where col1=table_1.col order by col2 limit N)

Mysql do not uses unique index (col1,col2), instead of that it reads all rows from group (col1=table_1.col) and does filesort. 

How to repeat:
CREATE TABLE `posts` (
`post_id` int(11) NOT NULL AUTO_INCREMENT,
`user_id` int(11) NOT NULL,
`date_added` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`post_text` text NOT NULL,
PRIMARY KEY (`post_id`),
UNIQUE KEY `user_id` (`user_id`,`date_added`)
) ENGINE=InnoDB AUTO_INCREMENT=32754 DEFAULT CHARSET=utf8;

/*
15962 rows in table
20 distinct user_id
~800 rows per group
*/

We want to find 3 last post for each user.

mysql> explain select t2.* from (select user_id from posts group by user_id) as t1,
-> lateral (select * from posts where t1.user_id=posts.user_id order by date_added desc limit 3) as t2;
+----+-------------------+------------+------------+-------+---------------+---------+---------+------------+------+----------+----------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------------+------------+------------+-------+---------------+---------+---------+------------+------+----------+----------------------------+
| 1 | PRIMARY | | NULL | ALL | NULL | NULL | NULL | NULL | 21 | 100.00 | Rematerialize () |
| 1 | PRIMARY | | NULL | ALL | NULL | NULL | NULL | NULL | 3 | 100.00 | NULL |
| 3 | DEPENDENT DERIVED | posts | NULL | ref | user_id | user_id | 4 | t1.user_id | 801 | 100.00 | Using filesort |
| 2 | DERIVED | posts | NULL | range | user_id | user_id | 4 | NULL | 21 | 100.00 | Using index for group-by |
+----+-------------------+------------+------------+-------+---------------+---------+---------+------------+------+----------+----------------------------+

You can download optimizer trace here: https://sqlinfo.ru/forum/attachment.php?item=671
[5 Apr 13:26] Sinisa Milivojevic
Hi Vasilii,

Thank you for your bug report.

In order to verify your bug, we need lot's of additional information.

First of all, you should measure the performance of the optimiser plan as is, by running query on the same tables, while MySQL server is totally idle.

Next, you can FORCE INDEX on the index that you feel would be faster and measure the time again. If you are using any Unix, LInux or Mac, you can use command `time` , by running both queries three times. 

Let us know all the times that you have obtained therewith.

Let me also tell you that GROUP BY resolution, without index, requires lots of time and machine resources.

After we get the speed of the two queries, we shall proceed with other requests for the information.

You can find how to set FORCE INDEX in our Reference Manual for 8.0.

Thanks in advance for your feedback.
[8 Apr 8:14] Василий Лукьянчиков
Hi Sinisa,

I use test instance on my home notebook. There is no other queries. MySQL server is totally idle.
First of all, I used force index. There is no effect.

Guilhem Bichot recommended workaround - extend ORDER BY, like

explain select t2.* from (select user_id from posts group by user_id) as t1, lateral (select * from posts where t1.user_id=posts.user_id order by user_id desc, date_added desc limit 3) as t2;
+----+-------------------+------------+------------+-------+---------------+---------+---------+------------+------+----------+----------------------------+
| id | select_type       | table      | partitions | type  | possible_keys | key     | key_len | ref        | rows | filtered | Extra                      |
+----+-------------------+------------+------------+-------+---------------+---------+---------+------------+------+----------+----------------------------+
|  1 | PRIMARY           | <derived2> | NULL       | ALL   | NULL          | NULL    | NULL    | NULL       |   21 |   100.00 | Rematerialize (<derived3>) |
|  1 | PRIMARY           | <derived3> | NULL       | ALL   | NULL          | NULL    | NULL    | NULL       |    3 |   100.00 | NULL                       |
|  3 | DEPENDENT DERIVED | posts      | NULL       | ref   | user_id       | user_id | 4       | t1.user_id |  748 |   100.00 | Backward index scan        |
|  2 | DERIVED           | posts      | NULL       | range | user_id       | user_id | 4       | NULL       |   21 |   100.00 | Using index for group-by   |
+----+-------------------+------------+------------+-------+---------------+---------+---------+------------+------+----------+----------------------------+

show status like 'handler%';
+----------------------------+-------+
| Variable_name              | Value |
+----------------------------+-------+
| Handler_commit             | 1     |
| Handler_delete             | 0     |
| Handler_discover           | 0     |
| Handler_external_lock      | 4     |
| Handler_mrr_init           | 0     |
| Handler_prepare            | 0     |
| Handler_read_first         | 1     |
| Handler_read_key           | 42    |
| Handler_read_last          | 1     |
| Handler_read_next          | 0     |
| Handler_read_prev          | 40    |
| Handler_read_rnd           | 0     |
| Handler_read_rnd_next      | 101   |
| Handler_rollback           | 0     |
| Handler_savepoint          | 0     |
| Handler_savepoint_rollback | 0     |
| Handler_update             | 0     |
| Handler_write              | 80    |
+----------------------------+-------+

show status like 'sort%';
+-------------------+-------+
| Variable_name     | Value |
+-------------------+-------+
| Sort_merge_passes | 0     |
| Sort_range        | 0     |
| Sort_rows         | 0     |
| Sort_scan         | 0     |
+-------------------+-------+

for comparison my original query without black magic:

explain select t2.* from (select user_id from posts group by user_id) as t1, lateral (select * from posts where t1.user_id=posts.user_id order by date_added desc limit 3) as t2;
+----+-------------------+------------+------------+-------+---------------+---------+---------+------------+------+----------+----------------------------+
| id | select_type       | table      | partitions | type  | possible_keys | key     | key_len | ref        | rows | filtered | Extra                      |
+----+-------------------+------------+------------+-------+---------------+---------+---------+------------+------+----------+----------------------------+
|  1 | PRIMARY           | <derived2> | NULL       | ALL   | NULL          | NULL    | NULL    | NULL       |   21 |   100.00 | Rematerialize (<derived3>) |
|  1 | PRIMARY           | <derived3> | NULL       | ALL   | NULL          | NULL    | NULL    | NULL       |    3 |   100.00 | NULL                       |
|  3 | DEPENDENT DERIVED | posts      | NULL       | ref   | user_id       | user_id | 4       | t1.user_id |  748 |   100.00 | Using filesort             |
|  2 | DERIVED           | posts      | NULL       | range | user_id       | user_id | 4       | NULL       |   21 |   100.00 | Using index for group-by   |
+----+-------------------+------------+------------+-------+---------------+---------+---------+------------+------+----------+----------------------------+

show status like 'handler%';
+----------------------------+-------+
| Variable_name              | Value |
+----------------------------+-------+
| Handler_commit             | 1     |
| Handler_delete             | 0     |
| Handler_discover           | 0     |
| Handler_external_lock      | 4     |
| Handler_mrr_init           | 0     |
| Handler_prepare            | 0     |
| Handler_read_first         | 1     |
| Handler_read_key           | 102   |
| Handler_read_last          | 1     |
| Handler_read_next          | 15962 |  <-- ! MySQL server read all table
| Handler_read_prev          | 0     |
| Handler_read_rnd           | 60    |
| Handler_read_rnd_next      | 101   |
| Handler_rollback           | 0     |
| Handler_savepoint          | 0     |
| Handler_savepoint_rollback | 0     |
| Handler_update             | 0     |
| Handler_write              | 80    |
+----------------------------+-------+

show status like 'sort%';
+-------------------+-------+
| Variable_name     | Value |
+-------------------+-------+
| Sort_merge_passes | 0     |
| Sort_range        | 0     |
| Sort_rows         | 60    |
| Sort_scan         | 20    |
+-------------------+-------+
[8 Apr 13:19] Sinisa Milivojevic
Hi,

Thanks again, but you did not do what I asked you to do.

I just need times for the query without FORCE INDEX (the index that you think would work faster) and with FORCE INDEX.

You can also use optimiser hints that are described in our Reference Manual. There are cases where you can help Optimiser to choose the correct path.
[8 Apr 20:27] Guilhem Bichot
Hello Sinisa! I discussed with Vasiliy about this, in blog:
https://guilhembichot.blogspot.com/2019/01/support-for-lateral-derived-tables.html
and there is indeed a problem. It's not specific of LATERAL, though, can also be seen with:
explain select (select post_text from posts where t1.user_id=posts.user_id order by date_added limit 3) as s from posts t1;

Chaithra, who dealt a lot with GROUP BY recently, agrees there's a problem. In short, the LATERAL subquery should be able to just read the composite index (user_id,date_added) to get sorted data, because posts.user_id is actually constant for an execution; but MySQL doesn't see it as constant (as it's not a true constant, but an outer reference).
So I'm marking this bug as verified. Feel free to correct me if needed.