MySQL Bugs: #89743: Missing an efficient way of getting the last row in each group of an index

Bug #89743	Missing an efficient way of getting the last row in each group of an index
Submitted:	21 Feb 2018 6:40	Modified:	18 Apr 2018 12:32
Reporter:	Yoseph Phillips	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: Optimizer	Severity:	S4 (Feature request)
Version:	8.0	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
We are using InnoBD on MySQL 5.7.21.
We often need to be able to perform queries as shown in the how to repeat section.

This is showing:
# id, select_type, table, partitions, type, possible_keys, key, key_len, ref, rows, filtered, Extra
'1', 'PRIMARY', 't2', NULL, 'index', NULL, 'PRIMARY', '4', NULL, '3', '100.00', 'Using index'
'1', 'PRIMARY', 't1', NULL, 'eq_ref', 'PRIMARY', 'PRIMARY', '4', 'func', '1', '100.00', 'Using where'
'2', 'DEPENDENT SUBQUERY', 't3', NULL, 'ref', 'table1Index', 'table1Index', '4', 'ltcdb_maintenance.t2.id', '100', '100.00', 'Using where; Using index; Using filesort'

When table_1 has millions of records this is causing performance issues for us.

If our index has:
1 2018-02-21 17:14:29 567 
1 2018-02-21 17:14:37 324
1 2018-02-21 17:14:37 744
1 2018-02-21 17:14:42 346 <--
2 2018-02-21 17:14:27 569
2 2018-02-21 17:14:31  34
2 2018-02-21 17:14:36 287
2 2018-02-21 17:14:40 997
2 2018-02-21 17:14:40 478 <--

Then we simply need the two rows indicated with the arrows.

Doing a GROUP BY c1 and finding max(c2) then selecting max(id) matching those two values is very messy and does not seem to help performance either.

How to repeat:
DROP TABLE IF EXISTS table_1;

CREATE TABLE table_1(id INT NOT NULL AUTO_INCREMENT, c1 INT NOT NULL, c2 TIMESTAMP NOT NULL, INDEX table1Index(c1, c2), PRIMARY KEY (id));

DROP PROCEDURE IF EXISTS temporaryProcedurePopulateTable;

DELIMITER $$

CREATE PROCEDURE temporaryProcedurePopulateTable() 
BEGIN
  DECLARE counter INT DEFAULT 0;
  
  WHILE counter < 1000 DO
    INSERT INTO table_1(c1, c2) VALUES(floor(rand() * 10), now() + INTERVAL floor(rand() * 500) SECOND);
    SET counter = counter + 1;
  END WHILE;
END $$

DELIMITER ;

CALL temporaryProcedurePopulateTable();

DROP PROCEDURE IF EXISTS temporaryProcedurePopulateTable;

DROP TABLE IF EXISTS table_2;

CREATE TABLE table_2(id INT NOT NULL, PRIMARY KEY (id));

INSERT INTO table_2(id) VALUES (1), (2), (3);

EXPLAIN
SELECT t1.*
FROM table_2 t2
INNER JOIN table_1 t1 ON t1.id = (
  SELECT t3.id
  FROM table_1 t3 
  WHERE t3.c1 = t2.id
  ORDER BY t3.c2 DESC
  LIMIT 1
);

Hi!

Thank you for your report.

First of all, let me see if we understand you correctly. You are not reporting any bug, neither a crash nor wrong results. 

You are only asking us to tell you how to get results that you need more efficiently.

Correct ???

Hi,

Correct that this is not a bug, nor a crash nor wrong results.
It is producing the correct results, but not in a efficient way.

It is a RFE to improve the performance of MySQL and alternatively to show that there is already an efficient way of performing these queries.

I believe that it is common to have an index such as:
1 2018-02-21 17:14:29 567 
1 2018-02-21 17:14:37 324
1 2018-02-21 17:14:37 744
1 2018-02-21 17:14:42 346 <--
2 2018-02-21 17:14:27 569
2 2018-02-21 17:14:31  34
2 2018-02-21 17:14:36 287
2 2018-02-21 17:14:40 397
2 2018-02-21 17:14:40 478 <--

where one needs to be able to return the rows with the <--

A common way to do that is using:
SELECT t1.*
FROM table_2 t2
INNER JOIN table_1 t1 ON t1.id = (
  SELECT t3.id
  FROM table_1 t3 
  WHERE t3.c1 = t2.id
  ORDER BY t3.c2 DESC, t3.id DESC
  LIMIT 1
);

the issue is that as EXPLAIN is showing for t3 it is Using where; Using index; Using filesort and has 100 in the rows column.

When we do this with millions of records the performance is terrible.

As mentioned I have also tried other ways with MySQL such as doing a GROUP BY c1 and finding max(c2) then selecting max(id) for all records with c1 and max(c2) however this is both messy and doesn't seem to help improve performance.

Logically given the index I would expect that there was a way to get MySQL just to look at a few rows in order to work out the last rows for each c1 in the index which would obviously perform much faster than looking through millions of rows each time in a dependent subquery.

I don't believe that MySQL 5.7.21 has an efficient way of doing this, hence the RFE.

I hope that clarifies things,
Yoseph

Hi!

First of all, I guess that RFE is Request For Enhancement. We do not call it that way, we call it a feature request. 

Before we could verify a feature, it is expected that you use optimiser directions and options which would render a much better performance. We can not ask for a feature if we do not know which optimiser plan would do significantly better.

Last, but not least, you have one dependent subqueries and those are very inefficient, by their basic definition, as you can read about it in our manual.

I have already read through all of:
https://dev.mysql.com/doc/refman/5.7/en/optimizing-subqueries.html
https://dev.mysql.com/doc/refman/5.7/en/subquery-optimization.html
https://dev.mysql.com/doc/refman/5.7/en/rewriting-subqueries.html
(and many other places for that matter)

Using this index as an example:
1 2018-02-21 17:14:29 567 
1 2018-02-21 17:14:37 324
1 2018-02-21 17:14:37 744
1 2018-02-21 17:14:42 346 <--
2 2018-02-21 17:14:27 569
2 2018-02-21 17:14:31  34
2 2018-02-21 17:14:36 287
2 2018-02-21 17:14:40 397
2 2018-02-21 17:14:40 478 <--

In this case it is challenging to do these queries without any dependent subqueries as multiple rows in each group can have the same TIMESTAMP and for the ones with the same max TIMESTAMP in each group we need the max id (346 and 478 above).

If MySQL had something like last() for example:
SELECT t.c1, last(t.id)
FROM table_1 t 
GROUP BY t.c1
ORDER BY t.c2 DESC, t.id DESC; 

then we would easily be able to find the last ids in each group (346 and 478) in this case, and be able to easily use joins instead of dependent subqueries which could well perform better, however MySQL doesn't have last().

Also I should point out that in our case we only need these last rows for a few groups and so it would be wasteful calculating them for every single group. In some of our installations with millions of records we have 8,000 or more groups which is another reason why dependent subqueries might be a better option here.

Something else that we tried was using even more dependent subqueries:

EXPLAIN
SELECT t1.*
FROM table_2 t2
INNER JOIN table_1 t1 ON t1.id = (
  SELECT max(t3.id)
  FROM table_1 t3 
  WHERE t3.c1 = t2.id AND t3.c2 = (
	  SELECT max(t4.c2)
	  FROM table_1 t4 
	  WHERE t4.c1 = t2.id  
	)
);

which shows:

# id, select_type, table, partitions, type, possible_keys, key, key_len, ref, rows, filtered, Extra
'1', 'PRIMARY', 't2', NULL, 'index', NULL, 'PRIMARY', '4', NULL, '3', '100.00', 'Using index'
'1', 'PRIMARY', 't1', NULL, 'eq_ref', 'PRIMARY', 'PRIMARY', '4', 'func', '1', '100.00', 'Using where'
'2', 'DEPENDENT SUBQUERY', 't3', NULL, 'ref', 'table1Index', 'table1Index', '8', 'test.t2.id,func', '2', '100.00', 'Using where; Using index'
'3', 'DEPENDENT SUBQUERY', 't4', NULL, 'ref', 'table1Index', 'table1Index', '4', 'test.t2.id', '100', '100.00', 'Using index'

In our case with millions of rows in table_1 this seems to perform better than using the LIMIT 1 way, probably due to the LIMIT 1 way is doing Using where; Using index; Using filesort which is very slow.

As can be seen in this explain it is still showing that it is expecting to examine 100 rows each time for t4.

My thoughts are that MySQL should have an efficient way to get the last row of each group.

For example:
If we have 1,000,000 records then for group 1 it should be able to examine approximately a logarithmic number of rows, starting at the page of the 500,000th record for example and then checking to see if it then needs to look at the page with the 250,000th record or the one with the 750,000th record (all still using the index). It should be searching for a slot that is always greater than group 1 but less than group 2. Once it has this spot it knows it just needs the record before this spot. This should basically be the same as when MySQL needs to insert a record into an index.

Hi,

First of all, I do not know of any RDBMS system that has LAST() function. Can you please send us an URL that explains the functioning of that function. You can get functionality that you desire by different SQL constructs, which are explained in many SQL textbooks.

Second, "using where; using index; filesort" is in many, many cases the most efficient choice.

Third, dependent nested queries are forcing that for all filtered rows from the outer table(s) a nested query is executed, each time. Number of dependent nested queries is lesser problem then number of rows for each nested query to be re-run. That is how SQL is defined. You can read about it in SQL2011 standard.

Fourth, I do assure you that LIMIT 1 does not slow anything, except in few cases, like when there is actually one row in the result set.

There are lots of URLs that explain how last() would function such as:
https://stackoverflow.com/questions/1313120/retrieving-the-last-record-in-each-group
simply getting the last row of each group. last() just seemed to me like an obvious choice for a function name in MySQL consistent with other MySQL function names.

I know of many different ways of getting these results, and the link above also links to many pages where people are getting the results. The question is how to do this efficiently in MySQL. As many people in those links have also noted many solutions do not scale well.

Agreed, that often filesort can be the most efficient choice, however in our case with millions of rows it performs very slowly. The slow query log shows that it is examine millions of rows.

I also agree with your statement amount the number of dependent subqueries - I often find that these can perform well when they do not need to examine lots of rows.

I have not tested, but it sounds like we might be able to get the performance that we need when MySQL 8.0 comes out by use of the new windowing functions (see Bill Karwin's answer in that post). That has been a common way of doing this in other RDBMS and so I am happy to see that it has made it into MySQL.

First of all, MySQL 5.7 is a stable GA version which will not receive any major features.

Second, SQL language features that are not according to SQL standard will hardly find their way into MySQL.

Third, our 8.0 version already has SQL compatible syntax for acquiring the desired result.

Basis on all of these, I do not consider this as a bug nor as a valid feature request.

I have posted a more detailed answer to this on https://stackoverflow.com/questions/1313120/retrieving-the-last-record-in-each-group-mysql...

On that page I have explained how MySQL could hypothetically do this very quickly by use of the last() function.

I have now tested all of these queries on 8.0.4-rc including the new PARTITION way that was not available in 5.7.21 and none of them are efficient. As can be seen my original LIMIT 1 is still the most efficient way of doing this in MySQL that we know of - I have tested everyone else's suggestions as well. That LIMIT 1 way is causing huge performance issues when we are dealing with tables with more than 60 million rows. 
In our cases we are getting back less than 2,000 rows and yet MySQL is examining more than 3.5 million rows each time. It could hypothetically just be examining a few thousand rows and executing in 0 seconds.

It doesn't need to be fixed by use of a last() function or even make it into 5.7, just some way that we can get the performance that we need out of MySQL.

Please reconsider this feature request. We have clients paying to use the Enterprise Edition of MySQL. We need Oracle's support with this.

Hi,

Thank you for your analysis.

I think that we can accept this feature request for 8.0+.

Verified as a feature request.