Description:
Assume we have a large table with primary key "id".
The following query is very fast as expected:
```
SELECT MAX(id) FROM my_table
```
Often, you want to find the max ID with a WHERE condition. For example, on a table with a created_at timestamp but no index on created_at you might want to find the ID corresponding to a certain timestamp.
What we'd like for the query engine to do is scan backwards from the highest ID and short-circuit once it finds the first ID matching the where clause. This will give very good performance if the where clause is met early in the scan and degrade to a full scan if it doesn't.
The naive way to do this is:
```
SELECT MAX(id) FROM my_table WHERE created_at < '...'
```
However, this query does not perform optimally; it's performance is unchanged if MAX is replaced with an aggregate like SUM which does not permit short-circuiting.
This less-obvious query does use the desired algorithm and is significantly faster on real use-cases:
```
SELECT id FROM my_table WHERE created_at < '...' ORDER BY id DESC LIMIT 1
```
How to repeat:
The performance difference reproduces reliably for me on large tables (e.g. 2.5MM rows), with the ORDER BY LIMIT 1 variant being ~20x faster than MAX for my use-case.
Here's a self-contained reproduction:
```
drop table temp_numbers;
create table temp_numbers as (
select 1 as n union all select 2 union all select 3 union all select 4 union all select 5
union all select 6 as n union all select 7 union all select 8 union all select 9 union all select 10);
drop temporary table temp_repro;
create temporary table temp_repro (
id bigint unsigned not null auto_increment primary key,
n bigint not null,
n2 bigint not null,
n3 bigint not null,
rand double not null);
insert into temp_repro (n, n2, n3, rand)
select n.n, n2.n, n3.n, rand()
from temp_numbers n
cross join temp_numbers n2
cross join temp_numbers n3
cross join temp_numbers n4
cross join temp_numbers n5
cross join temp_numbers n6;
select max(id) from temp_repro where rand < .01; -- reliably ~.25s
select id from temp_repro where rand < .01 order by id desc limit 1; -- reliably <= 0.016s
```
Suggested fix:
It would be great if the optimizer could recognize the common case of a MAX query on a primary key (or indexed column) and apply the same fast plan. This generalizes to MIN selects as well, but the desired scan direction should be reversed.
I don't know much about the engine internals, but perhaps simply recognizing the pattern `SELECT MIN/MAX(<expr>) FROM <table> WHERE <condition>` and, where <expr> is indexed, transforming to `SELECT <expr> FROM <table> WHERE <condition> ORDER BY <expr> [DESC] LIMIT 1` would suffice.
Description: Assume we have a large table with primary key "id". The following query is very fast as expected: ``` SELECT MAX(id) FROM my_table ``` Often, you want to find the max ID with a WHERE condition. For example, on a table with a created_at timestamp but no index on created_at you might want to find the ID corresponding to a certain timestamp. What we'd like for the query engine to do is scan backwards from the highest ID and short-circuit once it finds the first ID matching the where clause. This will give very good performance if the where clause is met early in the scan and degrade to a full scan if it doesn't. The naive way to do this is: ``` SELECT MAX(id) FROM my_table WHERE created_at < '...' ``` However, this query does not perform optimally; it's performance is unchanged if MAX is replaced with an aggregate like SUM which does not permit short-circuiting. This less-obvious query does use the desired algorithm and is significantly faster on real use-cases: ``` SELECT id FROM my_table WHERE created_at < '...' ORDER BY id DESC LIMIT 1 ``` How to repeat: The performance difference reproduces reliably for me on large tables (e.g. 2.5MM rows), with the ORDER BY LIMIT 1 variant being ~20x faster than MAX for my use-case. Here's a self-contained reproduction: ``` drop table temp_numbers; create table temp_numbers as ( select 1 as n union all select 2 union all select 3 union all select 4 union all select 5 union all select 6 as n union all select 7 union all select 8 union all select 9 union all select 10); drop temporary table temp_repro; create temporary table temp_repro ( id bigint unsigned not null auto_increment primary key, n bigint not null, n2 bigint not null, n3 bigint not null, rand double not null); insert into temp_repro (n, n2, n3, rand) select n.n, n2.n, n3.n, rand() from temp_numbers n cross join temp_numbers n2 cross join temp_numbers n3 cross join temp_numbers n4 cross join temp_numbers n5 cross join temp_numbers n6; select max(id) from temp_repro where rand < .01; -- reliably ~.25s select id from temp_repro where rand < .01 order by id desc limit 1; -- reliably <= 0.016s ``` Suggested fix: It would be great if the optimizer could recognize the common case of a MAX query on a primary key (or indexed column) and apply the same fast plan. This generalizes to MIN selects as well, but the desired scan direction should be reversed. I don't know much about the engine internals, but perhaps simply recognizing the pattern `SELECT MIN/MAX(<expr>) FROM <table> WHERE <condition>` and, where <expr> is indexed, transforming to `SELECT <expr> FROM <table> WHERE <condition> ORDER BY <expr> [DESC] LIMIT 1` would suffice.