Bug #119531 Optimizer should use a short-circuit DESC scan for SELECT MAX(PK) WHERE ...
Submitted: 9 Dec 14:40 Modified: 11 Dec 8:27
Reporter: Michael Adelson Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: Optimizer Severity:S5 (Performance)
Version:8.4.6 OS:Any
Assigned to: CPU Architecture:Any
Tags: MAX, optimization, query plan, query planner

[9 Dec 14:40] Michael Adelson
Description:
Assume we have a large table with primary key "id".

The following query is very fast as expected:
```
SELECT MAX(id) FROM my_table
```

Often, you want to find the max ID with a WHERE condition. For example, on a table with a created_at timestamp but no index on created_at you might want to find the ID corresponding to a certain timestamp.

What we'd like for the query engine to do is scan backwards from the highest ID and short-circuit once it finds the first ID matching the where clause. This will give very good performance if the where clause is met early in the scan and degrade to a full scan if it doesn't.

The naive way to do this is:
```
SELECT MAX(id) FROM my_table WHERE created_at < '...'
```

However, this query does not perform optimally; it's performance is unchanged if MAX is replaced with an aggregate like SUM which does not permit short-circuiting.

This less-obvious query does use the desired algorithm and is significantly faster on real use-cases:
```
SELECT id FROM my_table WHERE created_at < '...' ORDER BY id DESC LIMIT 1
```

How to repeat:
The performance difference reproduces reliably for me on large tables (e.g. 2.5MM rows), with the ORDER BY LIMIT 1 variant being ~20x faster than MAX for my use-case.

Here's a self-contained reproduction:
```
drop table temp_numbers;
    create table temp_numbers as (
		select 1 as n union all select 2 union all select 3 union all select 4 union all select 5
        union all select 6 as n union all select 7 union all select 8 union all select 9 union all select 10);
    
    drop temporary table temp_repro;
    create temporary table temp_repro (
		id bigint unsigned not null auto_increment primary key, 
        n bigint not null,
        n2 bigint not null,
        n3 bigint not null,
        rand double not null);
        
    insert into temp_repro (n, n2, n3, rand)
    select n.n, n2.n, n3.n, rand()
    from temp_numbers n
    cross join temp_numbers n2
    cross join temp_numbers n3
    cross join temp_numbers n4
    cross join temp_numbers n5
    cross join temp_numbers n6;
    
    select max(id) from temp_repro where rand < .01; -- reliably ~.25s
    select id from temp_repro where rand < .01 order by id desc limit 1; -- reliably <= 0.016s
```

Suggested fix:
It would be great if the optimizer could recognize the common case of a MAX query on a primary key (or indexed column) and apply the same fast plan. This generalizes to MIN selects as well, but the desired scan direction should be reversed.

I don't know much about the engine internals, but perhaps simply recognizing the pattern `SELECT MIN/MAX(<expr>) FROM <table> WHERE <condition>` and, where <expr> is indexed, transforming to `SELECT <expr> FROM <table> WHERE <condition> ORDER BY <expr> [DESC] LIMIT 1` would suffice.
[11 Dec 8:27] Knut Anders Hatlen
Thanks for this feature request for the optimizer.

One thing to note is that for MIN(x), the suggested transformation is valid only if x is a non-nullable column, as ORDER BY x ASC would give you the rows with NULL in x first, and MIN is supposed to ignore NULL values according to the SQL standard, if I remember correctly.