MySQL Bugs: #3341: Not Optimized Min()

Bug #3341	Not Optimized Min()
Submitted:	30 Mar 2004 21:47	Modified:	10 May 2004 10:46
Reporter:	Sergei Kulakov (Candidate Quality Contributor)	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server	Severity:	S3 (Non-critical)
Version:	3.23.58; 4.0.18	OS:	FreeBSD (Free BSD, Windows)
Assigned to:	Sergei Golubchik	CPU Architecture:	Any

Description:
This has been reproduced with:

-FreeBSD+MySQL 3.23.58
-Windows Me+MySQL 4.0.18 built from Source

Working with my stats tables, I was surprised to observe quite slow executon of a query that I thought would run optimized and fast. The table StatQuery to run against had the fileds:

Month Date, Type tinyint, Field mediumint, Count mediumint 
and a unique index (Month, Type, Field)

but the exact fields don't matter 'cause I was able to reproduce the result with a table that only has the first Month column. The query was 

Select Min(Month) From StatQuery Where Month>0 

and it was supposed to get the first non-zero month from the table. I keep a zero month as a special value of the last 30 days, and I need the first non-zero month to provide paging back and forth through months. The point is the query scanned the whole table (it contained about 40000 records and it took it about 0.9 sec to execute)! 

The explain said the number of rows to scan was all minus those that had Month=0, that is the query used the index to just find the first non-zero value and then ... it scanned the rest of the values using index to select min() among them while it obviously could just take the first one!

The query that emulates the right behaviour and at the same time provides the solution for me in this case is 

Select Month as Min From StatQueries Where Month>0 Order By Month Limit 1

'cause first it uses index to find the first non-zero month but then it just stops as soon as it finds 1 record, it executes at an instant. 

How to repeat:
In order to notice the difference in executon time the table has to have more rows, say 1000 or 10000 (the more, the better). I got a zipped one named "Month" at http://www.rlsnet.ru/Month.zip, its definition is

CREATE TABLE Month (
  Month date default NULL,
  KEY Month (Month)
)

Though it seems useless to store duplicate Month values, this table does demonstrate the phenomenon. Slow query:

Select Min(Month) as Min From Month Where Month>0

Explain:

table type possible_keys key key_len ref rows Extra 
Month range Month Month 4   9003 where used; Using index 

it executes 0.027 sec (it depends on the computer)

Fast query:

Select Month as Min From Month Where Month>0
Order By Month Limit 1

Explain is the same, the execution time is 0.001

Suggested fix:
Beyond my power. Check Max() as well

Sorry I wasn't able for to download the file:

http://www.rlsnet.ru/Month.zip.

I got message that the URL doesn't exist.

Unfortunately, I typed a comma after the URL and it was automatically converted into a link having the comma as the last character. So, you requested a file named "Month.zip," which obviuosly doesn't exist. The right URL is 

http://www.rlsnet.ru/Month.zip

(to prevent any possible errors, I put it on a separate line). The funny thing is that the link in your answer is correct, too 'cause it has a point after it. Click it and you'll get the file :)

I was able to repeat this on recent 4.0.19 code.  Thanks for your excellent bug report.

After downloading his table, I did "REPAIR TABLE Month USE_FRM" to restore an index file.

I did each query a few times, to verify his results.

Then I did FLUSH STATUS, and ran the min() query.  Then another FLUSH STATUS, and I ran the LIMIT query.  Here is a diff of the two:

tim@sand:m/40/m$ diff min-status.txt limit-status.txt       
6,7c6,7
< | Bytes_received           | 70    |
< | Bytes_sent               | 58    |
---
> | Bytes_received           | 88    |
> | Bytes_sent               | 63    |
87c87
< | Handler_read_next        | 9000  |
---
> | Handler_read_next        | 0     |
95c95
< | Key_read_requests        | 291   |
---
> | Key_read_requests        | 6     |
106c106
< | Questions                | 67    |
---
> | Questions                | 70    |
135c135
< | Uptime                   | 2231  |
---
> | Uptime                   | 2300  |

It is clear from the Handler_read_next that the min() query is not optimal.

Thank you for your bug report. This issue has already been fixed
in the latest released version of that product, which you can download at 
http://www.mysql.com/downloads/

Additional info:

Frankly speaking, I wasn't able to repeat the difference between the query with ORDER BY ... LIMIT 1 and without, but I agree MIN() handling is not perfect in 3.23 and 4.0 - it does scan the index.

Still, it's not a bug but a missing feature - missing optimization.
It was implemented in 4.1 already - your query (w/o order by ... limit 1) is executed in 4.1 by exactly one key lookup - no scan at all.

Thanks for the answers. I agree it's kind of a missing feature. But I believe any SQL driver should be able to use indices whenever possible, especially in trivial cases like this. Now and then I get queries for which MySQL doesn't use indices either, especially with backward scanning (it was reported). I mean this kind of missing feature is more important than say adding a function because it influences speed a lot, which I heard is of main priorities. MySQL also does not allow one get a previous value for an index (using HANDLE it is now possible, but I mean SELECT), only the next one. 

Yes, I can see the difference in using index in newer MySQL version - I got 4.0.1 and 4.0.18 at home and I had a query for which the first didn't use index while the second did (the reference was "const, const, tbl.fld" ). 
----------
REPAIR TABLE Month USE_FRM - why?! - I usually use just RESTORE TABLE ...
----------
I wasn't able to repeat the difference - where - in EXPLAIN or in execution time? If the second, the table is too small. If the first then maybe it was a newer version.

There's no big difference between REPAIR TABLE Month USE_FRM and RESTORE TABLE in this case.
So, both ways are correct :)

As for "previous value for an index" - MySQL can do it at least since ISAM (3.22.x) times. It is used for MAX/MIN optimization and (in 4.0) for optimizing ODER BY ... DESC.

Of course, I agree that MySQL "should be able to use indices whenever possible" - that's why this optimization is *already* implemented in 4.1 (not "will be" - it's done!). But it cannot be added to the stable branch 4.0, of course, as 4.0 is called "stable" exactly because we do not add new features into it.