MySQL Bugs: #25402: Optimisation replacing filesort for bigger result sets

Bug #25402	Optimisation replacing filesort for bigger result sets
Submitted:	4 Jan 2007 1:58	Modified:	18 Aug 2007 19:49
Reporter:	Arjen Lentz	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: Optimizer	Severity:	S4 (Feature request)
Version:	any	OS:	Any (any)
Assigned to:		CPU Architecture:	Any
Tags:	filesort, GROUP BY, optimization, Optimizer, order by, sorting

Description:
Filesort can become very slow when dealing with big result sets on a busy system (many active connections). Even if the set can be kept completely in memory, significant CPU time is used.

An effective workaround is using a temporary table with an index:
  CREATE TEMPORARY TABLE tmp (INDEX sortkey (sortkey,id))
         SELECT sortkey,id FROM maintable

  SELECT <allfields>
    FROM tmp FORCE INDEX (sortkey)
    JOIN maintable ON (tmp.id=maintable.id)
   ORDER BY sortkey

  DROP TEMPORARY TABLE tmp

Notes:
 - The above turns out to be many times faster, reducing query time from many minutes to a few seconds or even subsecond response.
 - This is true even with the overhead of creating/dropping the temp table and the additional queries.
 - In the above, server will generally note "Using index" (index scan) on its own, so FORCE INDEX not required.
 - In the above, the tmp table only contains minimal amount of data. Alternative is possible containing all data, with the second select merely retrieving the data. In this case FORCE INDEX may be necessary.

An in-server implementation of this alternative sorting method is suggested in "suggested fix" field.

How to repeat:
Benchmark queries using filesort with a big result set and many connections.

Suggested fix:
The abovementioned construct can be implemented in the server re-using B-Tree code from MyISAM, even using the key_buffer with writing to disk for very large result sets. (discussed with Brian Aker - he agrees)

The optimizer can find the sort key for the next step in the following way:
 - for initial row retrieval, check group by (if there is one), or order by.
 - for group by/having processin, check order by.

This method may be more effective than filesort even for smal(ler) result sets, but that would have to be tested. An appropriate selection mechanism can then be implemented in the optimizer. SQL_xxx flags to override may also be useful.

Thank you for a reasonable feature request.

- The proposals and the arguments of the reporter are quite disputable: apparently the reporter forgets about sorting together with the output data. 

I move the case to 'Not a bug'.

Hi Igor. Thanks for your comment.
Please note that this was a feature request already, so declaring it "not a bug" seems a bit superflous.
Also, you may have missed the intent of the trick. The output is indeed sorted, and it's sorted by indexing the intermediate result set. And as the original report indicates, Brian Aker thought the idea has merit.

It would be great to not see it just dismissed like this.
Thanks