Bug #25402 Optimisation replacing filesort for bigger result sets
Submitted: 4 Jan 2007 1:58 Modified: 18 Aug 2007 19:49
Reporter: Arjen Lentz Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: Optimizer Severity:S4 (Feature request)
Version:any OS:Any (any)
Assigned to: CPU Architecture:Any
Tags: filesort, GROUP BY, optimization, Optimizer, order by, sorting

[4 Jan 2007 1:58] Arjen Lentz
Description:
Filesort can become very slow when dealing with big result sets on a busy system (many active connections). Even if the set can be kept completely in memory, significant CPU time is used.

An effective workaround is using a temporary table with an index:
  CREATE TEMPORARY TABLE tmp (INDEX sortkey (sortkey,id))
         SELECT sortkey,id FROM maintable

  SELECT <allfields>
    FROM tmp FORCE INDEX (sortkey)
    JOIN maintable ON (tmp.id=maintable.id)
   ORDER BY sortkey

  DROP TEMPORARY TABLE tmp

Notes:
 - The above turns out to be many times faster, reducing query time from many minutes to a few seconds or even subsecond response.
 - This is true even with the overhead of creating/dropping the temp table and the additional queries.
 - In the above, server will generally note "Using index" (index scan) on its own, so FORCE INDEX not required.
 - In the above, the tmp table only contains minimal amount of data. Alternative is possible containing all data, with the second select merely retrieving the data. In this case FORCE INDEX may be necessary.

An in-server implementation of this alternative sorting method is suggested in "suggested fix" field.

How to repeat:
Benchmark queries using filesort with a big result set and many connections.

Suggested fix:
The abovementioned construct can be implemented in the server re-using B-Tree code from MyISAM, even using the key_buffer with writing to disk for very large result sets. (discussed with Brian Aker - he agrees)

The optimizer can find the sort key for the next step in the following way:
 - for initial row retrieval, check group by (if there is one), or order by.
 - for group by/having processin, check order by.

This method may be more effective than filesort even for smal(ler) result sets, but that would have to be tested. An appropriate selection mechanism can then be implemented in the optimizer. SQL_xxx flags to override may also be useful.
[4 Jan 2007 15:27] Valeriy Kravchuk
Thank you for a reasonable feature request.
[18 Aug 2007 0:51] Igor Babaev
- The proposals and the arguments of the reporter are quite disputable: apparently the reporter forgets about sorting together with the output data. 

I move the case to 'Not a bug'.
[18 Aug 2007 1:59] Arjen Lentz
Hi Igor. Thanks for your comment.
Please note that this was a feature request already, so declaring it "not a bug" seems a bit superflous.
Also, you may have missed the intent of the trick. The output is indeed sorted, and it's sorted by indexing the intermediate result set. And as the original report indicates, Brian Aker thought the idea has merit.

It would be great to not see it just dismissed like this.
Thanks