Bug #31140 SELECT BLOB in query ordered by datetime improves performance
Submitted: 22 Sep 2007 4:36 Modified: 11 Dec 2007 6:22
Reporter: Darren Gordon Email Updates:
Status: Verified Impact on me:
Category:MySQL Server: Optimizer Severity:S4 (Feature request)
Version:5.0.45 OS:Any
Assigned to: CPU Architecture:Any
Tags: BLOB, performance increase, selecting a blob, unrequired blob
Triage: Triaged: D5 (Feature request)

[22 Sep 2007 4:36] Darren Gordon
Selecting a blob field in a query that is ordered by a datetime and has at least one where clause results in a performance increase between 20 and 45%.

Observe the following query:

SELECT id, bid, cid, enum1, bool1, bool2, bool3, 
	   magic_blob # unrequired null blob increases performance 
FROM  table1 
WHERE bool3 = 1 
LIMIT 0, 100

Tested where all values of magic_blob are null.  Removing magic_blob from the fields list actually decreases performance of the query by 22%.  In more complex queries, performance increases of up to 45% have been observed by selecting an empty blob (for no reason other than the performance increase).

A php script that creates and benchmarks test data is here: http://www.pastebin.ca/706687

How to repeat:
1. Create a test table
  `id` int(11) NOT NULL auto_increment,
  `bid` int(11) NOT NULL,
  `cid` int(11) NOT NULL,
  `enum1` ENUM('VAL1', 'VAL2', 'VAL3') default 'VAL3',
  `bool1` tinyint(1) default 0,
  `bool2` tinyint(1) default 0,
  `bool3` tinyint(1) default 0, 
  `stamp` datetime, 
  `magic_blob` blob, 
  PRIMARY KEY  (`id`) 

Populate table with data (ie 100,000 rows).

3. Benchmark this query:
SELECT id, bid, cid, enum1, bool1, bool2, bool3 
FROM  table1 
WHERE bool3 = 1 
LIMIT 0, 100

4. Benchmark this query:
SELECT id, bid, cid, enum1, bool1, bool2, bool3, magic_blob 
FROM  table1 
WHERE bool3 = 1 
LIMIT 0, 100

Expected Result:
The first query should be faster.

Actual Result:
The second query is faster.
[23 Sep 2007 6:13] Valeriy Kravchuk
Thank you for a problem report. What exact version, 5.0.x, you had used? Is this problem repeatable with 5.0.45?
[24 Sep 2007 9:18] Darren Gordon
It was discovered on 5.0.45 and has been repeated by others on various 5.0 versions, on various OS's.

Updated version.
[25 Sep 2007 13:23] Miguel Solorzano
Are you using query cache?. If yes please disable it and test again. Thanks
in advance.
[25 Sep 2007 18:17] Darren Gordon
Tested with non cached queries by setting query_cache_size to 0; and again with RESET QUERY CACHE.

Issue occurs in both cases.

Issue does not occur if the results are succesfully cached and reused.

[25 Sep 2007 18:29] Darren Gordon
In case it's helpful, I was reading this on Sergey Petrunia's blog; perhaps the extra blob triggers 'Mode 2' of filesort(), which ends up being faster than 'Mode 1' ?



filesort() has two modes of operation:

   1. Mode 1: the sorted elements contain all required columns of the source table. The result of the sorting is a linear sequence of output tuples, there is no need to access the source table after the sort is done.
   2. Mode 2: sort <sort_key, rowid> pairs and produce a sequence of rowids which one can use to get source table's rows in the required order (but this will be essentially hit the table in random order and is not very fast)

Mode 1 is used whenever possible. Mode [2] is used when mode1 is not applicable. This is the case when the sorted tuples have blobs or variable-length columns (TODO: check w/ evgen). Unfortunately, the EXPLAIN output provides no clue about which mode is used, so you'll have to manually check for blobs in the output columns list.
[8 Dec 2007 20:06] Valeriy Kravchuk
Thank you for a problem report and analysis. Yes, different filesort mode looks like a proper explanation. 

Do you agree to make this report a feature request for EXPLAIN to show filesort mode used?
[10 Dec 2007 10:57] Darren Gordon
There is still the issue that method 2 is not selected for this particular query, despite being faster - unless mysql is tricked into doing so by adding a blob to the query.  Perhaps this method is generally faster for dates and should be selected?

I understand it may be difficult to select the fastest sort method for every single scenario; so I agree that showing the filesort method in EXPLAIN would be useful for optimizing such queries.
[11 Dec 2007 6:22] Valeriy Kravchuk
I think, both considering all filesort implemetations for every query and mentioning selected one in EXPLAIN results are valid feature requests.