Bug #64924 Cacheing long SELECT statement does too many prunes
Submitted: 10 Apr 2012 9:03 Modified: 11 May 2012 16:46
Reporter: Cedric Knight Email Updates:
Status: Duplicate Impact on me:
None 
Category:MySQL Server: Query Cache Severity:S3 (Non-critical)
Version:5.1.61 OS:Linux (2.6.32-5-amd64 #1 SMP)
Assigned to: Assigned Account CPU Architecture:Any

[10 Apr 2012 9:03] Cedric Knight
Description:
It's known that the query cache does not scale, partly because (a) cache invalidation is slow, (b) cache invalidation blocks all other queries.  A third aggravating factor appears to be that inserting into the cache a SELECT statement which is larger than the largest free block in the query cache prunes up to 40% of the cache, apparently freezing the server for anything from 10 seconds to 5 minutes.

How to repeat:
An 8-core 2GHz system with 64GB RAM had the following config:

| query_cache_limit            | 134217728  |
| query_cache_min_res_unit     | 4096       |
| query_cache_size             | 2147483648 |
| query_cache_type             | ON         |
| query_cache_wlock_invalidate | OFF        |

The 2GB query cache size is about 20 times larger than recommended and, used principally for a variety of CMS applications, results in about 30% of the cache unused and very fragmented:

| Qcache_free_blocks      | 203445    |
| Qcache_free_memory      | 672118376 |
| Qcache_hits             | 177959996 |
| Qcache_inserts          | 28239641  |
| Qcache_lowmem_prunes    | 10519351  |
| Qcache_not_cached       | 2622620   |
| Qcache_queries_in_cache | 601449    |
| Qcache_total_blocks     | 1448903   |

Under the above conditions, run a novel query longer than about 70000 characters such as 
SELECT SQL_CACHE * from database.table WHERE item_id IN (1, 2, 3, 4, 5, [...], 49999, 50000) LIMIT 0,30;

This frequently (provided the free space is sufficiently fragmented, for example if such a query has not been run for a few hours, it is usually repeatable) locks the server for anywhere from a few seconds to 2 minutes, with all queries piling up.  Once the query completes, the Qcache_free_memory may have approximately doubled.  FLUSH QUERY CACHE however, only takes about 5 seconds, and appears to protect against the issue.

Note this can happen if the query returns an empty set, and is regardless of the tables or index structure, and seems solely related to the length of the statement string.  Changing to SQL_NO_CACHE the query returns results in around 40ms and does not trigger the lock.

There appear to be cases of this in bug 60074, bug 40090 and the comments in http://www.docplanet.org/mysql/mysql-query-cache-in-depth/

Suggested fix:
Without looking at the code that inserts the statement string in the cache, I suspect it looks for a sufficiently long contiguous free block to store the SELECT statement in, and not finding one, requests freeing of an equivalent amount of least-recently-used queries, then looks again, still doesn't find suitable space, frees more cache and so on.

Possible fixes include storing the statement string in the same way as the query results (at the cost of additional complexity); or at some stage in the above process compact the free space automatically; or passing a parameter to the routine that frees blocks that the target is to free x amount of contiguous space; improving the pruning efficiency; or perhaps simply bailing out and not inserting the query in the cache.

Making the pruning code more efficient has been suggested in bug 47529.  It may be that fixes to reopened bug 43758 and bug 21074 were incomplete.

It has also been suggested in bug 21051, bug 37844, and bug 54584 that the lock could be more granular or specific or the cache could be partitioned; or that when it is locked, queries should simply bypass the cache to mitigate the effect (Note also that as in bug 21051, FLUSH QUERY CACHE with query_cache_type=0 still appears to block other threads.)

Workarounds: use SQL_NO_CACHE for things like specific searches; flush query cache every few hours; reduce size of query cache.
[10 May 2012 23:58] Sveta Smirnova
Not fully repeatable with 5.5 for me: query runs longer when query cache involved, but I am not sure it spends time freeing it. It also does not block any parallel query.

SHOW PROFILE:

Status	Duration
starting	0.000566
Waiting for query cache lock	0.000011
checking query cache for query	0.316990
checking permissions	0.000022
Opening tables	0.009666
System lock	0.000023
Waiting for query cache lock	0.025033
init	0.060156
optimizing	0.024038
statistics	216.468186
preparing	106.197258
executing	0.000007
Sending data	0.029568
end	0.000012
query end	0.000007
closing tables	0.000046
freeing items	0.002317
Waiting for query cache lock	0.000013
freeing items	0.000032
Waiting for query cache lock	0.000003
freeing items	0.000003
storing result in query cache	0.000007
logging slow query	0.000003
logging slow query	0.015857
cleaning up	0.000096

real	5m23.577s
user	0m0.029s
sys	0m0.009s
[11 May 2012 11:08] Sveta Smirnova
Thank you for the report.

I can not repeat what you described exactly: SELECT query does not block other queries and spends most of time in "Statistics" and "Preparing", although there is slow down if compare with not cached version. So I'd call this one as duplicate of bug #37844 for now.

Please check this issue in your environment when bug #37844 and if that global fix does not fix this problem reopen the report.
[11 May 2012 16:46] Cedric Knight
Thanks for attempting to reproduce the bug.  It isn't straightforward to test.  Did you manage to get a similar ratio of Qcache_free_blocks to Qcache_free_memory before testing the long SELECT?

The characteristics of what I described as blocking other threads include that one CPU uses 100%, while others are unused (as shown by mpstat), and "show full processlist" shows a large number of active queries with the query concerned as in state "NULL", and subsequent SELECT statements as "Locked" and INSERTs as "update" until the first query completes.  (These are 5.1 states; I think 5.5 gives a bit more information.)  

Is there actually progress on bug 37844?  I wondered if looking at the insertion code in the above ways might mitigate the problem in a different way.