Description:
We are using MySQL for web applications which demand response times of
at most a few seconds. To improve the performance we activated the query
cache but it turned out that this had very bad side effects.
The problem that ensued was the following: In our system there are long
periods (e.g., 8 hours) of read-only database use. The first few write
accesses afterwards take terribly long (3 minutes each) and MySQL
completely blocks all other queries during this time, using one CPU core
of the server in question fully, the other cores idling. This effect
occurs with a query cache of size 1 GB.
It was easy to trace this problem down to the query cache (as without
the query cache or with a radically smaller query cache the problem
disappeared completely), so as a workaround we currently use a much
smaller query cache of 32 MB only. This seems wasteful (the machine has
eight GB of RAM and the indexes of the database in question fit in less
than 1 GB).
Some infos on the system used: A Sun with two dual core Opterons,
2,4 GHz, under Solaris 10 x86, using MySQL built from source using gcc.
The server is running no other applications besides one instance of
MySQL. I have tested several MySQL versions from 4.1.x upto 5.0.27 that
all show this problematic behaviour.
How to repeat:
To reproduce the problem configure MySQL to use a query cache of 1 GB,
create a database containing a few tables with enough rows each that
several million different results can be retrieved by selects and fill
the query cache by issuing a few million different selects against these
tables. Take care that the results have a size of at most a few KB each
so that a large number of queries fit in the cache (around one million
would be a nice count). Also, the results should be of many different
sizes, say, all numbers between 500 and 1500 (Bytes). Then issue a
delete, insert or update against a table for which a lot of queries
have been cached and watch it take a very long time and block the server.
Suggested fix:
At the heart of the problem is the way that the query cache manages its
free memory. It uses lists of free blocks that are kept sorted by the
size of the blocks. Finding a free block for a new query is very fast
(a binary search for the right bin, and then accessing at most a few
list elements from the end of a linked list). But when freeing a block
it must be inserted in the sorted list at the right place which has a
runtime of O(n) in the number of list elements, furthermore reading
through random RAM locations thereby thrashing the CPU data cache, as
the list elements are stored in the free blocks themselves and not
localised.
While this works nicely as long as there is only a very small number
of free blocks, it breaks down completely when this is not the case.
I would like to argue (based on observation) that the latter situation
is very common and that it is important to optimize for it.
Please note also that the algorithm to find a free block seems to be
intended to find a "best fit", but it can't do that when a bin contains
more than 10 elements, leading to fragmentation. (For example, if a bin
contains blocks sized from 4096 to 5032 bytes, the list starting with,
say, 30 blocks of size 4096, 10 block of size 4104 and so on, ending
with 10 blocks of size 5032, a request for a block of size 4104 will be
honored with a block of size 5032.)
There are two important situations where the number of free blocks is
large, namely:
1. During normal use of the query cache after "warm up" a stable state
is reached where the cache is nearly full so that every new query
that is stored forces about one old one to be expunged. This leads
to fragmentation, so that a nonnegligible number of free blocks
ensues. I have measured in our setup that in this case (with a cache
size of 1 GB) we have:
Qcache_free_blocks 26586
Qcache_free_memory 96334344
Qcache_queries_in_cache 721064
Qcache_total_blocks 1471745
The free blocks are distributed unevenly over the 77 bins in this
case, the largest of the bins containing some 1000 entries.
2. When the query cache contains a large number of queries and a large
part of them is invalidated by an INSERT, UPDATE or DELETE statement.
I measured for example the following:
- The query cache contained about 700000 queries.
- An INSERT statement invalidated 200000 of them.
- The statement took 3 minutes to execute (during which time MySQL
completely blocked all other queries, even completely unrelated
ones)
- Afterwards the query cache had some 200000 free blocks.
(Please note that the second situation can also be triggered by
RESET QUERY CACHE, which was noted in Bug #21051, the fix for which
unfortunately didn't go at the heart of the problem but only papered
over it).
To summarize, we get the following behaviour:
- During regular use, adding a query to the cache means also deleting
approximately one old one. This operation is O(n) in the number of
free blocks which is bad as it means that CPU time is wasted and a
lock is held unnecessarily long.
- If a large number of queries is being invalidated at once, this
operation takes O(n^2) in the number of invalidated queries which is
very bad as it blocks MySQL completely during this time.
Suggested fix:
Please change the free block management algorithm to one that is much
faster in the general case and in the worst case.
I can think of a very simple change that would kill two birds with one
stone, namely making the insertion of free blocks much faster and
at the same time reducing fragmentation: Keep the lists in the bins
unsorted. This would make both insertion and deletion O(1).
When searching for a new block just try a few elements from one end of
the list of a suitable bin and take the one that fits best. (If the bin
contains only a few elements this would find the same block as the
current algorithm. If the bin contains many elements this should allow
more requests to find better fitting blocks than currently.)
IMHO, no algorithm for insertion and deletion of free blocks that is
slower than O(1) in the number of free blocks should be used.
Even with an O(1) algorithm, I would estimate, it would take nearly a
second to invalidate 200000 of 700000 queries on our hardware. This
is just about acceptable for our soft real time requirements. Even
better would be if even in this case the lock is held for a shorter
time, say one millisecond at the most. But that would surely require
much bigger changes to the query cache.
If needed, I can provide more details of the measurements that I made.
Please feel free to ask.
Kind regards and many thanks for providing MySQL!