MySQL Bugs: #28241: QCache stores dublicates

Bug #28241	QCache stores dublicates
Submitted:	4 May 2007 13:51	Modified:	16 May 2007 15:55
Reporter:	Martin Friebe (Gold Quality Contributor) (OCA)	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server	Severity:	S4 (Feature request)
Version:	5.0.42,5.1BK	OS:	Any
Assigned to:		CPU Architecture:	Any
Tags:	qcache

Description:
I put that in as a bug, as I think it is an uneccesary resource usage that should not occur.
On the other hand the documentation specifies that those duplicates can occur for various reasons, so please feel free to downgrade to a feature request.

Queries are cached together with the values of system variables that can affect the result. This is correct and important.

However for any query all possible system variables are stored, regardless if they would affect the query.

In the example the query does not use group_concat, a change of the variable can not change the result of this query. Yet it is stored twice.

Especially with timezones, if every connection uses a different timezone, this can have a huge effect, and prevent many valid hits. (and insert dublicates instead)

How to repeat:
create table t1 (a int);
insert into t1 values (1),(2),(3);
set GLOBAL query_cache_type=1;
set GLOBAL query_cache_limit=10000;
set GLOBAL query_cache_min_res_unit=0;
set GLOBAL query_cache_size= 100000;

flush status;

set LOCAL group_concat_max_len=2000;
select 1+7 from t1;
show status like 'Qcache_inserts';
#| Qcache_inserts | 1 |

set LOCAL group_concat_max_len=3000;
select 1+7 from t1;
show status like 'Qcache_inserts';
#| Qcache_inserts | 2 |

Suggested fix:
The problem appears to be that the query cache is checked for an existing query, before the query is parsed. This saves the time of parsing and saves a valuable amount of time. But it means that the mysql server does not have knowledge if timezone-based functions (or group_concat or others) are used.

The matter can be solved. And it would probably also pose a solution for the following ToDo comment from the source code in sql_cache.cc:
"- Delayed till after-parsing qache answer (for column rights processing)"

At the time of inserting a query in the cache it is already parsed. It would be possible to collect a list of all relevant system-variables during parsing.
Only those system-variables need to be stored with the query.

Currently it would not be possible to find a query without knowing which set of system variables are attached, because the system-variables are used inside the hash-value calculation to look up the query.

1) on query insert into cache:
- Calculate a hash of the "sql statement text" (ignoring any variables or metadata)
- Use this hash as pointer to "relevant variables list" and store an entry specifying the relevant system variables
- Calculate 2nd hash of the "relevant variables header" + the "query text"
(similare to what happens currently)
- use the 2nd hash to store the result as currently done

2) on lookup
- Calculate a hash of the "sql statement text"
(This can be done without the need to know which variables are relevant)
- Lookup the "relevant variables list"
(Because if the query text is equal, this list will also be equal to what we stored earlier)
- Calculate 2nd hash of the "relevant variables header" + the "query text"
- use the 2nd has to lookup the result as currently done

----

- "relevant variables list"
System variables could be represented as an enumerated list.
For each variable there would be a storage need of one byte. Storing the enum value of the variable.

- "relevant variables header"
This would replace the current Query_cache_query_flags structure (which always contains all values)

resultset := rel_header_list , result_of_query
rel_header_list := variable_value [, variable_value [, variable_value ...]]

The order of the varaible values, is given by the "relevant variables list".
The length is known for each variable, or must be stored together with the value.

This structure would be of variable length. Yet its maximum length is known (assuming all variables are contained). So mem allocation can always be done in one step, but the hash would only look at the bytes that are realy used.

There would be exactly the relevant system variables in the list. All relevant ones would always be present, but no others.

-----

"- Delayed till after-parsing qache answer (for column rights processing)"

relevant info could be stored either with the "relevant variables list" or the resultset

-----

This could be further extended to avoid dublicate storage of syntactical differnces with no functional relevance.
(But this is more theoretical, as it comes with a lot more performance expenses)

Example case-sensivity:
select a, "FOO" from t1;
SELECT a, "FOO" FROM t1;
SELECT a, "Foo" FROM t1; # functional different

The first hash could always be calculated from an all lower case sql-text.

The "relevant variables list" would be extended by a "mask of all case insensitive chars"

The 2nd hash would always be calculated from a copy of the sql text in which all and only the case-insensitive chars are made lowercase

Ths works as long as any case-sensitive data never affects the relevant system variables.
Otherwhise (and to be safe) this would require the further overhaead of a 3rd hash.

The procedure would then be.
-1st hash look up case-sensitivity)
-2nd hash look up relevant sys variables (using the query transformed according to the case-sensivity-mask)
-3rd has look up result

So adding case-insensitivity to the qcache may be to expensive.

Also creation of the case-insensitivity mask would be a lot of extra work for the parser.

Thank you for the bug report.

Dear Martin,
thank you for a valid feature request.
I added your request to WL#1468	"Query cache performance improvements"