Bug #104038 Statement digest is not reliable with big queries
Submitted: 16 Jun 2021 12:55    Modified: 17 Jun 2021 6:24
Reporter: Kaiwang CHen (OCA)
Status: Verified
Category: MySQL Server: Parser    Severity: S4 (Feature request)
Version: 8.0    OS: Any
Assigned to:    CPU Architecture: Any

[16 Jun 2021 12:55] Kaiwang CHen
Description:
Background:

MySQL provides a digest facility (sql_digest.cc) to identify a group of queries that belong to a conceptual query template. The identity consists of a query digest and the current schema for unqualified identifiers.

The query digest is essentially an array of lexical token numbers and identifiers, with literals reduced to DIGEST-specific placeholder tokens. User-visible information, such as the digest hash (compute_digest_hash) and the digest text (compute_digest_text), is derived from this token array. The related user functions are statement_digest() and statement_digest_text().

The length of the token array roughly tracks the length of the query. Two system variables, max_digest_length and performance_schema_max_digest_length, cap the size of the token buffer (sql_digest_storage::m_token_array).

Problem:

If the token array exceeds the buffer size, it is truncated and only the prefix is stored. In that case, two query templates sharing the same prefix cannot be distinguished.

Although the failure to identify templates could be alleviated by raising the limits above the default (1024), both variables are read-only and the required size cannot be known in advance, which makes tuning them impractical in production systems.
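The collision can be modeled in a few lines (a toy sketch only, not the actual sql_digest.cc logic; the token encoding and the use of SHA-256 are illustrative assumptions):

```python
import hashlib

MAX_DIGEST_LENGTH = 4  # tiny limit, mirroring the repro below

def toy_digest(tokens):
    # Truncate the token array to the fixed buffer, as the server does,
    # then hash only the stored prefix.
    buf = bytes(tokens[:MAX_DIGEST_LENGTH])
    return hashlib.sha256(buf).hexdigest()

# Two different "queries" whose token streams share a 4-token prefix:
plus  = [10, 20, 30, 40, 1]
minus = [10, 20, 30, 40, 2]
print(toy_digest(plus) == toy_digest(minus))  # → True: the digests collide
```

Everything past the buffer limit is silently dropped, so the differing fifth token never reaches the hash.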

Benefit:

With a reliable statement identity,

1) the rewriter plugin would work better, because it relies on the digest hash for first-level matching;

2) per-statement aggregation of execution metrics in the performance schema (or similar workload-analysis modules) would become reliable;

3) the general or slow query log could support aggregation tools directly, without a custom digest facility such as the one in pt-query-digest.

Proposed solution:

See AppendJumble() in pg_stat_statements.c, a built-in extension for PostgreSQL. A jumble is a concept similar to a digest.
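For reference, the pg_stat_statements approach keeps a fixed-size buffer and, whenever it fills, folds the buffer into a hash that becomes the new buffer prefix, so arbitrarily long token streams still contribute to the final value. A rough Python sketch of that idea (the buffer size, 8-byte fold, and SHA-256 are illustrative assumptions, not PostgreSQL's actual values):

```python
import hashlib

JUMBLE_SIZE = 16  # fixed buffer; illustrative, not pg's actual size

class Jumble:
    def __init__(self):
        self.buf = bytearray()

    def append(self, item: bytes):
        # Modeled on AppendJumble(): copy as much as fits; when the
        # buffer is full, replace it with its own hash and continue.
        while item:
            if len(self.buf) >= JUMBLE_SIZE:
                self.buf = bytearray(hashlib.sha256(self.buf).digest()[:8])
            room = JUMBLE_SIZE - len(self.buf)
            self.buf += item[:room]
            item = item[room:]

    def digest(self):
        return hashlib.sha256(self.buf).hexdigest()
```

Because the buffer never grows beyond JUMBLE_SIZE, memory stays bounded, yet tokens past the limit are folded into the hash instead of being silently dropped.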

How to repeat:
To help reproduce the problem, a small limit is used in the examples below.

mysql> show variables like 'max_digest%';
+-------------------+-------+
| Variable_name     | Value |
+-------------------+-------+
| max_digest_length | 4     |
+-------------------+-------+
1 row in set (0.01 sec)

mysql> select statement_digest_text('select 1 + 1');
+---------------------------------------+
| statement_digest_text('select 1 + 1') |
+---------------------------------------+
| SELECT ?                              |
+---------------------------------------+
1 row in set (0.00 sec)

mysql> select statement_digest('select 1 + 1');
+------------------------------------------------------------------+
| statement_digest('select 1 + 1')                                 |
+------------------------------------------------------------------+
| d1b44b0c19af710b5a679907e284acd2ddc285201794bc69a2389d77baedddae |
+------------------------------------------------------------------+
1 row in set (0.00 sec)

mysql> select statement_digest_text('select 1 - 1');
+---------------------------------------+
| statement_digest_text('select 1 - 1') |
+---------------------------------------+
| SELECT ?                              |
+---------------------------------------+
1 row in set (0.00 sec)

mysql> select statement_digest('select 1 - 1');
+------------------------------------------------------------------+
| statement_digest('select 1 - 1')                                 |
+------------------------------------------------------------------+
| d1b44b0c19af710b5a679907e284acd2ddc285201794bc69a2389d77baedddae |
+------------------------------------------------------------------+
1 row in set (0.00 sec)

Suggested fix:
There is an alternative way to produce the hash with a fixed buffer size. Say the string to hash is "s1s2s3s4", where each sN is a substring that fits in the fixed buffer; the hash value h can then be computed as:

h1 = hash(s1)
h2 = hash(h1s2)
h3 = hash(h2s3)
h = hash(h3s4)
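Concretely, the chaining above can be written as a short function (a sketch only; SHA-256 stands in for whatever hash function the digest code would actually use):

```python
import hashlib

def chained_hash(chunks):
    # h1 = hash(s1); h2 = hash(h1 || s2); ...; h = hash(h_prev || s_n)
    h = b""
    for chunk in chunks:
        h = hashlib.sha256(h + chunk).digest()
    return h.hex()
```

Note that, unlike a true streaming hash, the result depends on where the chunk boundaries fall, which is exactly why the token-reduction mechanism makes boundary selection tricky.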

However, since the token buffer is reduced based on previously seen tokens, it may not be easy to decide the substring boundaries. I am not sure whether the reduce mechanism could be adapted, or replaced with a plain literal-to-placeholder strategy.
[17 Jun 2021 6:24] MySQL Verification Team
Hello Kaiwang,

Thank you for the feature request!

regards,
Umesh