Bug #104040 Update histogram with json data
Submitted: 16 Jun 2021 13:26 Modified: 28 Jun 10:10
Reporter: Kaiwang CHen (OCA) Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Optimizer Severity:S4 (Feature request)
Version:8.0 OS:Any
Assigned to: CPU Architecture:Any

[16 Jun 2021 13:26] Kaiwang CHen
Description:
Histogram was introduced in 8.0 to support estimating filter factor, however, gathering histogram is a heavy operation.

A workaround may be gathering histograms in a separated standby machine then setting the constructed histogram directly in the production machine. However, MySQL currently does not support setting histograms directly.

How to repeat:
1. Gather histogram in a standby machine with

ANALYZE TABLE t UPDATE HISTOGRAM ON c1;    -- needs write permission
SELECT * FROM information_schema.column_statistics;

or,

SELECT histogram(c1) FROM t;   -- read-only, not supported yet
SELECT c1 FROM t;  -- read-only, construct histogram elsewhere

2. set the histogram in a production machine (this issue).

Suggested fix:
Add a new syntax (set histogram):

ANALYZE TABLE t UPDATE HISTOGRAM ON c1 USING DATA '...';

Existing syntax (gather histogram):

ANALYZE TABLE t UPDATE HISTOGRAM ON c1,... WITH n BUCKETS;

Some notes:

1. JSON in the SQL is standard json rather than binary json, so it needs compatible json_to_histogram() functions.
2. some fundamental structural checks against the input json.
3. values in the input json should conform to column definition, although it is almost impossible to check real meaning.

Example:

mysql> analyze table t update histogram on c1 using data '{"buckets": [[1, 0.3333333333333333], [9, 0.6666666666666666], [10, 1.0]], "data-type": "int", "null-values": 0.0, "collation-id": 8, "last-updated": "2021-06-08 14:19:53.088572", "sampling-rate": 1.0, "histogram-type": "singleton", "number-of-buckets-specified": 100}';
+--------+-----------+----------+-----------------------------------------------+
| Table  | Op        | Msg_type | Msg_text                                      |
+--------+-----------+----------+-----------------------------------------------+
| test.t | histogram | status   | Histogram statistics created for column 'c1'. |
+--------+-----------+----------+-----------------------------------------------+
1 row in set (0.02 sec)

mysql> select * from information_schema.column_statistics\G
*************************** 1. row ***************************
SCHEMA_NAME: test
 TABLE_NAME: t
COLUMN_NAME: c1
  HISTOGRAM: {"buckets": [[1, 0.3333333333333333], [9, 0.6666666666666666], [10, 1.0]], "data-type": "int", "null-values": 0.0, "collation-id": 8, "last-updated": "2021-06-16 13:04:58.234165", "sampling-rate": 1.0, "histogram-type": "singleton", "number-of-buckets-specified": 100}
1 row in set (0.00 sec)
[17 Jun 2021 6:24] MySQL Verification Team
Hello Kaiwang,

Thank you for the feature request!

regards,
Umesh
[21 Sep 2021 1:33] casa zhang
Hi,

This is really an useful feature. Besides offload of histogram gathering tasks, it helps tools to migrate optimization context between servers.

I have a patch for it. Note that this patch depends on that of Bug#104108 to be stable on tests, so it is also included from reference.
[21 Sep 2021 1:37] casa zhang
ANALYZE TABLE UPDATE HISTOGRAM USING DATA

(*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: 0001-Bug-104040-ANALYZE-TABLE-UPDATE-HISTOGRAM-USING-DATA.patch (application/octet-stream, text), 230.51 KiB.

[28 Jun 10:10] Jon Stephens
Fixed in MySQL 8.0.31 as part of WL#15123. See same for docs info.

Closed.
[20 Jul 15:30] Frederic Descamps
Thank you for your contribution !

As you could read this has been included in MySQL 8.0.31.

Cheers
[12 Oct 15:10] Jon Stephens
Documented fix as follows in the MySQL 8.0.31 changelog:

    It is now possible to set a column histogram to a user-specified
    JSON value. This can be useful when sampling leaves out
    important values. This also means that a secondary (replica)
    MySQL server can assume the work of sampling the data and
    building the histogram, which can then be used on the primary
    (source) without impacting its performance.
    
    This capability is provided through an extension to ANALYZE
    TABLE ... UPDATE HISTOGRAM. The following statement sets the
    histogram for column col_name of table tbl_name to the
    histogram's JSON representation json_data:
    
ANALYZE TABLE tbl_name 
UPDATE HISTOGRAM ON col_name
USING DATA 'json_data'

    You can confirm afterwards this by checking the value of the
    HISTOGRAM column of the Information Schema COLUMN_STATISTICS
    table.
    
    See Histogram Statistics Analysis, for more information and
    examples.
    
    Our thanks to Kaiwen Chen for this contribution to MySQL.

Also updated "ANALYZE TABLE Syntax" in the 8.0 Manual.

(Already closed.)