Bug #104789 | Equi-height histogram is not equi-height | ||
---|---|---|---|
Submitted: | 1 Sep 2021 12:33 | Modified: | 10 Feb 2022 1:15 |
Reporter: | Sergei Petrunia | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Server: Optimizer | Severity: | S3 (Non-critical) |
Version: | 8.0.26 | OS: | Any |
Assigned to: | CPU Architecture: | Any |
[1 Sep 2021 12:33]
Sergei Petrunia
[1 Sep 2021 12:34]
Sergei Petrunia
Forgot the two tables used to fill the dataset: create table ten(a int primary key); insert into ten values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9); create table one_k(a int primary key); insert into one_k select a.a + b.a* 10 + c.a * 100 from ten a, ten b, ten c;
[1 Sep 2021 13:18]
MySQL Verification Team
Hello Sergei, Thank you for the report and test case. regards, Umesh
[7 Sep 2021 12:41]
Sergei Petrunia
Another example: create table t41 (a int); insert into t41 select A.a from one_k A; insert into t41 select 500 + A.a/100 from one_k A, ten B where B.a <9; insert into t41 select 500 + A.a/100 from one_k A, ten B ; This generates this dataset: select a,count(*) from t41 group by a order by a; +------+----------+ | a | count(*) | +------+----------+ | 0 | 1 | | 1 | 1 | ... | 498 | 1 | | 499 | 1 | | 500 | 951 | | 501 | 1901 | | 502 | 1901 | | 503 | 1901 | | 504 | 1901 | | 505 | 1901 | | 506 | 1901 | | 507 | 1901 | | 508 | 1901 | | 509 | 1901 | | 510 | 951 | | 511 | 1 | | 512 | 1 | ... | 998 | 1 | | 999 | 1 | +------+----------+ analyze table t41 update histogram on a with 20 buckets; select json_pretty(histogram) from information_schema.column_statistics where table_name='t41'; { "buckets": [ [ 0, 500, 0.07255, 501 ], [ 501, 501, 0.1676, 1 ], [ 502, 502, 0.26265, 1 ], [ 503, 503, 0.3577, 1 ], [ 504, 504, 0.45275, 1 ], [ 505, 505, 0.5478, 1 ], [ 506, 506, 0.64285, 1 ], [ 507, 507, 0.7379, 1 ], [ 508, 508, 0.83295, 1 ], [ 509, 509, 0.928, 1 ], [ 510, 510, 0.97555, 1 ], [ 511, 511, 0.9756, 1 ], [ 512, 512, 0.97565, 1 ], [ 513, 513, 0.9757, 1 ], [ 514, 514, 0.97575, 1 ], [ 515, 515, 0.9758, 1 ], [ 516, 516, 0.97585, 1 ], [ 517, 517, 0.9759, 1 ], [ 518, 518, 0.97595, 1 ], [ 519, 999, 1.0, 481 ] ], "data-type": "int", "null-values": 0.0, "collation-id": 8, "last-updated": "2021-09-07 12:07:17.788334", "sampling-rate": 1.0, "histogram-type": "equi-height", "number-of-buckets-specified": 20 } Taking the 3rd value from each array, substracting it from the previous value, and multiplying by 20000 (number of rows in the table), and we get: min max rows_in_bucket 0 500 1451 501 501 1901 502 502 1901 503 503 1901 504 504 1901 505 505 1901 506 506 1901 507 507 1901 508 508 1901 509 509 1901 510 510 951 511 511 1 512 512 1 513 513 1 514 514 1 515 515 1 516 516 1 517 517 1 518 518 1 519 999 481 rows_per_bucket 1451 1901 1901 1901 1901 1901 1901 1901 1901 1901 951 1 1 1 1 1 1 1 1 481 Thinking of the possible causes... we have 20K rows, 20 buckets. this gives 1K rows/bucket Doing a linear scan through the buckets, we put the values into the first bucket until the value "500" exceeds the limit. The subsequent values up to and including 510 take their own buckets. Then, something odd happens: we have 9 buckets left and 490 values. We start putting each row into its own bucket until we have one free bucket where we put all the rows?
[10 Feb 2022 1:15]
Jon Stephens
Documented fix as follows in the MySQL 8.0.30 changelog: MySQL supports the use of equiheight histograms to improve selectivity estimates. Each bucket in an equiheight histogram for a column should contain roughly the same number of values (rows); keeping the buckets small helps minimize any error. When constructing an equiheight histogram, too many values were sometime placed in the same bucket, which could result in substantial errors in selectivity estimation. We fix this by introducing a new equiheight construction algorithm that guarantees low error, and adapts to the distribution of the data to make efficient use of its buckets. In addition, a new estimator for the number of distinct values in histogram buckets provides improved worst-case error guarantees. Closed.