| Bug #104789 | Equi-height histogram is not equi-height | ||
|---|---|---|---|
| Submitted: | 1 Sep 2021 12:33 | Modified: | 10 Feb 2022 1:15 |
| Reporter: | Sergei Petrunia | Email Updates: | |
| Status: | Closed | Impact on me: | |
| Category: | MySQL Server: Optimizer | Severity: | S3 (Non-critical) |
| Version: | 8.0.26 | OS: | Any |
| Assigned to: | CPU Architecture: | Any | |
[1 Sep 2021 12:34]
Sergei Petrunia
Forgot the two tables used to fill the dataset: create table ten(a int primary key); insert into ten values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9); create table one_k(a int primary key); insert into one_k select a.a + b.a* 10 + c.a * 100 from ten a, ten b, ten c;
[1 Sep 2021 13:18]
MySQL Verification Team
Hello Sergei, Thank you for the report and test case. regards, Umesh
[7 Sep 2021 12:41]
Sergei Petrunia
Another example:
create table t41 (a int);
insert into t41
select A.a
from
one_k A;
insert into t41 select
500 + A.a/100
from
one_k A, ten B
where
B.a <9;
insert into t41 select
500 + A.a/100
from
one_k A, ten B
;
This generates this dataset:
select a,count(*) from t41 group by a order by a;
+------+----------+
| a | count(*) |
+------+----------+
| 0 | 1 |
| 1 | 1 |
...
| 498 | 1 |
| 499 | 1 |
| 500 | 951 |
| 501 | 1901 |
| 502 | 1901 |
| 503 | 1901 |
| 504 | 1901 |
| 505 | 1901 |
| 506 | 1901 |
| 507 | 1901 |
| 508 | 1901 |
| 509 | 1901 |
| 510 | 951 |
| 511 | 1 |
| 512 | 1 |
...
| 998 | 1 |
| 999 | 1 |
+------+----------+
analyze table t41 update histogram on a with 20 buckets;
select json_pretty(histogram) from information_schema.column_statistics where table_name='t41';
{
"buckets": [
[ 0, 500, 0.07255, 501 ],
[ 501, 501, 0.1676, 1 ],
[ 502, 502, 0.26265, 1 ],
[ 503, 503, 0.3577, 1 ],
[ 504, 504, 0.45275, 1 ],
[ 505, 505, 0.5478, 1 ],
[ 506, 506, 0.64285, 1 ],
[ 507, 507, 0.7379, 1 ],
[ 508, 508, 0.83295, 1 ],
[ 509, 509, 0.928, 1 ],
[ 510, 510, 0.97555, 1 ],
[ 511, 511, 0.9756, 1 ],
[ 512, 512, 0.97565, 1 ],
[ 513, 513, 0.9757, 1 ],
[ 514, 514, 0.97575, 1 ],
[ 515, 515, 0.9758, 1 ],
[ 516, 516, 0.97585, 1 ],
[ 517, 517, 0.9759, 1 ],
[ 518, 518, 0.97595, 1 ],
[ 519, 999, 1.0, 481 ]
],
"data-type": "int",
"null-values": 0.0,
"collation-id": 8,
"last-updated": "2021-09-07 12:07:17.788334",
"sampling-rate": 1.0,
"histogram-type": "equi-height",
"number-of-buckets-specified": 20
}
Taking the 3rd value from each array, substracting it from the previous value,
and multiplying by 20000 (number of rows in the table), and we get:
min max rows_in_bucket
0 500 1451
501 501 1901
502 502 1901
503 503 1901
504 504 1901
505 505 1901
506 506 1901
507 507 1901
508 508 1901
509 509 1901
510 510 951
511 511 1
512 512 1
513 513 1
514 514 1
515 515 1
516 516 1
517 517 1
518 518 1
519 999 481
rows_per_bucket
1451
1901
1901
1901
1901
1901
1901
1901
1901
1901
951
1
1
1
1
1
1
1
1
481
Thinking of the possible causes...
we have 20K rows, 20 buckets. this gives 1K rows/bucket
Doing a linear scan through the buckets, we put the values into the first
bucket until the value "500" exceeds the limit.
The subsequent values up to and including 510 take their own buckets.
Then, something odd happens: we have 9 buckets left and 490 values.
We start putting each row into its own bucket until we have one free bucket
where we put all the rows?
[10 Feb 2022 1:15]
Jon Stephens
Documented fix as follows in the MySQL 8.0.30 changelog:
MySQL supports the use of equiheight histograms to improve
selectivity estimates. Each bucket in an equiheight histogram
for a column should contain roughly the same number of values
(rows); keeping the buckets small helps minimize any error.
When constructing an equiheight histogram, too many values were
sometime placed in the same bucket, which could result in
substantial errors in selectivity estimation. We fix this by
introducing a new equiheight construction algorithm that
guarantees low error, and adapts to the distribution of the data
to make efficient use of its buckets. In addition, a new
estimator for the number of distinct values in histogram buckets
provides improved worst-case error guarantees.
Closed.

Description: I've encountered something that looks like a bug to me. Maybe, MySQL histograms were supposed to work this way, but I doubt this, so I'm filing this example here to bring it to MySQL developers' attention. In short: a histogram that is generated for certain non-uniform data distribution is not at all equi-height. How to repeat: # Test dataset create table t30 ( a varchar(100)); insert into t30 select 'aaa' from one_k limit 1; insert into t30 select 'bbb' from one_k limit 2; insert into t30 select 'eee' from one_k limit 4; insert into t30 select concat('fff-',a) from one_k limit 8; insert into t30 select concat('ggg-',a) from one_k limit 16; insert into t30 select concat('kkk-',a) from one_k limit 16; insert into t30 select concat('mmm-',a) from one_k limit 32; insert into t30 select 'nnn' from one_k limit 64; insert into t30 select 'ppp' from one_k limit 125; insert into t30 select 'www' from one_k limit 250; insert into t30 select 'zzz' from one_k limit 500; select a, count(*) from t30 group by a order by a; +--------+----------+ | a | count(*) | +--------+----------+ | aaa | 1 | | bbb | 2 | | eee | 4 | | fff-90 | 1 | | fff-91 | 1 | | fff-92 | 1 | | fff-93 | 1 | ... # more rows with count=1 here... ... | mmm-97 | 1 | | mmm-98 | 1 | | mmm-99 | 1 | | nnn | 64 | | ppp | 125 | | www | 250 | | zzz | 500 | +--------+----------+ 79 rows in set (0.02 sec) analyze table t30 update histogram on a with 20 buckets; select json_pretty(histogram) from information_schema.column_statistics where table_name="t30" and schema_name=database()\G Produces this output: https://gist.github.com/spetrunia/a79f114f6b89ff1fa961a02fb09caf0b Write a trivial script to decode the histogram: https://gist.github.com/spetrunia/6cf92913cce8d42eed67b5dec3b509f6 Run it on the above histogram and get this: https://gist.github.com/spetrunia/1eabb3340f818fbf70e558c8695891e8 Things of interest are: There are many buckets that contain just one value, for example: [ "b64dec:type254:mmm-90", "b64dec:type254:mmm-90", 0.068762278978389, 1 ], [ "b64dec:type254:mmm-91", "b64dec:type254:mmm-91", 0.06974459724950884, 1 ], But then, the two most popular values (750 rows out of 1000 have them) were put into the same bucket: [ "b64dec:type254:www", "b64dec:type254:zzz", 1.0, 2 ] That is: - values that occur once are in separate buckets - values that occur in 25% and 50% of the rows, are put into a shared bucket. this doesn't look like an equi-height histogram. Suggested fix: Distribute values between buckets so that bucket sizes are more uniform. at very least, put www and zzz into different buckets.