Bug #104789  Equi-height histogram is not equi-height
Submitted: 1 Sep 2021 12:33     Modified: 10 Feb 2022 1:15
Reporter: Sergei Petrunia       Status: Closed
Category: MySQL Server: Optimizer    Severity: S3 (Non-critical)
Version: 8.0.26                 OS: Any
Assigned to:                    CPU Architecture: Any

[1 Sep 2021 12:33] Sergei Petrunia
Description:
I've encountered something that looks like a bug to me. Maybe MySQL histograms are supposed to work this way, but I doubt it, so I'm filing this example to bring it to the MySQL developers' attention.

In short: a histogram generated for a certain non-uniform data distribution is not at all equi-height.

How to repeat:
# Test dataset

create table t30 ( a varchar(100));

insert into t30 select 'aaa' from one_k limit 1;
insert into t30 select 'bbb' from one_k limit 2;
insert into t30 select 'eee' from one_k limit 4;
insert into t30 select concat('fff-',a) from one_k limit 8;
insert into t30 select concat('ggg-',a) from one_k limit 16;
insert into t30 select concat('kkk-',a) from one_k limit 16;
insert into t30 select concat('mmm-',a) from one_k limit 32;
insert into t30 select 'nnn' from one_k limit 64;
insert into t30 select 'ppp' from one_k limit 125;
insert into t30 select 'www' from one_k limit 250;
insert into t30 select 'zzz' from one_k limit 500;

select a, count(*) from t30 group by a order by a;
+--------+----------+
| a      | count(*) |
+--------+----------+
| aaa    |        1 |
| bbb    |        2 |
| eee    |        4 |
| fff-90 |        1 |
| fff-91 |        1 |
| fff-92 |        1 |
| fff-93 |        1 |
...
# more rows with count=1 here...
...
| mmm-97 |        1 |
| mmm-98 |        1 |
| mmm-99 |        1 |
| nnn    |       64 |
| ppp    |      125 |
| www    |      250 |
| zzz    |      500 |
+--------+----------+
79 rows in set (0.02 sec)

analyze table t30 update histogram on a with 20 buckets;

select json_pretty(histogram)
from information_schema.column_statistics 
where table_name="t30" and schema_name=database()\G

Produces this output:
https://gist.github.com/spetrunia/a79f114f6b89ff1fa961a02fb09caf0b

Write a trivial script to decode the histogram:
https://gist.github.com/spetrunia/6cf92913cce8d42eed67b5dec3b509f6

Run it on the above histogram and get this:

https://gist.github.com/spetrunia/1eabb3340f818fbf70e558c8695891e8
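For reference, a minimal sketch of such a decoder (my outline, assuming the json_pretty output is saved to a file given as argv[1] and that string endpoints are stored as "base64:type254:<base64 data>"; the gist above has the script that actually produced the output):

# Minimal histogram decoder sketch.
import base64
import json
import sys

def decode_endpoint(v):
    # Decode "base64:typeNNN:<data>" string endpoints; pass other values through.
    if isinstance(v, str) and v.startswith("base64:"):
        _, typ, data = v.split(":", 2)
        return "b64dec:%s:%s" % (typ, base64.b64decode(data).decode())
    return v

with open(sys.argv[1]) as f:
    hist = json.load(f)

# Each equi-height bucket is [lower, upper, cumulative frequency, distinct values].
for lower, upper, cum_freq, ndv in hist["buckets"]:
    print([decode_endpoint(lower), decode_endpoint(upper), cum_freq, ndv])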

Each bucket is an array of [lower value, upper value, cumulative frequency, number of distinct values]. Things of interest:

There are many buckets that contain just one value, for example:

    [
      "b64dec:type254:mmm-90",
      "b64dec:type254:mmm-90",
      0.068762278978389,
      1
    ],
    [
      "b64dec:type254:mmm-91",
      "b64dec:type254:mmm-91",  
      0.06974459724950884,
      1
    ],

But then, the two most popular values (750 rows out of 1000 have them) were put
into the same bucket:

    [
      "b64dec:type254:www",
      "b64dec:type254:zzz",
      1.0,
      2
    ]

That is:
- values that occur once are in separate buckets
- values that occur in 25% and 50% of the rows are put into a shared bucket

This doesn't look like an equi-height histogram.

Suggested fix:
Distribute values between buckets so that bucket sizes are more uniform. At the very least, put www and zzz into different buckets.
[1 Sep 2021 12:34] Sergei Petrunia
Forgot the two tables used to fill the dataset:

create table ten(a int primary key);
insert into ten values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);

create table one_k(a int primary key);
insert into one_k select a.a + b.a* 10 + c.a * 100 from ten a, ten b, ten c;
[1 Sep 2021 13:18] MySQL Verification Team
Hello Sergei,

Thank you for the report and test case.

regards,
Umesh
[7 Sep 2021 12:41] Sergei Petrunia
Another example:

create table t41 (a int);
insert into t41
select A.a 
from 
  one_k A;

insert into t41 select 
  500 + A.a/100
from
  one_k A, ten B
where
 B.a <9;

insert into t41 select 
  500 + A.a/100
from
  one_k A, ten B
;

This produces the following dataset:

select a,count(*) from t41 group by a order by a;
+------+----------+
| a    | count(*) |
+------+----------+
|    0 |        1 |
|    1 |        1 |
...
|  498 |        1 |
|  499 |        1 |
|  500 |      951 |
|  501 |     1901 |
|  502 |     1901 |
|  503 |     1901 |
|  504 |     1901 |
|  505 |     1901 |
|  506 |     1901 |
|  507 |     1901 |
|  508 |     1901 |
|  509 |     1901 |
|  510 |      951 |
|  511 |        1 |
|  512 |        1 |
...
|  998 |        1 |
|  999 |        1 |
+------+----------+

analyze table t41 update histogram on a with 20 buckets;

select json_pretty(histogram) from information_schema.column_statistics where table_name='t41';

{
  "buckets": [
    [ 0, 500, 0.07255, 501 ],
    [ 501, 501, 0.1676, 1 ],
    [ 502, 502, 0.26265, 1 ],
    [ 503, 503, 0.3577, 1 ],
    [ 504, 504, 0.45275, 1 ],
    [ 505, 505, 0.5478, 1 ],
    [ 506, 506, 0.64285, 1 ],
    [ 507, 507, 0.7379, 1 ],
    [ 508, 508, 0.83295, 1 ],
    [ 509, 509, 0.928, 1 ],
    [ 510, 510, 0.97555, 1 ],
    [ 511, 511, 0.9756, 1 ],
    [ 512, 512, 0.97565, 1 ],
    [ 513, 513, 0.9757, 1 ],
    [ 514, 514, 0.97575, 1 ],
    [ 515, 515, 0.9758, 1 ],
    [ 516, 516, 0.97585, 1 ],
    [ 517, 517, 0.9759, 1 ],
    [ 518, 518, 0.97595, 1 ],
    [ 519, 999, 1.0, 481 ]
  ],
  "data-type": "int",
  "null-values": 0.0,
  "collation-id": 8,
  "last-updated": "2021-09-07 12:07:17.788334",
  "sampling-rate": 1.0,
  "histogram-type": "equi-height",
  "number-of-buckets-specified": 20
}

Taking the 3rd value (the cumulative frequency) from each array, subtracting the previous bucket's value from it, and multiplying by 20000 (the number of rows in the table), we get the following (a short sketch of this computation follows the tables):

min     max     rows_in_bucket
0	500	1451
501	501	1901
502	502	1901
503	503	1901
504	504	1901
505	505	1901
506	506	1901
507	507	1901
508	508	1901
509	509	1901
510	510	951
511	511	1
512	512	1
513	513	1
514	514	1
515	515	1
516	516	1
517	517	1
518	518	1
519	999	481

rows_per_bucket
1451
1901
1901
1901
1901
1901
1901
1901
1901
1901
951
1
1
1
1
1
1
1
1
481
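
The same computation as a small sketch (assuming the t41 histogram JSON is saved to a file named t41_hist.json; 20000 is the row count of t41):

# Derive rows per bucket from the cumulative frequencies above.
import json

with open("t41_hist.json") as f:
    hist = json.load(f)

total_rows = 20000
prev_cum = 0.0
print("min\tmax\trows_in_bucket")
for lower, upper, cum_freq, ndv in hist["buckets"]:
    rows_in_bucket = round((cum_freq - prev_cum) * total_rows)
    print("%s\t%s\t%d" % (lower, upper, rows_in_bucket))
    prev_cum = cum_freq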

Thinking of the possible causes...

We have 20K rows and 20 buckets, which gives 1K rows per bucket.
Doing a linear scan through the values, we put them into the first
bucket until adding the value "500" makes it exceed the limit.

The subsequent values up to and including 510 each take their own bucket.
Then something odd happens: we have 9 buckets left and 489 values.
We start putting each value into its own bucket until only one free bucket
is left, and then put all the remaining rows there?
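
For comparison, here is a sketch of the behaviour I would expect from an equi-height construction (my guess, not the actual server code): recompute the per-bucket target from the rows and buckets that are still left. On the t41 data this spreads the trailing count=1 values over the leftover buckets instead of producing eight 1-row buckets plus one 481-row bucket.

def expected_equi_height(value_counts, num_buckets, total_rows):
    # value_counts: list of (value, rows_with_that_value), sorted by value.
    buckets = []
    remaining_rows = total_rows
    lo = None
    rows_in_bucket = 0
    for value, count in value_counts:
        if lo is None:
            lo = value
        rows_in_bucket += count
        remaining_buckets = num_buckets - len(buckets)
        # Close the bucket once it reaches its share of the rows that are
        # not yet placed, unless this is the last remaining bucket.
        if rows_in_bucket >= remaining_rows / remaining_buckets and remaining_buckets > 1:
            buckets.append((lo, value, rows_in_bucket))
            remaining_rows -= rows_in_bucket
            lo = None
            rows_in_bucket = 0
    if lo is not None:
        buckets.append((lo, value, rows_in_bucket))
    return buckets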
[10 Feb 2022 1:15] Jon Stephens
Documented fix as follows in the MySQL 8.0.30 changelog:

    MySQL supports the use of equiheight histograms to improve
    selectivity estimates. Each bucket in an equiheight histogram
    for a column should contain roughly the same number of values
    (rows); keeping the buckets small helps minimize any error.

    When constructing an equiheight histogram, too many values were
    sometimes placed in the same bucket, which could result in
    substantial errors in selectivity estimation. We fix this by
    introducing a new equiheight construction algorithm that
    guarantees low error, and adapts to the distribution of the data
    to make efficient use of its buckets. In addition, a new
    estimator for the number of distinct values in histogram buckets
    provides improved worst-case error guarantees.

Closed.