MySQL Bugs: #104109: Equi-height histogram always estimate non-integral equality predicate to 1/rows

Bug #104109	Equi-height histogram always estimate non-integral equality predicate to 1/rows
Submitted:	24 Jun 2021 17:56	Modified:	13 Jan 2022 17:13
Reporter:	Kaiwang CHen (OCA)	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Optimizer	Severity:	S3 (Non-critical)
Version:	8.0.25	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
Equi-height histogram estimates equal_to with the following code.

template <class T>
double Equi_height<T>::get_equal_to_selectivity(const T &value) const {

  // ...

  return (bucket_frequency / found->get_num_distinct()) *
         found->value_probability();
}

The value probability is std::numeric_limits<double>::max() except for longlong and ulonglong, so it always gives an almost zero selectivity which is then capped to "1 / records" in calculate_condition_filter():

filter = max(filter, 1.0f / tab->records());

So it makes Equi-height histogram useless.

Here's a quick reference for value_probability():

template <class T>
double Bucket<T>::value_probability() const {
  /*
    As default, assume that the number of possible values between the lower and
    upper value of the bucket is VERY high.
  */
  if (values_are_equal(get_lower_inclusive(), get_upper_inclusive()))
    return 1.0;
  return get_num_distinct() / std::numeric_limits<double>::max();
}

template <>
double Bucket<longlong>::value_probability() const {
  return get_num_distinct() / (static_cast<double>(get_upper_inclusive()) -
                               get_lower_inclusive() + 1);
}

template <>
double Bucket<ulonglong>::value_probability() const {
  return get_num_distinct() / (static_cast<double>(get_upper_inclusive()) -
                               get_lower_inclusive() + 1);
}

How to repeat:
create table t (c1 double);

insert t values (1.0), (2.0), (3.0), (4.0), (5.0), (2.0), (2.0), (5.0), (5.0), (5.0);

analyze table t;
analyze table t update histogram on c1 with 2 buckets;

-- select * from information_schema.column_statistics;
{"buckets": [[1.0, 3.0, 0.5, 3], [4.0, 5.0, 1.0, 2]], "data-type": "double", "null-values": 0.0, "collation-id": 8, "last-updated": "2021-06-24 17:25:26.794012", "sampling-rate": 1.0, "histogram-type": "equi-height", "number-of-buckets-specified": 2}

mysql> explain select * from t where c1 = 2.0;
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key  | key_len | ref  | rows | filtered | Extra       |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-------------+
|  1 | SIMPLE      | t     | NULL       | ALL  | NULL          | NULL | NULL    | NULL |   10 |    10.00 | Using where |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-------------+
1 row in set, 1 warning (0.01 sec)

mysql> explain select * from t where c1 = 3.0;
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key  | key_len | ref  | rows | filtered | Extra       |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-------------+
|  1 | SIMPLE      | t     | NULL       | ALL  | NULL          | NULL | NULL    | NULL |   10 |    10.00 | Using where |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-------------+
1 row in set, 1 warning (0.00 sec)

mysql> explain select * from t where c1 = 5.0;
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key  | key_len | ref  | rows | filtered | Extra       |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-------------+
|  1 | SIMPLE      | t     | NULL       | ALL  | NULL          | NULL | NULL    | NULL |   10 |    10.00 | Using where |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-------------+
1 row in set, 1 warning (0.00 sec)

Suggested fix:
Use (bucket_frequency / found->get_num_distinct()) is common practice.

For example, PostgreSQL uses (1 - null_ratio - sum(frequency of most_common_values)) / number_of_distinct_regular_values.

In addition, it is better to treat abnormal values separately such as 5.0 in the repeated example. Note that Oracle has hybrid histogram, PostgreSQL has MCV part, and SQL Server also record occurrences for bucket endpoints.

Hello Kaiwang,

Thank you for the report and feedback.

regards,
Umesh

Documented fix as follows in the MySQL 8.0.29 changelog:

    The two types of histograms supported by MySQL, singleton and
    equi-height, previously stored their collections of buckets in a
    map and a set, respectively. This release changes both histogram
    types such that they now store buckets in dynamic arrays
    instead, which reduces space overhead and speeds up selectivity
    estimation due to reduced indirection when performing binary
    search on the buckets.

    Because both types of histograms are constructed by inserting
    buckets in sorted order, and buckets are serialized to JSON in
    sorted order, there is no additional work involved in switching
    to a dynamic array.

Closed.