Description:
The equi-height histogram estimates the selectivity of an equal_to predicate with the following code:
template <class T>
double Equi_height<T>::get_equal_to_selectivity(const T &value) const {
  // ...
  return (bucket_frequency / found->get_num_distinct()) *
         found->value_probability();
}
value_probability() divides by std::numeric_limits<double>::max() for every type except longlong and ulonglong, so get_equal_to_selectivity() returns an almost-zero selectivity, which is then capped to "1 / records" in calculate_condition_filter():
filter = max(filter, 1.0f / tab->records());
The filter therefore always ends up at 1 / records, which makes the equi-height histogram useless for equality predicates on non-integer columns.
Here's a quick reference for value_probability():
template <class T>
double Bucket<T>::value_probability() const {
  /*
    As default, assume that the number of possible values between the lower and
    upper value of the bucket is VERY high.
  */
  if (values_are_equal(get_lower_inclusive(), get_upper_inclusive()))
    return 1.0;
  return get_num_distinct() / std::numeric_limits<double>::max();
}

template <>
double Bucket<longlong>::value_probability() const {
  return get_num_distinct() / (static_cast<double>(get_upper_inclusive()) -
                               get_lower_inclusive() + 1);
}

template <>
double Bucket<ulonglong>::value_probability() const {
  return get_num_distinct() / (static_cast<double>(get_upper_inclusive()) -
                               get_lower_inclusive() + 1);
}
How to repeat:
create table t (c1 double);
insert into t values (1.0), (2.0), (3.0), (4.0), (5.0), (2.0), (2.0), (5.0), (5.0), (5.0);
analyze table t;
analyze table t update histogram on c1 with 2 buckets;
-- select * from information_schema.column_statistics;
{"buckets": [[1.0, 3.0, 0.5, 3], [4.0, 5.0, 1.0, 2]], "data-type": "double", "null-values": 0.0, "collation-id": 8, "last-updated": "2021-06-24 17:25:26.794012", "sampling-rate": 1.0, "histogram-type": "equi-height", "number-of-buckets-specified": 2}
mysql> explain select * from t where c1 = 2.0;
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-------------+
| 1 | SIMPLE | t | NULL | ALL | NULL | NULL | NULL | NULL | 10 | 10.00 | Using where |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-------------+
1 row in set, 1 warning (0.01 sec)
mysql> explain select * from t where c1 = 3.0;
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-------------+
| 1 | SIMPLE | t | NULL | ALL | NULL | NULL | NULL | NULL | 10 | 10.00 | Using where |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-------------+
1 row in set, 1 warning (0.00 sec)
mysql> explain select * from t where c1 = 5.0;
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-------------+
| 1 | SIMPLE | t | NULL | ALL | NULL | NULL | NULL | NULL | 10 | 10.00 | Using where |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-------------+
1 row in set, 1 warning (0.00 sec)
Suggested fix:
Using (bucket_frequency / found->get_num_distinct()) alone, without the value_probability() factor, is common practice.
For example, PostgreSQL uses (1 - null_ratio - sum(frequency of most_common_values)) / number_of_distinct_regular_values.
In addition, it is better to treat heavily skewed values separately, such as 5.0 in the repeat case above. Note that Oracle has hybrid histograms, PostgreSQL keeps a most-common-values (MCV) list, and SQL Server records the number of occurrences of each bucket endpoint.