Bug #99996 Prefer histogram over index statistics when eq_range_index_dive_limit is exceede
Submitted: 26 Jun 2020 7:58 Modified: 26 Jun 2020 8:42
Reporter: Øystein Grøvlen Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: Optimizer Severity:S4 (Feature request)
Version:8.0.20 OS:Any
Assigned to: CPU Architecture:Any

[26 Jun 2020 7:58] Øystein Grøvlen
Description:
If an IN list is longer than eq_range_index_dive_limit, index dives will be skipped and the range estimate will be based on index statistics.  The estimate will then be based on the average number of rows per value.  Histograms will give much more accurate estimates.  Lets consider this query:

    select avg(population) from city where countrycode IN ('CHN', 'IND', 'JPN');

We see that the number of rows that are accessed is 952:

mysql>  select sql_text, timer_wait/1000000000.0 "time (ms)", rows_affected, rows_sent, rows_examined from  performance_schema.events_statements_history order by timer_start desc limit 1;
+-----------------------------------------------------------------------------+-----------+---------------+-----------+---------------+
| sql_text                                                                    | time (ms) | rows_affected | rows_sent | rows_examined |
+-----------------------------------------------------------------------------+-----------+---------------+-----------+---------------+
| select avg(population) from city where countrycode IN ('CHN', 'IND', 'JPN') |    2.2094 |             0 |         1 |           952 |
+-----------------------------------------------------------------------------+-----------+---------------+-----------+---------------+

In this case, the index dive estimate is very accurate:

mysql> explain select avg(population) from city where countrycode IN ('CHN', 'IND', 'JPN');
+----+-------------+-------+------------+-------+---------------+-------------+---------+------+------+----------+-----------------------+
| id | select_type | table | partitions | type  | possible_keys | key         | key_len | ref  | rows | filtered | Extra                 |
+----+-------------+-------+------------+-------+---------------+-------------+---------+------+------+----------+-----------------------+
|  1 | SIMPLE      | city  | NULL       | range | CountryCode   | CountryCode | 3       | NULL |  952 |   100.00 | Using index condition |
+----+-------------+-------+------------+-------+---------------+-------------+---------+------+------+----------+-----------------------+

If disabling the index, so that estimates based on histograms are used, we see that the histogram estimate is also pretty accurate (23.34% of 4188 is 977):

mysql> explain select /*+ NO_INDEX(city) */ avg(population) from city where countrycode IN ('CHN', 'IND', 'JPN');
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key  | key_len | ref  | rows | filtered | Extra       |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-------------+
|  1 | SIMPLE      | city  | NULL       | ALL  | NULL          | NULL | NULL    | NULL | 4188 |    23.34 | Using where |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-------------+

If we set eq_range_index_dive_limit to 2 (smaller than the IN list of 3 elements), we get an estimate of 54:
mysql> explain select avg(population) from city where countrycode IN ('CHN', 'IND', 'JPN');
+----+-------------+-------+------------+-------+---------------+-------------+---------+------+------+----------+-----------------------+
| id | select_type | table | partitions | type  | possible_keys | key         | key_len | ref  | rows | filtered | Extra                 |
+----+-------------+-------+------------+-------+---------------+-------------+---------+------+------+----------+-----------------------+
|  1 | SIMPLE      | city  | NULL       | range | CountryCode   | CountryCode | 3       | NULL |   54 |   100.00 | Using index condition |
+----+-------------+-------+------------+-------+---------------+-------------+---------+------+------+----------+-----------------------+

The estimate is off because the distribution of values are not uniform (some countries have more big cities than others)

How to repeat:
With the world database:

analyze table city;
analyze table city update histogram on countrycode;

explain select avg(population) from city where countrycode IN ('CHN', 'IND', 'JPN');
explain select /*+ NO_INDEX(city) */ avg(population) from city where countrycode IN ('CHN', 'IND', 'JPN');

set eq_range_index_dive_limit = 2;

explain select avg(population) from city where countrycode IN ('CHN', 'IND', 'JPN');

Suggested fix:
In handler::multi_range_read_info_const() check if histogram selectivity can be used before resorting rec_per_key.
[26 Jun 2020 8:42] MySQL Verification Team
Hello Øystein,

Thank you for the feature request!

regards,
Umesh