Description:
When optimizer evaluates various plans, at every stage a new table is added to a plan, criteria that contributes to condition filtering are found, and they are given heuristics values or histogram (from 8.0 if available) or other ways, to find how many rows pass the criteria. But it does not perform range reduction that is performed while accessing through index in range optimization, which causes to over estimate some times by double dipping.
How to repeat:
In a query like below
```sql
select
e.emp_no,
concat(e.first_name, ' ', e.last_name),
d.dept_name,
t.title_name,
es.salary,
es.insurance,
es.pf,
ea.city,
ea.state,
ea.phone
from employee e
join emp_salary es on e.emp_no = es.emp_no
join emp_title et on et.emp_no = e.emp_no
join title t on t.title_no = et.title_no
join emp_address_phone ea on ea.emp_no = e.emp_no
join emp_dept ed on e.emp_no = ed.emp_no
join department d on d.dept_no = ed.dept_no
where
(e.hire_date > '2020-01-01' or e.hire_date < '1972-01-01' or e.hire_date < '1971-01-01')
and (t.title_created > '2006-01-01')
and (ea.country = 'Spain' or ea.country = 'Samoa' or ea.country = 'India');
```
Access happens through the criteria on `ea.country`, and when `e` is added to join order, condition filtering is applied for the criteria on `e.hire_date`.
When there is histogram (in version 8.0):
"filtering_effect": [
{
"condition": "(`e`.`hire_date` > DATE'2020-01-01')",
"histogram_selectivity": 0.0085
},
{
"condition": "(`e`.`hire_date` < DATE'1972-01-01')",
"histogram_selectivity": 0.0426
},
{
"condition": "(`e`.`hire_date` < DATE'1971-01-01')",
"histogram_selectivity": 0.0204
}
],
"final_filtering_effect": 0.0701
The value (0.0701) of `final_filtering_effect` comes from the probability formula, i.e.
final_filtering_effect = 0.0085 + 0.0426 + 0.0204 - (0.0085 * 0.0426) - (0.0426 * 0.0204) - (0.0085 * 0.0204) + (0.0085 * 0.0426 * 0.0204)
final_filtering_effect = 0.0701
In collates the probability of all the ranges, without considering to reduce range. This causes double dipping of number of rows for the values that are less than '1971-01-01' as they are included twice.
If the `final_filtering_effect` was calculated only using the reduced range, the value would have been
final_filtering_effect = 0.0085 + 0.0426 - (0.0085 * 0.0426)
final_filtering_effect = 0.0507
There are total of `10000` rows in the relation, hence the condition seems to filter around `507` rows, and if we see the real count, that seems to align right. It is `506` and not `701`.
mysql > select count(*) from employee where hire_date > '2020-01-01' or hire_date < '1972-01-01' or hire_date < '1971-01-01';
+----------+
| count(*) |
+----------+
| 506 |
+----------+
When there is no histogram, just going on with heuristics of 33.33 percent for `col [non eq op] val` criteria, the value is double dipped,
"filtering_effect": [
],
"final_filtering_effect": 0.7037
Suggested fix:
Performing possible range reduction and simplification on the fields that have more than one criterion, that is independent.