Bug #99690 Range reduction for criteria that contributes to condition filtering factor
Submitted: 26 May 2020 4:33 Modified: 3 Jun 2020 13:02
Reporter: Arun S R Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: Optimizer Severity:S4 (Feature request)
Version:8.0 OS:Any
Assigned to: CPU Architecture:Any

[26 May 2020 4:33] Arun S R
Description:
When optimizer evaluates various plans, at every stage a new table is added to a plan, criteria that contributes to condition filtering are found, and they are given heuristics values or histogram (from 8.0 if available) or other ways, to find how many rows pass the criteria. But it does not perform range reduction that is performed while accessing through index in range optimization, which causes to over estimate some times by double dipping. 

How to repeat:
In a query like below

```sql
select  
e.emp_no,  
concat(e.first_name, ' ', e.last_name),  
d.dept_name,  
t.title_name,  
es.salary,  
es.insurance,  
es.pf,  
ea.city,  
ea.state,  
ea.phone  
from employee e  
join emp_salary es on e.emp_no = es.emp_no  
join emp_title et on et.emp_no = e.emp_no  
join title t on t.title_no = et.title_no  
join emp_address_phone ea on ea.emp_no = e.emp_no  
join emp_dept ed on e.emp_no = ed.emp_no  
join department d on d.dept_no = ed.dept_no  
where  
(e.hire_date > '2020-01-01' or e.hire_date < '1972-01-01' or e.hire_date < '1971-01-01') 
and (t.title_created > '2006-01-01') 
and (ea.country = 'Spain' or ea.country = 'Samoa' or ea.country = 'India');
```

Access happens through the criteria on `ea.country`, and when `e` is added to join order, condition filtering is applied for the criteria on `e.hire_date`.

When there is histogram (in version 8.0):

"filtering_effect": [
  {
    "condition": "(`e`.`hire_date` > DATE'2020-01-01')",
    "histogram_selectivity": 0.0085
  },
  {
    "condition": "(`e`.`hire_date` < DATE'1972-01-01')",
    "histogram_selectivity": 0.0426
  },
  {
    "condition": "(`e`.`hire_date` < DATE'1971-01-01')",
    "histogram_selectivity": 0.0204
  }
],
"final_filtering_effect": 0.0701

The value (0.0701) of `final_filtering_effect` comes from the probability formula, i.e.

final_filtering_effect = 0.0085 + 0.0426 + 0.0204 - (0.0085 * 0.0426) - (0.0426 * 0.0204) - (0.0085 * 0.0204) + (0.0085 * 0.0426 * 0.0204)
final_filtering_effect = 0.0701

In collates the probability of all the ranges, without considering to reduce range. This causes double dipping of number of rows for the values that are less than '1971-01-01' as they are included twice.

If the `final_filtering_effect` was calculated only using the reduced range, the value would have been

final_filtering_effect = 0.0085 + 0.0426 - (0.0085 * 0.0426)
final_filtering_effect = 0.0507

There are total of `10000` rows in the relation, hence the condition seems to filter around `507` rows, and if we see the real count, that seems to align right. It is `506` and not `701`.

mysql > select count(*) from employee where hire_date > '2020-01-01' or hire_date < '1972-01-01' or hire_date < '1971-01-01';
+----------+
| count(*) |
+----------+
|      506 |
+----------+

When there is no histogram, just going on with heuristics of 33.33 percent for `col [non eq op] val` criteria, the value is double dipped, 

"filtering_effect": [
],
"final_filtering_effect": 0.7037

Suggested fix:
Performing possible range reduction and simplification on the fields that have more than one criterion, that is independent.
[3 Jun 2020 13:38] MySQL Verification Team
Sorry, can you send us all data for this test case ???/
[3 Jun 2020 15:13] Arun S R
Dump of data on which the query can be run. Please run in order so there is no foreign key error.

Attachment: Dump20200603.zip (application/zip, text), 1.34 MiB.

[4 Jun 2020 12:14] MySQL Verification Team
Thank you, Mr. Arun.