MySQL Bugs: #99690: Range reduction for criteria that contributes to condition filtering factor

Bug #99690	Range reduction for criteria that contributes to condition filtering factor
Submitted:	26 May 2020 4:33	Modified:	3 Jun 2020 13:02
Reporter:	Arun S R	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: Optimizer	Severity:	S4 (Feature request)
Version:	8.0	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
When optimizer evaluates various plans, at every stage a new table is added to a plan, criteria that contributes to condition filtering are found, and they are given heuristics values or histogram (from 8.0 if available) or other ways, to find how many rows pass the criteria. But it does not perform range reduction that is performed while accessing through index in range optimization, which causes to over estimate some times by double dipping. 

How to repeat:
In a query like below

```sql
select  
e.emp_no,  
concat(e.first_name, ' ', e.last_name),  
d.dept_name,  
t.title_name,  
es.salary,  
es.insurance,  
es.pf,  
ea.city,  
ea.state,  
ea.phone  
from employee e  
join emp_salary es on e.emp_no = es.emp_no  
join emp_title et on et.emp_no = e.emp_no  
join title t on t.title_no = et.title_no  
join emp_address_phone ea on ea.emp_no = e.emp_no  
join emp_dept ed on e.emp_no = ed.emp_no  
join department d on d.dept_no = ed.dept_no  
where  
(e.hire_date > '2020-01-01' or e.hire_date < '1972-01-01' or e.hire_date < '1971-01-01') 
and (t.title_created > '2006-01-01') 
and (ea.country = 'Spain' or ea.country = 'Samoa' or ea.country = 'India');
```

Access happens through the criteria on `ea.country`, and when `e` is added to join order, condition filtering is applied for the criteria on `e.hire_date`.

When there is histogram (in version 8.0):

"filtering_effect": [
  {
    "condition": "(`e`.`hire_date` > DATE'2020-01-01')",
    "histogram_selectivity": 0.0085
  },
  {
    "condition": "(`e`.`hire_date` < DATE'1972-01-01')",
    "histogram_selectivity": 0.0426
  },
  {
    "condition": "(`e`.`hire_date` < DATE'1971-01-01')",
    "histogram_selectivity": 0.0204
  }
],
"final_filtering_effect": 0.0701

The value (0.0701) of `final_filtering_effect` comes from the probability formula, i.e.

final_filtering_effect = 0.0085 + 0.0426 + 0.0204 - (0.0085 * 0.0426) - (0.0426 * 0.0204) - (0.0085 * 0.0204) + (0.0085 * 0.0426 * 0.0204)
final_filtering_effect = 0.0701

In collates the probability of all the ranges, without considering to reduce range. This causes double dipping of number of rows for the values that are less than '1971-01-01' as they are included twice.

If the `final_filtering_effect` was calculated only using the reduced range, the value would have been

final_filtering_effect = 0.0085 + 0.0426 - (0.0085 * 0.0426)
final_filtering_effect = 0.0507

There are total of `10000` rows in the relation, hence the condition seems to filter around `507` rows, and if we see the real count, that seems to align right. It is `506` and not `701`.

mysql > select count(*) from employee where hire_date > '2020-01-01' or hire_date < '1972-01-01' or hire_date < '1971-01-01';
+----------+
| count(*) |
+----------+
|      506 |
+----------+

When there is no histogram, just going on with heuristics of 33.33 percent for `col [non eq op] val` criteria, the value is double dipped, 

"filtering_effect": [
],
"final_filtering_effect": 0.7037

Suggested fix:
Performing possible range reduction and simplification on the fields that have more than one criterion, that is independent.

Sorry, can you send us all data for this test case ???/

Dump of data on which the query can be run. Please run in order so there is no foreign key error.

Attachment: Dump20200603.zip (application/zip, text), 1.34 MiB.

Thank you, Mr. Arun.