Description:
For TPC-DS benchmark query 15:
select ca_zip
,sum(cs_sales_price)
from catalog_sales
,customer
,customer_address
,date_dim
where cs_bill_customer_sk = c_customer_sk
and c_current_addr_sk = ca_address_sk
and ( substr(ca_zip,1,5) in ('85669', '86197','88274','83405','86475',
'85392', '85460', '80348', '81792')
or ca_state in ('CA','WA','GA')
or cs_sales_price > 500)
and cs_sold_date_sk = d_date_sk
and d_qoy = 2 and d_year = 2000
group by ca_zip
order by ca_zip
limit 100;
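Note: the plans below are EXPLAIN ANALYZE output (EXPLAIN ANALYZE, available since MySQL 8.0.18, executes the query and annotates each iterator with measured times and row counts). They can be reproduced by prefixing the query:
EXPLAIN ANALYZE
select ca_zip, sum(cs_sales_price)
from catalog_sales, customer, customer_address, date_dim
where cs_bill_customer_sk = c_customer_sk
  and c_current_addr_sk = ca_address_sk
  and (substr(ca_zip,1,5) in ('85669','86197','88274','83405','86475','85392','85460','80348','81792')
       or ca_state in ('CA','WA','GA')
       or cs_sales_price > 500)
  and cs_sold_date_sk = d_date_sk
  and d_qoy = 2 and d_year = 2000
group by ca_zip
order by ca_zip
limit 100;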
The execution time and query plan on the unmodified server:
EXPLAIN
-> Limit: 100 row(s) (actual time=3840..3840 rows=100 loops=1)
    -> Sort: customer_address.ca_zip, limit input to 100 row(s) per chunk (actual time=3840..3840 rows=100 loops=1)
        -> Table scan on <temporary> (actual time=3840..3840 rows=202 loops=1)
            -> Aggregate using temporary table (actual time=3840..3840 rows=202 loops=1)
                -> Nested loop inner join (cost=1.71e+6 rows=69538) (actual time=15.9..3837 rows=3444 loops=1)
                    -> Nested loop inner join (cost=1.69e+6 rows=69538) (actual time=14.2..3715 rows=41625 loops=1)
                        -> Nested loop inner join (cost=1.65e+6 rows=69538) (actual time=14.1..3449 rows=41739 loops=1)
                            -> Filter: ((catalog_sales.cs_sold_date_sk is not null) and (catalog_sales.cs_bill_customer_sk is not null)) (cost=152236 rows=1.39e+6) (actual time=13.8..1076 rows=1.44e+6 loops=1)
                                -> Table scan on catalog_sales (cost=152236 rows=1.39e+6) (actual time=13.8..981 rows=1.44e+6 loops=1)
                            -> Filter: ((date_dim.d_year = 2000) and (date_dim.d_qoy = 2)) (cost=0.979 rows=0.05) (actual time=0.00157..0.00157 rows=0.029 loops=1.44e+6)
                                -> Single-row index lookup on date_dim using PRIMARY (d_date_sk=catalog_sales.cs_sold_date_sk) (cost=0.979 rows=1) (actual time=0.00138..0.0014 rows=0.995 loops=1.44e+6)
                        -> Filter: (customer.c_current_addr_sk is not null) (cost=0.374 rows=1) (actual time=0.00615..0.00622 rows=0.997 loops=41739)
                            -> Single-row index lookup on customer using PRIMARY (c_customer_sk=catalog_sales.cs_bill_customer_sk) (cost=0.374 rows=1) (actual time=0.00602..0.00604 rows=0.997 loops=41739)
                    -> Filter: ((substr(customer_address.ca_zip,1,5) in ('85669','86197','88274','83405','86475','85392','85460','80348','81792')) or (customer_address.ca_state in ('CA','WA','GA')) or (catalog_sales.cs_sales_price > 500.00)) (cost=0.258 rows=1) (actual time=0.00282..0.00283 rows=0.0827 loops=41625)
                        -> Single-row index lookup on customer_address using PRIMARY (ca_address_sk=customer.c_current_addr_sk) (cost=0.258 rows=1) (actual time=0.00223..0.00225 rows=1 loops=41625)
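As a cross-check that needs no recompilation, it should be possible to push the optimizer toward a hash-join plan with index hints: if the primary-key lookups are ignored, the planner has to fall back to table scans, which the executor (MySQL 8.0.20 and later) joins with hash joins. This is only a sketch under that assumption; depending on the secondary indexes present (e.g. c_a on customer), further indexes may need to be ignored as well:
-- Hypothetical cross-check: forbid PK lookups on the three lookup tables
-- so that ref/eq_ref access is unavailable and hash joins are used instead.
select ca_zip, sum(cs_sales_price)
from catalog_sales
  join date_dim ignore index (PRIMARY) on cs_sold_date_sk = d_date_sk
  join customer ignore index (PRIMARY) on cs_bill_customer_sk = c_customer_sk
  join customer_address ignore index (PRIMARY) on c_current_addr_sk = ca_address_sk
where (substr(ca_zip,1,5) in ('85669','86197','88274','83405','86475','85392','85460','80348','81792')
       or ca_state in ('CA','WA','GA')
       or cs_sales_price > 500)
  and d_qoy = 2 and d_year = 2000
group by ca_zip
order by ca_zip
limit 100;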
If we apply this patch:
diff --git a/sql/sql_planner.cc b/sql/sql_planner.cc
index ad0c639abca..7cf5880f29a 100644
--- a/sql/sql_planner.cc
+++ b/sql/sql_planner.cc
@@ -210,7 +210,7 @@ Key_use *Optimize_table_order::find_best_ref(
const double prefix_rowcount, bool *found_condition,
table_map *ref_depend_map, uint *used_key_parts) {
// Skip finding best_ref if quick object is forced by hint.
- if (tab->range_scan() && get_forced_by_hint(tab->range_scan()))
+ if (!(tab->range_scan() && get_forced_by_hint(tab->range_scan())))
return nullptr;
// Return value - will point to Key_use of the index with cheapest ref access
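My reading of this change, as an interpretation of the planner code rather than an authoritative description:
// Original: skip ref-access costing only when a range scan is forced
// by an index hint; for every other table, find_best_ref() goes on to
// cost ref/eq_ref (index lookup) access paths.
if (tab->range_scan() && get_forced_by_hint(tab->range_scan()))
  return nullptr;

// Patched: the negated condition returns early in all remaining cases,
// so ref/eq_ref access is effectively never considered. The planner
// then falls back to table scans, which the iterator executor runs as
// hash joins, producing the second plan below.
if (!(tab->range_scan() && get_forced_by_hint(tab->range_scan())))
  return nullptr;
The patch is of course not a proposed fix; it is just a blunt way to show that the hash-join plan is available and far cheaper for this query.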
The performance is significantly improved:
EXPLAIN
-> Limit: 100 row(s) (actual time=1615..1615 rows=100 loops=1)
    -> Sort: customer_address.ca_zip, limit input to 100 row(s) per chunk (actual time=1615..1615 rows=100 loops=1)
        -> Table scan on <temporary> (actual time=1615..1615 rows=202 loops=1)
            -> Aggregate using temporary table (actual time=1615..1615 rows=202 loops=1)
                -> Filter: ((substr(customer_address.ca_zip,1,5) in ('85669','86197','88274','83405','86475','85392','85460','80348','81792')) or (customer_address.ca_state in ('CA','WA','GA')) or (catalog_sales.cs_sales_price > 500.00)) (cost=1.5e+6 rows=19.2) (actual time=1508..1613 rows=3444 loops=1)
                    -> Inner hash join (customer_address.ca_address_sk = customer.c_current_addr_sk) (cost=1.5e+6 rows=19.2) (actual time=1508..1601 rows=41625 loops=1)
                        -> Table scan on customer_address (cost=0.0245 rows=49649) (actual time=0.231..38.1 rows=50000 loops=1)
                        -> Hash
                            -> Inner hash join (customer.c_customer_sk = catalog_sales.cs_bill_customer_sk) (cost=1.37e+6 rows=19.2) (actual time=1402..1501 rows=41625 loops=1)
                                -> Covering index scan on customer using c_a (cost=0.185 rows=98494) (actual time=1.08..27.6 rows=100000 loops=1)
                                -> Hash
                                    -> Inner hash join (date_dim.d_date_sk = catalog_sales.cs_sold_date_sk) (cost=919916 rows=19.2) (actual time=1162..1395 rows=41739 loops=1)
                                        -> Filter: ((date_dim.d_year = 2000) and (date_dim.d_qoy = 2)) (cost=0.453 rows=1) (actual time=21.3..45.8 rows=91 loops=1)
                                            -> Table scan on date_dim (cost=0.453 rows=72468) (actual time=0.289..42.5 rows=73049 loops=1)
                                        -> Hash
                                            -> Table scan on catalog_sales (cost=150221 rows=1.39e+6) (actual time=13.6..960 rows=1.44e+6 loops=1)
Given this, I am wondering whether the optimizer logic can be improved so that the second, much more efficient query plan (about 1.6 s versus 3.8 s here) is chosen without such a patch?
How to repeat:
I dumped the entire database; it can be downloaded here: https://drive.google.com/file/d/1mCwATIQtNwYftxVvMCNz6xgjvRDCrhtq/view?usp=sharing
Compile MySQL twice: once from the unmodified source, and once with the inverted if condition shown above. Then run the query above on both builds and compare the execution times and plans.