Description:
For the query 19 in TPC-DS benchmark, the query plan and performance:
EXPLAIN
-> Limit: 100 row(s) (actual time=445..445 rows=100 loops=1)
-> Sort: ext_price DESC, item.i_brand, item.i_brand_id, item.i_manufact_id, item.i_manufact, limit input to 100 row(s) per chunk (actual time=445..445 rows=100 loops=1)
-> Table scan on <temporary> (actual time=445..445 rows=168 loops=1)
-> Aggregate using temporary table (actual time=445..445 rows=168 loops=1)
-> Nested loop inner join (cost=79954 rows=1242) (actual time=0.483..443 rows=1594 loops=1)
-> Nested loop inner join (cost=79571 rows=1242) (actual time=0.473..438 rows=1594 loops=1)
-> Nested loop inner join (cost=79098 rows=1242) (actual time=0.465..433 rows=1607 loops=1)
-> Nested loop inner join (cost=56396 rows=24844) (actual time=0.444..335 rows=49715 loops=1)
-> Inner hash join (no condition) (cost=3862 rows=2108) (actual time=0.118..9.57 rows=3924 loops=1)
-> Filter: (item.i_manager_id = 7) (cost=161 rows=1757) (actual time=0.0689..9.06 rows=327 loops=1)
-> Table scan on item (cost=161 rows=17565) (actual time=0.0588..8.29 rows=18000 loops=1)
-> Hash
-> Table scan on store (cost=1.45 rows=12) (actual time=0.0299..0.0356 rows=12 loops=1)
-> Filter: ((store_sales.ss_store_sk = store.s_store_sk) and (store_sales.ss_sold_date_sk is not null) and (store_sales.ss_customer_sk is not null)) (cost=1.31 rows=11.8) (actual time=0.0571..0.0822 rows=12.7 loops=3924)
-> Index lookup on store_sales using PRIMARY (ss_item_sk=item.i_item_sk) (cost=1.31 rows=118) (actual time=0.0318..0.073 rows=159 loops=3924)
-> Filter: ((date_dim.d_year = 1999) and (date_dim.d_moy = 11)) (cost=0.0814 rows=0.05) (actual time=0.00187..0.00187 rows=0.0323 loops=49715)
-> Single-row index lookup on date_dim using PRIMARY (d_date_sk=store_sales.ss_sold_date_sk) (cost=0.0814 rows=1) (actual time=0.00166..0.00168 rows=0.977 loops=49715)
-> Filter: (customer.c_current_addr_sk is not null) (cost=0.0281 rows=1) (actual time=0.00288..0.00296 rows=0.992 loops=1607)
-> Single-row index lookup on customer using PRIMARY (c_customer_sk=store_sales.ss_customer_sk) (cost=0.0281 rows=1) (actual time=0.00273..0.00276 rows=0.992 loops=1607)
-> Filter: (substr(customer_address.ca_zip,1,5) <> substr(store.s_zip,1,5)) (cost=0.0208 rows=1) (actual time=0.0027..0.00278 rows=1 loops=1594)
-> Single-row index lookup on customer_address using PRIMARY (ca_address_sk=customer.c_current_addr_sk) (cost=0.0208 rows=1) (actual time=0.00235..0.00237 rows=1 loops=1594)
After applying this patch:
diff --git a/sql/sql_planner.cc b/sql/sql_planner.cc
index ad0c639abca..36a1cd3d2de 100644
--- a/sql/sql_planner.cc
+++ b/sql/sql_planner.cc
@@ -1492,7 +1492,7 @@ float calculate_condition_filter(const JOIN_TAB *const tab,
effects on DBT-3 was observed when removing it, so keeping
it for now.
*/
- if ((filter * fanout) < 0.05F) filter = 0.05F / static_cast<float>(fanout);
+ if (!((filter * fanout) < 0.05F)) filter = 0.05F / static_cast<float>(fanout);
cleanup:
filtering_effect_trace.end();
The performance is significantly improved:
EXPLAIN
-> Limit: 100 row(s) (actual time=236..236 rows=100 loops=1)
-> Sort: ext_price DESC, item.i_brand, item.i_brand_id, item.i_manufact_id, item.i_manufact, limit input to 100 row(s) per chunk (actual time=236..236 rows=100 loops=1)
-> Table scan on <temporary> (actual time=236..236 rows=168 loops=1)
-> Aggregate using temporary table (actual time=236..236 rows=168 loops=1)
-> Nested loop inner join (cost=1925 rows=62.5e-9) (actual time=0.291..234 rows=1594 loops=1)
-> Nested loop inner join (cost=1925 rows=1.25e-6) (actual time=0.277..229 rows=1594 loops=1)
-> Nested loop inner join (cost=1925 rows=25e-6) (actual time=0.268..224 rows=1607 loops=1)
-> Nested loop inner join (cost=1925 rows=0.0025) (actual time=0.213..117 rows=49715 loops=1)
-> Nested loop inner join (cost=1925 rows=0.0025) (actual time=0.204..46.3 rows=52005 loops=1)
-> Filter: (item.i_manager_id = 7) (cost=1925 rows=0.05) (actual time=0.115..9.71 rows=327 loops=1)
-> Table scan on item (cost=1925 rows=17565) (actual time=0.103..8.89 rows=18000 loops=1)
-> Filter: ((store_sales.ss_store_sk is not null) and (store_sales.ss_sold_date_sk is not null) and (store_sales.ss_customer_sk is not null)) (cost=1.47 rows=0.05) (actual time=0.0358..0.101 rows=159 loops=327)
-> Index lookup on store_sales using PRIMARY (ss_item_sk=item.i_item_sk) (cost=1.47 rows=118) (actual time=0.0356..0.0853 rows=159 loops=327)
-> Single-row index lookup on store using PRIMARY (s_store_sk=store_sales.ss_store_sk) (cost=2.25 rows=1) (actual time=0.00118..0.00121 rows=0.956 loops=52005)
-> Filter: ((date_dim.d_year = 1999) and (date_dim.d_moy = 11)) (cost=8.98 rows=0.01) (actual time=0.00204..0.00204 rows=0.0323 loops=49715)
-> Single-row index lookup on date_dim using PRIMARY (d_date_sk=store_sales.ss_sold_date_sk) (cost=8.98 rows=1) (actual time=0.00181..0.00184 rows=0.977 loops=49715)
-> Filter: (customer.c_current_addr_sk is not null) (cost=4001 rows=0.05) (actual time=0.00291..0.003 rows=0.992 loops=1607)
-> Single-row index lookup on customer using PRIMARY (c_customer_sk=store_sales.ss_customer_sk) (cost=4001 rows=1) (actual time=0.00275..0.00277 rows=0.992 loops=1607)
-> Filter: (substr(customer_address.ca_zip,1,5) <> substr(store.s_zip,1,5)) (cost=80000 rows=0.05) (actual time=0.00276..0.00285 rows=1 loops=1594)
-> Single-row index lookup on customer_address using PRIMARY (ca_address_sk=customer.c_current_addr_sk) (cost=80000 rows=1) (actual time=0.00237..0.0024 rows=1 loops=1594)
The second plan uses nested loop inner join, instead of index hash join. It seems the second query plan is more efficient, but is not chosen.
How to repeat:
I dumped the entire database: https://drive.google.com/file/d/1mCwATIQtNwYftxVvMCNz6xgjvRDCrhtq/view?usp=sharing
Compile MySQL in two versions, one is original and the other is with the changed IF condition. Then compare the execution of the below query on both versions.
TPC-DS query 19:
select i_brand_id brand_id, i_brand brand, i_manufact_id, i_manufact,
sum(ss_ext_sales_price) ext_price
from date_dim, store_sales, item,customer,customer_address,store
where d_date_sk = ss_sold_date_sk
and ss_item_sk = i_item_sk
and i_manager_id=7
and d_moy=11
and d_year=1999
and ss_customer_sk = c_customer_sk
and c_current_addr_sk = ca_address_sk
and substr(ca_zip,1,5) <> substr(s_zip,1,5)
and ss_store_sk = s_store_sk
group by i_brand
,i_brand_id
,i_manufact_id
,i_manufact
order by ext_price desc
,i_brand
,i_brand_id
,i_manufact_id
,i_manufact
limit 100 ;