MySQL Bugs: #116775: Performance of TPC-DS Query 16

Bug #116775	Performance of TPC-DS Query 16
Submitted:	25 Nov 2024 18:34	Modified:	28 Nov 2024 8:00
Reporter:	JINSHENG BA	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: Optimizer	Severity:	S5 (Performance)
Version:	596f0d23 (9.0.0), 8.0.40	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
For TPC-DS Query 16, its execution time and query plan:

EXPLAIN
-> Limit: 100 row(s)  (cost=85.5e+6 rows=1) (actual time=5641..5641 rows=1 loops=1)
    -> Aggregate: count(distinct cs1.cs_order_number), sum(cs1.cs_ext_ship_cost), sum(cs1.cs_net_profit)  (cost=85.5e+6 rows=1) (actual time=5641..5641 rows=1 loops=1)
        -> Nested loop antijoin  (cost=72.1e+6 rows=58e+6) (actual time=1645..5641 rows=495 loops=1)
            -> Hash semijoin (cs2.cs_order_number = cs1.cs_order_number), extra conditions: (cs1.cs_warehouse_sk <> cs2.cs_warehouse_sk)  (cost=66.3e+6 rows=58e+6) (actual time=1291..5558 rows=1463 loops=1)
                -> Nested loop inner join  (cost=106996 rows=464) (actual time=0.97..4129 rows=1463 loops=1)
                    -> Nested loop inner join  (cost=92982 rows=4172) (actual time=0.951..4028 rows=46226 loops=1)
                        -> Inner hash join (cs1.cs_call_center_sk = call_center.cc_call_center_sk)  (cost=54041 rows=41723) (actual time=0.802..1355 rows=1.43e+6 loops=1)
                            -> Filter: ((cs1.cs_ship_addr_sk is not null) and (cs1.cs_ship_date_sk is not null))  (cost=8741 rows=139076) (actual time=0.482..1101 rows=1.44e+6 loops=1)
                                -> Table scan on cs1  (cost=8741 rows=1.39e+6) (actual time=0.48..995 rows=1.44e+6 loops=1)
                            -> Hash
                                -> Filter: (call_center.cc_county in ('Williamson County','Williamson County','Williamson County','Williamson County','Williamson County'))  (cost=1.6 rows=3) (actual time=0.285..0.292 rows=6 loops=1)
                                    -> Table scan on call_center  (cost=1.6 rows=6) (actual time=0.279..0.282 rows=6 loops=1)
                        -> Filter: (customer_address.ca_state = 'IL')  (cost=0.0833 rows=0.1) (actual time=0.00178..0.00178 rows=0.0322 loops=1.43e+6)
                            -> Single-row index lookup on customer_address using PRIMARY (ca_address_sk=cs1.cs_ship_addr_sk)  (cost=0.0833 rows=1) (actual time=0.00157..0.0016 rows=0.997 loops=1.43e+6)
                    -> Filter: (date_dim.d_date between '1999-2-01' and <cache>((cast('1999-2-01' as date) + interval 60 day)))  (cost=0.326 rows=0.111) (actual time=0.00208..0.00208 rows=0.0316 loops=46226)
                        -> Single-row index lookup on date_dim using PRIMARY (d_date_sk=cs1.cs_ship_date_sk)  (cost=0.326 rows=1) (actual time=0.00167..0.00169 rows=0.999 loops=46226)
                -> Hash
                    -> Table scan on cs2  (cost=1.76e+6 rows=1.39e+6) (actual time=13.8..1012 rows=1.44e+6 loops=1)
            -> Filter: (cs2.cs_order_number = cs1.cs_order_number)  (cost=0.00324..0.00324 rows=1) (actual time=0.0564..0.0564 rows=0.662 loops=1463)
                -> Single-row index lookup on <subquery3> using <auto_distinct_key> (cr_order_number=cs1.cs_order_number)  (cost=48483..48483 rows=1) (actual time=0.0563..0.0563 rows=0.662 loops=1463)
                    -> Materialize with deduplication  (cost=48483..48483 rows=142945) (actual time=81.7..81.7 rows=94519 loops=1)
                        -> Filter: (cr1.cr_order_number is not null)  (cost=15546 rows=142945) (actual time=0.368..52.2 rows=144067 loops=1)
                            -> Covering index scan on cr1 using cr_d1  (cost=15546 rows=142945) (actual time=0.367..44.6 rows=144067 loops=1)

If we apply the patch:

diff --git a/sql/sql_optimizer.cc b/sql/sql_optimizer.cc
index 2b8f6c523aa..5521c8d9ba5 100644
--- a/sql/sql_optimizer.cc
+++ b/sql/sql_optimizer.cc
@@ -8469,7 +8469,7 @@ static bool update_ref_and_keys(THD *thd, Key_use_array *keyuse,
       prev = use;
       found_eq_constant = use->val->const_for_execution();
       /* Save ptr to first use */
-      if (!table->reginfo.join_tab->keyuse())
+      if (table->reginfo.join_tab->keyuse())
         table->reginfo.join_tab->set_keyuse(save_pos);
       table->reginfo.join_tab->checked_keys.set_bit(use->key);
       save_pos++;

The execution time and query plan:

EXPLAIN
-> Limit: 100 row(s)  (cost=7.15e+6 rows=1) (actual time=3404..3404 rows=1 loops=1)
    -> Aggregate: count(distinct cs1.cs_order_number), sum(cs1.cs_ext_ship_cost), sum(cs1.cs_net_profit)  (cost=7.15e+6 rows=1) (actual time=3404..3404 rows=1 loops=1)
        -> Nested loop antijoin  (cost=7.15e+6 rows=14.5) (actual time=3356..3403 rows=495 loops=1)
            -> Inner hash join (date_dim.d_date_sk = cs1.cs_ship_date_sk)  (cost=7.11e+6 rows=14.5) (actual time=3277..3324 rows=1463 loops=1)
                -> Filter: (date_dim.d_date between '1999-2-01' and <cache>((cast('1999-2-01' as date) + interval 60 day)))  (cost=4.73 rows=1) (actual time=37.7..63.9 rows=61 loops=1)
                    -> Table scan on date_dim  (cost=4.73 rows=72468) (actual time=13.5..52.8 rows=73049 loops=1)
                -> Hash
                    -> Hash semijoin (cs2.cs_order_number = cs1.cs_order_number), extra conditions: (cs1.cs_warehouse_sk <> cs2.cs_warehouse_sk)  (cost=5.03e+6 rows=1.05e+6) (actual time=2617..3225 rows=46218 loops=1)
                        -> Inner hash join (customer_address.ca_address_sk = cs1.cs_ship_addr_sk)  (cost=1.66e+6 rows=8.4) (actual time=1503..1903 rows=46226 loops=1)
                            -> Filter: (customer_address.ca_state = 'IL')  (cost=2.85 rows=1) (actual time=0.197..31.3 rows=1596 loops=1)
                                -> Table scan on customer_address  (cost=2.85 rows=49649) (actual time=0.188..28.1 rows=50000 loops=1)
                            -> Hash
                                -> Inner hash join (cs1.cs_call_center_sk = call_center.cc_call_center_sk)  (cost=427870 rows=417228) (actual time=0.47..1230 rows=1.43e+6 loops=1)
                                    -> Table scan on cs1  (cost=8183 rows=1.39e+6) (actual time=0.159..1009 rows=1.44e+6 loops=1)
                                    -> Hash
                                        -> Filter: (call_center.cc_county in ('Williamson County','Williamson County','Williamson County','Williamson County','Williamson County'))  (cost=1.6 rows=3) (actual time=0.275..0.281 rows=6 loops=1)
                                            -> Table scan on call_center  (cost=1.6 rows=6) (actual time=0.269..0.271 rows=6 loops=1)
                        -> Hash
                            -> Table scan on cs2  (cost=894759 rows=1.39e+6) (actual time=13.5..930 rows=1.44e+6 loops=1)
            -> Filter: (cs2.cs_order_number = cs1.cs_order_number)  (cost=0.292..0.292 rows=1) (actual time=0.0545..0.0545 rows=0.662 loops=1463)
                -> Single-row index lookup on <subquery3> using <auto_distinct_key> (cr_order_number=cs1.cs_order_number)  (cost=48483..48483 rows=1) (actual time=0.0544..0.0544 rows=0.662 loops=1463)
                    -> Materialize with deduplication  (cost=48483..48483 rows=142945) (actual time=79.1..79.1 rows=94519 loops=1)
                        -> Filter: (cr1.cr_order_number is not null)  (cost=15546 rows=142945) (actual time=0.309..49.5 rows=144067 loops=1)
                            -> Covering index scan on cr1 using cr_d1  (cost=15546 rows=142945) (actual time=0.308..41.9 rows=144067 loops=1)

The performance is improved by around 50%. I think the key difference between both query plans is the joining order. Similarlly, the second joining order is more efficient, so whether we can optimize any code to enable it?

How to repeat:
I dumped the entire database: https://drive.google.com/file/d/1mCwATIQtNwYftxVvMCNz6xgjvRDCrhtq/view?usp=sharing

Compile MySQL in two versions, one is original and the other is with the changed IF condition. Then compare the execution of the below query on both versions.

TPC-DS query 16:
select  
   count(distinct cs_order_number) as "order count"
  ,sum(cs_ext_ship_cost) as "total shipping cost"
  ,sum(cs_net_profit) as "total net profit"
from
   catalog_sales cs1
  ,date_dim
  ,customer_address
  ,call_center
where
    d_date between '1999-2-01' and 
           DATE_ADD(cast('1999-2-01' as date), interval 60 day)
and cs1.cs_ship_date_sk = d_date_sk
and cs1.cs_ship_addr_sk = ca_address_sk
and ca_state = 'IL'
and cs1.cs_call_center_sk = cc_call_center_sk
and cc_county in ('Williamson County','Williamson County','Williamson County','Williamson County',
                  'Williamson County'
)
and exists (select *
            from catalog_sales cs2
            where cs1.cs_order_number = cs2.cs_order_number
              and cs1.cs_warehouse_sk <> cs2.cs_warehouse_sk)
and not exists(select *
               from catalog_returns cr1
               where cs1.cs_order_number = cr1.cr_order_number)
order by count(distinct cs_order_number)
limit 100;

Hello JINSHENG BA,

Thank you for the report and feedback.
I'm trying to reproduce the issue with provided details and get back to you if anything further needed.
I assume you have tested against the default server settings(if this is not the case then please attach the configuration details to the report).

regards,
Umesh

Hello JINSHENG BA,

Thank you for the report and details.
Observed this with 8.0.40 build.

regards,
Umesh