Bug #116774 Performance of TPC-DS Query 15
Submitted: 25 Nov 2024 18:00 Modified: 28 Nov 2024 8:00
Reporter: JINSHENG BA
Status: Verified
Impact on me: None
Category: MySQL Server: Optimizer
Severity: S5 (Performance)
Version: 596f0d23 (9.0.0), 8.0.40
OS: Any
Assigned to:
CPU Architecture: Any

[25 Nov 2024 18:00] JINSHENG BA
Description:
For TPC-DS benchmark query 15:

select  ca_zip
       ,sum(cs_sales_price)
 from catalog_sales
     ,customer
     ,customer_address
     ,date_dim
 where cs_bill_customer_sk = c_customer_sk
 	and c_current_addr_sk = ca_address_sk 
 	and ( substr(ca_zip,1,5) in ('85669', '86197','88274','83405','86475',
                                   '85392', '85460', '80348', '81792')
 	      or ca_state in ('CA','WA','GA')
 	      or cs_sales_price > 500)
 	and cs_sold_date_sk = d_date_sk
 	and d_qoy = 2 and d_year = 2000
 group by ca_zip
 order by ca_zip
 limit 100;

The execution time and query plan (from EXPLAIN ANALYZE):
EXPLAIN
-> Limit: 100 row(s)  (actual time=3840..3840 rows=100 loops=1)
    -> Sort: customer_address.ca_zip, limit input to 100 row(s) per chunk  (actual time=3840..3840 rows=100 loops=1)
        -> Table scan on <temporary>  (actual time=3840..3840 rows=202 loops=1)
            -> Aggregate using temporary table  (actual time=3840..3840 rows=202 loops=1)
                -> Nested loop inner join  (cost=1.71e+6 rows=69538) (actual time=15.9..3837 rows=3444 loops=1)
                    -> Nested loop inner join  (cost=1.69e+6 rows=69538) (actual time=14.2..3715 rows=41625 loops=1)
                        -> Nested loop inner join  (cost=1.65e+6 rows=69538) (actual time=14.1..3449 rows=41739 loops=1)
                            -> Filter: ((catalog_sales.cs_sold_date_sk is not null) and (catalog_sales.cs_bill_customer_sk is not null))  (cost=152236 rows=1.39e+6) (actual time=13.8..1076 rows=1.44e+6 loops=1)
                                -> Table scan on catalog_sales  (cost=152236 rows=1.39e+6) (actual time=13.8..981 rows=1.44e+6 loops=1)
                            -> Filter: ((date_dim.d_year = 2000) and (date_dim.d_qoy = 2))  (cost=0.979 rows=0.05) (actual time=0.00157..0.00157 rows=0.029 loops=1.44e+6)
                                -> Single-row index lookup on date_dim using PRIMARY (d_date_sk=catalog_sales.cs_sold_date_sk)  (cost=0.979 rows=1) (actual time=0.00138..0.0014 rows=0.995 loops=1.44e+6)
                        -> Filter: (customer.c_current_addr_sk is not null)  (cost=0.374 rows=1) (actual time=0.00615..0.00622 rows=0.997 loops=41739)
                            -> Single-row index lookup on customer using PRIMARY (c_customer_sk=catalog_sales.cs_bill_customer_sk)  (cost=0.374 rows=1) (actual time=0.00602..0.00604 rows=0.997 loops=41739)
                    -> Filter: ((substr(customer_address.ca_zip,1,5) in ('85669','86197','88274','83405','86475','85392','85460','80348','81792')) or (customer_address.ca_state in ('CA','WA','GA')) or (catalog_sales.cs_sales_price > 500.00))  (cost=0.258 rows=1) (actual time=0.00282..0.00283 rows=0.0827 loops=41625)
                        -> Single-row index lookup on customer_address using PRIMARY (ca_address_sk=customer.c_current_addr_sk)  (cost=0.258 rows=1) (actual time=0.00223..0.00225 rows=1 loops=41625)

If we apply this patch:

diff --git a/sql/sql_planner.cc b/sql/sql_planner.cc
index ad0c639abca..7cf5880f29a 100644
--- a/sql/sql_planner.cc
+++ b/sql/sql_planner.cc
@@ -210,7 +210,7 @@ Key_use *Optimize_table_order::find_best_ref(
     const double prefix_rowcount, bool *found_condition,
     table_map *ref_depend_map, uint *used_key_parts) {
   // Skip finding best_ref if quick object is forced by hint.
-  if (tab->range_scan() && get_forced_by_hint(tab->range_scan()))
+  if (!(tab->range_scan() && get_forced_by_hint(tab->range_scan())))
     return nullptr;
 
   // Return value - will point to Key_use of the index with cheapest ref access

The performance is significantly improved:

EXPLAIN
-> Limit: 100 row(s)  (actual time=1615..1615 rows=100 loops=1)
    -> Sort: customer_address.ca_zip, limit input to 100 row(s) per chunk  (actual time=1615..1615 rows=100 loops=1)
        -> Table scan on <temporary>  (actual time=1615..1615 rows=202 loops=1)
            -> Aggregate using temporary table  (actual time=1615..1615 rows=202 loops=1)
                -> Filter: ((substr(customer_address.ca_zip,1,5) in ('85669','86197','88274','83405','86475','85392','85460','80348','81792')) or (customer_address.ca_state in ('CA','WA','GA')) or (catalog_sales.cs_sales_price > 500.00))  (cost=1.5e+6 rows=19.2) (actual time=1508..1613 rows=3444 loops=1)
                    -> Inner hash join (customer_address.ca_address_sk = customer.c_current_addr_sk)  (cost=1.5e+6 rows=19.2) (actual time=1508..1601 rows=41625 loops=1)
                        -> Table scan on customer_address  (cost=0.0245 rows=49649) (actual time=0.231..38.1 rows=50000 loops=1)
                        -> Hash
                            -> Inner hash join (customer.c_customer_sk = catalog_sales.cs_bill_customer_sk)  (cost=1.37e+6 rows=19.2) (actual time=1402..1501 rows=41625 loops=1)
                                -> Covering index scan on customer using c_a  (cost=0.185 rows=98494) (actual time=1.08..27.6 rows=100000 loops=1)
                                -> Hash
                                    -> Inner hash join (date_dim.d_date_sk = catalog_sales.cs_sold_date_sk)  (cost=919916 rows=19.2) (actual time=1162..1395 rows=41739 loops=1)
                                        -> Filter: ((date_dim.d_year = 2000) and (date_dim.d_qoy = 2))  (cost=0.453 rows=1) (actual time=21.3..45.8 rows=91 loops=1)
                                            -> Table scan on date_dim  (cost=0.453 rows=72468) (actual time=0.289..42.5 rows=73049 loops=1)
                                        -> Hash
                                            -> Table scan on catalog_sales  (cost=150221 rows=1.39e+6) (actual time=13.6..960 rows=1.44e+6 loops=1)

Similarly, I am wondering whether the optimizer logic could be adjusted so that it chooses the second query plan, which is much more efficient.

How to repeat:
I dumped the entire database: https://drive.google.com/file/d/1mCwATIQtNwYftxVvMCNz6xgjvRDCrhtq/view?usp=sharing

Compile MySQL twice: once from the original source, and once with the changed if condition shown above. Then compare the execution of the above query on both builds.
[27 Nov 2024 7:03] MySQL Verification Team
Hello JINSHENG BA,

Thank you for the report and feedback.
I'm trying to reproduce the issue with the provided details and will get back to you if anything further is needed.
I assume you have tested against the default server settings (if this is not the case, please attach the configuration details to the report).

regards,
Umesh
[28 Nov 2024 8:00] MySQL Verification Team
Hello JINSHENG BA,

Thank you for the report and details.
Observed this with the 8.0.40 build.

regards,
Umesh