Bug #116309 Performance of TPC-H Query
Submitted: 7 Oct 2024 10:39 Modified: 22 Oct 2024 11:32
Reporter: JINSHENG BA Email Updates:
Status: Verified Impact on me: None
Category: MySQL Server: Optimizer Severity: S5 (Performance)
Version: 596f0d23 (9.0.0), 9.1.0 OS: Ubuntu
Assigned to: CPU Architecture: Any

[7 Oct 2024 10:39] JINSHENG BA
Description:
A suboptimal query plan is chosen for query 16 of the TPC-H benchmark.

$ mysql tpch < 16.sql
EXPLAIN
-> Sort: supplier_cnt DESC, PART.P_BRAND, PART.P_TYPE, PART.P_SIZE  (actual time=8208..8209 rows=24525 loops=1)
    -> Stream results  (cost=10.4e+9 rows=3.54e+6) (actual time=495..8161 rows=24525 loops=1)
        -> Group aggregate: count(distinct PARTSUPP.PS_SUPPKEY)  (cost=10.4e+9 rows=3.54e+6) (actual time=495..8136 rows=24525 loops=1)
            -> Nested loop antijoin  (cost=3.14e+9 rows=31.4e+9) (actual time=494..7932 rows=237868 loops=1)
                -> Nested loop inner join  (cost=324912 rows=1.58e+6) (actual time=468..7718 rows=237992 loops=1)
                    -> Sort: PART.P_BRAND, PART.P_TYPE, PART.P_SIZE  (cost=41975 rows=395604) (actual time=468..484 rows=59498 loops=1)
                        -> Filter: ((PART.P_BRAND <> 'Brand#34') and (not((PART.P_TYPE like 'LARGE BRUSHED%'))) and (PART.P_SIZE in (48,19,12,4,41,7,21,39)))  (cost=41975 rows=395604) (actual time=0.458..407 rows=59498 loops=1)
                            -> Table scan on PART  (cost=41975 rows=395604) (actual time=0.453..325 rows=400000 loops=1)
                    -> Covering index lookup on PARTSUPP using PRIMARY (PS_PARTKEY=PART.P_PARTKEY)  (cost=0.792 rows=3.98) (actual time=0.119..0.121 rows=4 loops=59498)
                -> Single-row index lookup on <subquery2> using <auto_distinct_key> (s_suppkey=PARTSUPP.PS_SUPPKEY)  (cost=6706..6706 rows=1) (actual time=661e-6..661e-6 rows=521e-6 loops=237992)
                    -> Materialize with deduplication  (cost=6703..6703 rows=19929) (actual time=26.1..26.1 rows=10 loops=1)
                        -> Filter: (SUPPLIER.S_SUPPKEY is not null)  (cost=2111 rows=19929) (actual time=0.633..26.1 rows=10 loops=1)
                            -> Filter: (SUPPLIER.S_COMMENT like '%Customer%Complaints%')  (cost=2111 rows=19929) (actual time=0.632..26.1 rows=10 loops=1)
                                -> Table scan on SUPPLIER  (cost=2111 rows=19929) (actual time=0.0232..16.4 rows=20000 loops=1)

0.06s user 0.01s system 0% cpu 8.288 total

If we disable the true branch of the if condition at https://github.com/mysql/mysql-server/blob/trunk/sql/sql_select.cc#L3261-L3269 (see the patch under "How to repeat" below), we see a performance improvement:
$ mysql tpch < 16.sql
EXPLAIN
-> Sort: supplier_cnt DESC, PART.P_BRAND, PART.P_TYPE, PART.P_SIZE  (actual time=2314..2316 rows=24525 loops=1)
    -> Stream results  (actual time=2178..2294 rows=24525 loops=1)
        -> Group aggregate: count(distinct PARTSUPP.ps_suppkey)  (actual time=2178..2289 rows=24525 loops=1)
            -> Sort: PART.P_BRAND, PART.P_TYPE, PART.P_SIZE  (actual time=2178..2196 rows=237868 loops=1)
                -> Stream results  (cost=2.51e+9 rows=12.6e+9) (actual time=28.5..1926 rows=237868 loops=1)
                    -> Hash antijoin (`<subquery2>`.s_suppkey = PARTSUPP.PS_SUPPKEY)  (cost=2.51e+9 rows=12.6e+9) (actual time=28.5..1869 rows=237868 loops=1)
                        -> Nested loop inner join  (cost=223097 rows=630252) (actual time=0.329..1806 rows=237992 loops=1)
                            -> Filter: ((PART.P_BRAND <> 'Brand#34') and (not((PART.P_TYPE like 'LARGE BRUSHED%'))) and (PART.P_SIZE in (48,19,12,4,41,7,21,39)))  (cost=43030 rows=158244) (actual time=0.311..661 rows=59498 loops=1)
                                -> Table scan on PART  (cost=43030 rows=395604) (actual time=0.305..563 rows=400000 loops=1)
                            -> Covering index lookup on PARTSUPP using PRIMARY (PS_PARTKEY=PART.P_PARTKEY)  (cost=0.74 rows=3.98) (actual time=0.018..0.0189 rows=4 loops=59498)
                        -> Hash
                            -> Table scan on <subquery2>  (cost=6703..6954 rows=19929) (actual time=28.2..28.2 rows=10 loops=1)
                                -> Materialize with deduplication  (cost=6703..6703 rows=19929) (actual time=28.2..28.2 rows=10 loops=1)
                                    -> Filter: (SUPPLIER.S_SUPPKEY is not null)  (cost=2111 rows=19929) (actual time=1.03..28.2 rows=10 loops=1)
                                        -> Filter: (SUPPLIER.S_COMMENT like '%Customer%Complaints%')  (cost=2111 rows=19929) (actual time=1.03..28.2 rows=10 loops=1)
                                            -> Table scan on SUPPLIER  (cost=2111 rows=19929) (actual time=0.249..17.9 rows=20000 loops=1)

0.04s user 0.02s system 2% cpu 2.389 total

The query result is identical, but both the execution time and the estimated cost are reduced by more than 50%.
The difference between the two plans is in how the materialized NOT IN subquery is probed: the first plan runs a nested-loop antijoin with a single-row index lookup into <subquery2> for every outer row, while the second runs a hash antijoin over a single table scan of <subquery2>.
It seems that using the index unexpectedly increases the cost of the query plan.

Query 16:
EXPLAIN ANALYZE
select
    p_brand,
    p_type,
    p_size,
    count(distinct ps_suppkey) as supplier_cnt
from
    PARTSUPP,
    PART
where
    p_partkey = ps_partkey
    and p_brand <> 'Brand#34'
    and p_type not like 'LARGE BRUSHED%'
    and p_size in (48, 19, 12, 4, 41, 7, 21, 39)
    and ps_suppkey not in (
        select
            s_suppkey
        from
            SUPPLIER
        where
            s_comment like '%Customer%Complaints%'
    )
group by
    p_brand,
    p_type,
    p_size
order by
    supplier_cnt desc,
    p_brand,
    p_type,
    p_size;

TPC-H data scale: 2 GB
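
To compare plan shapes without paying the full execution time on every attempt, the estimated plan can be printed with EXPLAIN FORMAT=TREE. A minimal sketch, assuming 16.sql begins with the EXPLAIN ANALYZE line exactly as shown above and using the same connection options as in the How-to-repeat section below:
```
# Rewrite the leading EXPLAIN ANALYZE into EXPLAIN FORMAT=TREE so that only the
# estimated plan is printed and the query is not executed. In the slow plan the
# <subquery2> materialization is probed by a single-row index lookup per outer
# row; in the fast plan it is probed by a hash antijoin over one table scan.
sed 's/^EXPLAIN ANALYZE$/EXPLAIN FORMAT=TREE/' 16.sql \
  | mysql -h 127.0.0.1 -P 3306 -uroot -proot tpch
```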

How to repeat:
Compile two versions of MySQL: one unmodified, and one with the if condition changed. I have attached the patch file that removes the code for your reference.

Patch:
diff --git a/sql/sql_select.cc b/sql/sql_select.cc
index 13584bf3860..b1d17f93b4f 100644
--- a/sql/sql_select.cc
+++ b/sql/sql_select.cc
@@ -3258,21 +3258,10 @@ bool JOIN::setup_semijoin_materialized_table(JOIN_TAB *tab, uint tableno,
       ((uint)tab->idx() == const_tables)
           ? 1.0
           : best_ref[tab->idx() - 1]->position()->prefix_rowcount;
-  if (!sjm_exec->is_scan) {
-    sjm_pos->key = keyuse->begin();  // MaterializeLookup will use the index
-    sjm_pos->read_cost =
-        emb_sj_nest->nested_join->sjm.lookup_cost.total_cost() * fanout;
-    tab->set_keyuse(keyuse->begin());
-    tab->keys().set_bit(0);  // There is one index - use it always
-    tab->set_index(0);
-    sjm_pos->rows_fetched = 1.0;
-    tab->set_type(JT_REF);
-  } else {
     sjm_pos->key = nullptr;  // No index use for MaterializeScan
     sjm_pos->read_cost = tab->read_time * fanout;
     sjm_pos->rows_fetched = static_cast<double>(tab->records());
     tab->set_type(JT_ALL);
-  }
   sjm_pos->set_prefix_join_cost((tab - join_tab), cost_model());
 
   return false;

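For reference, a sketch of how the two server binaries might be built locally, assuming the attached diff applies cleanly to commit 596f0d23 and is saved as remove_lookup_branch.patch (a placeholder name); the cmake invocation is the same default one listed in the comments further below:
```
# Build the unmodified server.
git clone https://github.com/mysql/mysql-server
cd mysql-server
git checkout 596f0d23
mkdir build && cd build
cmake .. && make -j"$(nproc)"
cd ../..

# Build the patched server in a second tree; the patch file name is a
# placeholder for wherever the attached diff is saved.
git clone https://github.com/mysql/mysql-server mysql-server-changed
cd mysql-server-changed
git checkout 596f0d23
git apply ../remove_lookup_branch.patch
mkdir build && cd build
cmake .. && make -j"$(nproc)"
```
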
To facilitate reproduction, I also provide a Docker image of our environment.
```
docker run -it -p 3306:3306 jinshengba/mysql_tpch_2 /bin/bash
```

In the Docker container, you can start the MySQL server with the following commands.

Starting the original MySQL server:
```
/app/mysql-server/build/bin/mysqld --user=root --bind-address=0.0.0.0 --datadir=/app/mysql-data
```

Starting the modified MySQL server:
```
/app/mysql-server-changed/build/bin/mysqld --user=root --bind-address=0.0.0.0 --datadir=/app/mysql-data
```

On the host machine, you can connect to the MySQL server in the Docker container by running:
```
mysql -h 127.0.0.1 -P 3306 -uroot -proot tpch < 16.sql
```

I observed that the cost estimates and execution times fluctuate across executions, so I typically run the query multiple times to get a stable result.
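
Since the timings fluctuate, something like the following loop (a sketch; it assumes the server is reachable with the connection options from the steps above) can be used to collect several wall-clock measurements per server:
```
# Each run executes the full EXPLAIN ANALYZE (so the query itself runs); the
# plan output is discarded and only the wall-clock time reported on stderr by
# the shell's `time` keyword is of interest.
for i in 1 2 3 4 5; do
  time mysql -h 127.0.0.1 -P 3306 -uroot -proot tpch < 16.sql > /dev/null
done
```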
[15 Oct 2024 18:42] JINSHENG BA
May I check whether there is any information I can provide to help reproduce or analyze this case?
[16 Oct 2024 6:00] MySQL Verification Team
Hello Jinsheng Ba,

Thank you for the report and feedback.
My apologies for the delay. Could you please share the exact make options used for the build and the MySQL server configuration file used? Also, I assume the scale factor used is 1 GB (from your other Bug #116271).

regards,
Umesh
[16 Oct 2024 6:10] JINSHENG BA
I used the default configuration options:
```
git clone https://github.com/mysql/mysql-server
cd mysql-server
git checkout 596f0d23
mkdir build
cd build
cmake
make -j$(nproc)
```

For the data scale, this case is based on 2 GB. You can check the data in the Docker container.
[16 Oct 2024 6:11] JINSHENG BA
The ".." is missed for cmake.

```
git clone https://github.com/mysql/mysql-server
cd mysql-server
git checkout 596f0d23
mkdir build
cd build
cmake ..
make -j$(nproc)
```
[16 Oct 2024 6:17] MySQL Verification Team
Thank you for the details.

regards,
Umesh
[21 Oct 2024 16:04] JINSHENG BA
May I ask whether it is reproducible? If so, is the MySQL community interested in such cases?
[21 Oct 2024 16:17] MySQL Verification Team
Hi JINSHENG BA,

Thank you once again for following up on this.
As development expects non-Docker reproduction steps, I was trying to set everything up locally and verify on physical boxes. I had issues building TPC-H, which I fixed today (thanks to George Ma, who shared a copy and a HOW-TO on TPC-H in one of his bug reports, which I'm using here), and hopefully by tomorrow I will be able to verify/reproduce Bug #116309 and Bug #116241. Thank you once again, and sorry for the delay.

regards,
Umesh
[22 Oct 2024 10:38] MySQL Verification Team
Thank you for being patient with me on this.
I'm not sure how much this patch fixes the issue, but I have observed the behaviour and hope that the development team will take a look at this and decide further on the suggested patch. I'll be attaching the verification results file shortly. Thank you.

Note - the Synopsis says "YPC-H Query 8" but the How-to-repeat/Description uses "Query 16"; do you mind if I change the Synopsis to match? Thank you.

regards,
Umesh
[22 Oct 2024 10:39] MySQL Verification Team
9.1.0 - test results (with and without the suggested patch)

Attachment: 116309.results.txt (text/plain), 18.30 KiB.

[22 Oct 2024 11:32] JINSHENG BA
>I had issues building TPC-H.

Sorry about that! My apologies for relying on Docker. I am happy to provide my entire data folder or any other info if you need it.

>I'm not sure how much this patch fixes the issue, but I have observed the behaviour and hope that the development team will take a look at this and decide further on the suggested patch.

Thanks for looking into it! Yes, I should clarify that this is not proposed as a correct patch. Instead, I want to show that there exists a much more efficient query plan for this TPC-H query, and I am wondering whether the query optimizer can be improved to choose the more efficient plan by default. I believe that, at the least, this would improve execution efficiency on the TPC-H benchmark.