Bug #105308 Extremely bad plans for star schema using hash joins
Submitted: 23 Oct 2021 0:59 Modified: 25 Oct 2021 17:06
Reporter: Justin Swanhart Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: Optimizer Severity:S2 (Serious)
Version:8.0.27 OS:Any
Assigned to: CPU Architecture:Any

[23 Oct 2021 0:59] Justin Swanhart
Description:
This bug uses the data set from bug #100471
  https://bugs.mysql.com/bug.php?id=100471

I have created all of the tables without ANY indexes.  When I use a STRAIGHT_JOIN hint queries can complete, but when not used, a cartesian product that results in at least 12000000000000 tuples is produced.  Obviously, the query will never complete.

I tried adding histograms to all columns used for filtering rows, but it didn't change the plan.

How to repeat:
-- query with straight_join 
mysql> explain format=tree select straight_join count(*) from lineorder join dim_date on lo_orderdatekey = d_datekey join customer   on lo_custkey = c_customerkey join supplier   on lo_suppkey = s_suppkey join p
art   on lo_partkey = p_partkey where c_region = 'AMERICA' and s_region = 'AMERICA' and (p_mfgr = 'MFGR#1' or p_mfgr = 'MFGR#2') \G
*************************** 1. row ***************************
EXPLAIN: -> Aggregate: count(0)  (cost=432316985503.91 rows=63771266720)
    -> Inner hash join (part.P_PartKey = lineorder.LO_PartKey)  (cost=425939858831.90 rows=63771266720)
        -> Filter: ((part.P_MFGR = 'MFGR#1') or (part.P_MFGR = 'MFGR#2'))  (cost=0.00 rows=3775)
            -> Table scan on part  (cost=0.00 rows=198707)
        -> Hash
            -> Inner hash join (supplier.S_SuppKey = lineorder.LO_SuppKey)  (cost=54842174200.98 rows=889006023)
                -> Filter: (supplier.S_Region = 'AMERICA')  (cost=0.00 rows=20)
                    -> Table scan on supplier  (cost=0.00 rows=2000)
                -> Hash
                    -> Inner hash join (customer.C_CustomerKey = lineorder.LO_CustKey)  (cost=45940336192.60 rows=4445029484)
                        -> Filter: (customer.C_Region = 'AMERICA')  (cost=0.00 rows=298)
                            -> Table scan on customer  (cost=0.00 rows=29847)
                        -> Hash
                            -> Inner hash join (dim_date.D_DateKey = lineorder.LO_OrderDateKey)  (cost=1489865748.77 rows=1489271570)
                                -> Table scan on dim_date  (cost=0.00 rows=2556)
                                -> Hash
                                    -> Table scan on lineorder  (cost=594195.30 rows=5826571)

1 row in set (0.00 sec)

-- Query without STRAIGHT_JOIN produces a cartesian product:
mysql> explain format=tree select count(*) from lineorder join dim_date on lo_orderdatekey = d_datekey join customer   on lo_custkey = c_customerkey join supplier   on lo_suppkey = s_suppkey join part   on lo_partkey = p_partkey where c_region = 'AMERICA' and s_region = 'AMERICA' and (p_mfgr = 'MFGR#1' or p_mfgr = 'MFGR#2') \G
*************************** 1. row ***************************
EXPLAIN: -> Aggregate: count(0)  (cost=430951444293251.10 rows=63771244091939)
    -> Inner hash join (part.P_PartKey = lineorder.LO_PartKey)  (cost=424574319884057.20 rows=63771244091939)
        -> Filter: ((part.P_MFGR = 'MFGR#1') or (part.P_MFGR = 'MFGR#2'))  (cost=0.00 rows=3775)
            -> Table scan on part  (cost=0.00 rows=198707)
        -> Hash
            -> Inner hash join (lineorder.LO_OrderDateKey = dim_date.D_DateKey), (lineorder.LO_CustKey = customer.C_CustomerKey), (lineorder.LO_SuppKey = supplier.S_SuppKey)  (cost=88900722871240.90 rows=889005707119)
                -> Table scan on lineorder  (cost=0.00 rows=5826571)
                -> Hash
                    -> Inner hash join (no condition)  (cost=152632103.00 rows=152577848)
                        -> Filter: (customer.C_Region = 'AMERICA')  (cost=0.01 rows=2985)
                            -> Table scan on customer  (cost=0.01 rows=29847)
                        -> Hash
                            -> Inner hash join (no condition)  (cost=51329.75 rows=511200)
                                -> Table scan on dim_date  (cost=1.30 rows=2556)
                                -> Hash
                                    -> Filter: (supplier.S_Region = 'AMERICA')  (cost=204.75 rows=200)
                                        -> Table scan on supplier  (cost=204.75 rows=2000)

1 row in set (0.00 sec)

-- query with STRAIGHT JOIN completes in 4s
mysql> explain analyze select straight_join count(*) from lineorder join dim_date on lo_orderdatekey = d_datekey join customer   on lo_custkey = c_customerkey join supplier   on lo_suppkey = s_suppkey join part
  on lo_partkey = p_partkey where c_region = 'AMERICA' and s_region = 'AMERICA' and (p_mfgr = 'MFGR#1' or p_mfgr = 'MFGR#2') \G
*************************** 1. row ***************************
EXPLAIN: -> Aggregate: count(0)  (cost=432316985503.91 rows=63771266720) (actual time=4251.436..4251.437 rows=1 loops=1)
    -> Inner hash join (part.P_PartKey = lineorder.LO_PartKey)  (cost=425939858831.90 rows=63771266720) (actual time=4101.096..4247.670 rows=89953 loops=1)
        -> Filter: ((part.P_MFGR = 'MFGR#1') or (part.P_MFGR = 'MFGR#2'))  (cost=0.00 rows=3775) (actual time=0.013..121.797 rows=80045 loops=1)
            -> Table scan on part  (cost=0.00 rows=198707) (actual time=0.011..80.059 rows=200000 loops=1)
        -> Hash
            -> Inner hash join (supplier.S_SuppKey = lineorder.LO_SuppKey)  (cost=54842174200.98 rows=889006023) (actual time=4019.712..4050.509 rows=224890 loops=1)
                -> Filter: (supplier.S_Region = 'AMERICA')  (cost=0.00 rows=20) (actual time=0.010..1.129 rows=378 loops=1)
                    -> Table scan on supplier  (cost=0.00 rows=2000) (actual time=0.009..0.884 rows=2000 loops=1)
                -> Hash
                    -> Inner hash join (customer.C_CustomerKey = lineorder.LO_CustKey)  (cost=45940336192.60 rows=4445029484) (actual time=3774.297..3884.522 rows=1192697 loops=1)
                        -> Filter: (customer.C_Region = 'AMERICA')  (cost=0.00 rows=298) (actual time=0.017..14.493 rows=5992 loops=1)
                            -> Table scan on customer  (cost=0.00 rows=29847) (actual time=0.012..11.675 rows=30000 loops=1)
                        -> Hash
                            -> Inner hash join (dim_date.D_DateKey = lineorder.LO_OrderDateKey)  (cost=1489865748.77 rows=1489271570) (actual time=2778.480..3188.283 rows=6001215 loops=1)
                                -> Table scan on dim_date  (cost=0.00 rows=2556) (actual time=0.010..1.859 rows=2556 loops=1)
                                -> Hash
                                    -> Table scan on lineorder  (cost=594195.30 rows=5826571) (actual time=0.015..2256.406 rows=6001215 loops=1)

1 row in set (4.25 sec)

-- will never complete (cancelled after 17 seconds)
mysql> explain analyze select count(*) from lineorder join dim_date on lo_orderdatekey = d_datekey join customer   on lo_custkey = c_customerkey join supplier   on lo_suppkey = s_suppkey join part   on lo_partkey = p_partkey where c_region = 'AMERICA' and s_region = 'AMERICA' and (p_mfgr = 'MFGR#1' or p_mfgr = 'MFGR#2') \G
^C^C -- query aborted
1 row in set, 1 warning (16.58 sec)

-- with histograms on all filter and join columns
mysql> explain format=tree select count(*) from lineorder join dim_date on lo_orderdatekey = d_datekey join customer   on lo_custkey = c_customerkey join supplier   on lo_suppkey = s_suppkey join part   on lo_pa
rtkey = p_partkey where c_region = 'AMERICA' and s_region = 'AMERICA' and (p_mfgr = 'MFGR#1' or p_mfgr = 'MFGR#2') \G
*************************** 1. row ***************************
EXPLAIN: -> Aggregate: count(0)  (cost=5409085610118571.00 rows=1649260121223893)
    -> Inner hash join (part.P_PartKey = lineorder.LO_PartKey)  (cost=5244159597996182.00 rows=1649260121223893)
        -> Filter: ((part.P_MFGR = 'MFGR#1') or (part.P_MFGR = 'MFGR#2'))  (cost=0.00 rows=7126)
            -> Table scan on part  (cost=0.00 rows=198707)
        -> Hash
            -> Inner hash join (lineorder.LO_OrderDateKey = dim_date.D_DateKey), (lineorder.LO_CustKey = customer.C_CustomerKey), (lineorder.LO_SuppKey = supplier.S_SuppKey)  (cost=645320072529682.10 rows=6453195129827)
                -> Table scan on lineorder  (cost=0.00 rows=5609451)
                -> Hash
                    -> Inner hash join (no condition)  (cost=576074868.69 rows=1150414688)
                        -> Filter: (customer.C_Region = 'AMERICA')  (cost=0.00 rows=5961)
                            -> Table scan on customer  (cost=0.00 rows=29847)
                        -> Hash
                            -> Inner hash join (no condition)  (cost=96826.55 rows=966168)
                                -> Table scan on dim_date  (cost=0.69 rows=2556)
                                -> Hash
                                    -> Filter: (supplier.S_Region = 'AMERICA')  (cost=204.75 rows=378)
                                        -> Table scan on supplier  (cost=204.75 rows=2000)
[23 Oct 2021 1:05] Justin Swanhart
Fix synopsis
[23 Oct 2021 1:08] Justin Swanhart
Full schema (only lineorder is included in original bug):
mysql> show create table lineorder\G
*************************** 1. row ***************************
       Table: lineorder
Create Table: CREATE TABLE `lineorder` (
  `LO_OrderKey` bigint NOT NULL,
  `LO_LineNumber` tinyint NOT NULL,
  `LO_CustKey` int NOT NULL,
  `LO_PartKey` int NOT NULL,
  `LO_SuppKey` int NOT NULL,
  `LO_OrderDateKey` int NOT NULL,
  `LO_OrderPriority` varchar(15) DEFAULT NULL,
  `LO_ShipPriority` char(1) DEFAULT NULL,
  `LO_Quantity` tinyint DEFAULT NULL,
  `LO_ExtendedPrice` int DEFAULT NULL,
  `LO_OrdTotalPrice` int DEFAULT NULL,
  `LO_Discount` int DEFAULT NULL,
  `LO_Revenue` int DEFAULT NULL,
  `LO_SupplyCost` int DEFAULT NULL,
  `LO_Tax` tinyint DEFAULT NULL,
  `LO_CommitDateKey` int NOT NULL,
  `LO_ShipMode` varchar(10) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
1 row in set (0.00 sec)

mysql> show create table dim_date\G
*************************** 1. row ***************************
       Table: dim_date
Create Table: CREATE TABLE `dim_date` (
  `D_DateKey` int DEFAULT NULL,
  `D_Date` char(18) DEFAULT NULL,
  `D_DayOfWeek` char(9) DEFAULT NULL,
  `D_Month` char(9) DEFAULT NULL,
  `D_Year` smallint DEFAULT NULL,
  `D_YearMonthNum` int DEFAULT NULL,
  `D_YearMonth` char(7) DEFAULT NULL,
  `D_DayNumInWeek` tinyint DEFAULT NULL,
  `D_DayNumInMonth` tinyint DEFAULT NULL,
  `D_DayNumInYear` smallint DEFAULT NULL,
  `D_MonthNumInYear` tinyint DEFAULT NULL,
  `D_WeekNumInYear` tinyint DEFAULT NULL,
  `D_SellingSeason` char(12) DEFAULT NULL,
  `D_LastDayInWeekFl` tinyint DEFAULT NULL,
  `D_LastDayInMonthFl` tinyint DEFAULT NULL,
  `D_HolidayFl` tinyint DEFAULT NULL,
  `D_WeekDayFl` tinyint DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
1 row in set (0.00 sec)

mysql> show create table supplier\G
*************************** 1. row ***************************
       Table: supplier
Create Table: CREATE TABLE `supplier` (
  `S_SuppKey` int DEFAULT NULL,
  `S_Name` char(25) DEFAULT NULL,
  `S_Address` varchar(25) DEFAULT NULL,
  `S_City` char(10) DEFAULT NULL,
  `S_Nation` char(15) DEFAULT NULL,
  `S_Region` char(12) DEFAULT NULL,
  `S_Phone` char(15) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
1 row in set (0.00 sec)

mysql> show create table customer\G
*************************** 1. row ***************************
       Table: customer
Create Table: CREATE TABLE `customer` (
  `C_CustomerKey` int DEFAULT NULL,
  `C_Name` varchar(25) DEFAULT NULL,
  `C_Address` varchar(25) DEFAULT NULL,
  `C_City` varchar(10) DEFAULT NULL,
  `C_Nation` varchar(15) DEFAULT NULL,
  `C_Region` varchar(12) DEFAULT NULL,
  `C_Phone` varchar(15) DEFAULT NULL,
  `C_MktSegment` varchar(10) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
1 row in set (0.00 sec)

mysql> show create table part\G
*************************** 1. row ***************************
       Table: part
Create Table: CREATE TABLE `part` (
  `P_PartKey` int DEFAULT NULL,
  `P_Name` varchar(25) DEFAULT NULL,
  `P_MFGR` varchar(10) DEFAULT NULL,
  `P_Category` varchar(10) DEFAULT NULL,
  `P_Brand` varchar(15) DEFAULT NULL,
  `P_Colour` varchar(15) DEFAULT NULL,
  `P_Type` varchar(25) DEFAULT NULL,
  `P_Size` tinyint DEFAULT NULL,
  `P_Container` char(10) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
1 row in set (0.00 sec)
[23 Oct 2021 2:03] Justin Swanhart
Here is the optimizer trace:
https://gist.github.com/greenlion/979e0b45d121c504a31a1ce35f67a85e
[23 Oct 2021 13:09] Justin Swanhart
It looks to me like cost estimation is always table1:rowcount * table2:rowcount but in a star schema, all of the joins are equijoins that can only produce a single row.  I had hoped adding histograms would allow the optimizer to realize that the column was unique.  

Ideally, the optimizer should recognize a star schema and switch the optimizer to set the cost estimate to table1:rowcount
[23 Oct 2021 14:01] Justin Swanhart
Also the cost of the scan on lineorder is 0?  That seems odd to me, but I do not know if it is.
[25 Oct 2021 13:10] MySQL Verification Team
Hi Mr. Swanhart,

Thank you for your bug report.

First, let us ask you one question, which you can reply if you already know the answer. Have you noticed the same behaviour in our earlier releases of 8.0 ???? If you have not tested it with previous releases, then ignore this question.

In order to process your report, we must be able to repeat the behaviour that you are seeing. Hence, please provide us with a sufficient number of rows so that we can get the same outcome.

Many thanks, in advance !!!!!!
[25 Oct 2021 13:14] MySQL Verification Team
HI,

In addition to what we asked for above, you should also provide us with the ANALYSE statements, so that we could repeat what you tried to achieve with histograms ......
[25 Oct 2021 13:34] Justin Swanhart
Hi,

This problem exists in all versions of 8.0 since hash join was introduced.  It is possible that BNL join produced the same plan, but I would never try to execute the SSB using BNL so I don't know.

This is a 6GB schema which is generated by the SSB (star schema benchmark) data generator, and since it is synthetically generated by a program, it does not make sense to upload it.  

The generator is in the link in the bug report to the prior bug.  It was successfully used to verify that bug, so I imagine it can be used to verify this one too.

As for histograms, I used:
ANALYZE TABLE supplier UPDATE HISTOGRAM FROM S_SuppKey, S_Region;
ANALYZE TABLE customer UPDATE HISTOGRAM FROM C_CustomerKey, C_Region;
ANALYZE TABLE part UPDATE HISTOGRAM FROM P_PartKey, P_Mfgr;
ANALYZE TABLE lineorder UPDATE HISTOGRAM FROM LO_PartKey, LO_CustKey, LO_SuppKey, LO_OrderDateKey;
[25 Oct 2021 13:38] MySQL Verification Team
Thank you Mr. Swanhart,

Generator is truly there .......

We were able to repeat the behaviour on 8.0.27.

We are not able to inform you on the plans of optimising further the star join schema, so we can not yet let you know whether this will finish as a bug or as a feature request. For the moment, it is verifies as a bug of the severity 2.

Verified as reported.
[25 Oct 2021 13:48] Justin Swanhart
I omitted the dim_date D_DateKey histogram.  Note that I tried with the default 100 buckets and also with 1024 buckets for all the histograms, but since the plan was the same, I am not sure which one I pasted the results for.
[25 Oct 2021 14:07] Justin Swanhart
I think fixing the prior bug would fix this problem for me, because exact row estimates would be provided by more storage engine, which actually does a parallel hash join of all the rows internally, and then MySQL just builds rows using hash memory lookups after the first table is scanned.

As it is, I have implemented a rewrite plugin which properly orders tables and rewrites all queries to STRAIGHT_JOIN. I believe the proper reaction to that is *sigh* :)
[25 Oct 2021 14:10] Justin Swanhart
sorry, *more* storage engine should read, *my* storage engine.
[25 Oct 2021 16:45] Justin Swanhart
Perhaps something like this (or something in the SQL standard):
ALTER TABLE [TABLE] [ADD|UPDATE] HEURISTIC UNIQUE FROM [COLUMN,...];

That would allow one to apply optimizer heuristics to columns that are hard to discover without using a lot of memory, and are hard to verify are correct. Many databases do not enforece constraints (RedShift is an example) that allow the optimizer to produce good plans that will unfortunately produce wrong results if the constraints are provided for non-conforming data in the table.

The optimizer would produce a correct number of output rows if it knows about uniqueness in the dimension tables.
[25 Oct 2021 16:53] Justin Swanhart
This would not require any parser token additions:
ALTER TABLE [TABLE] ADD CONSTRAINT ... VIRTUAL;

A virtual constraint would provide information to the optimizer but would not be enforced by the storage engine.  

That unfortunately would require all storage engines to support constraints, or moving constraints to the SE level.
[25 Oct 2021 17:06] Justin Swanhart
sorry again.  I meant moving some constraints from the SE to the SQL engine, specifically the constraints would only be considered by the optimizer.  If you tried to add a non-virtual constraint to a table that doesn't support indexes, then an error would be produced by the storage engine.
[26 Oct 2021 12:15] MySQL Verification Team
Mr. Swanhart,

Thank you for your contributions !!!