MySQL Bugs: #115313: hash join when have index and push condition to driver table lead to table scan

Bug #115313	hash join when have index and push condition to driver table lead to table scan
Submitted:	13 Jun 2024 9:25	Modified:	14 Jun 2024 9:25
Reporter:	jia liu
Status:	Not a Bug
Category:	MySQL Server: Optimizer	Severity:	S5 (Performance)
Version:	8.0.32	OS:	Any
Assigned to:		CPU Architecture:	Any
Tags:	filtered read_cost eval_cost

Description:
In our production environment, a query chose a poor execution plan that uses a hash join. We found that the read_cost for table 'a' is unreasonably low (0.23) and that filtered is 0.00:

{
  "query_block": {
    "select_id": 1,
    "cost_info": {
      "query_cost": "7.93"
    },
    "nested_loop": [
      {
        "table": {
          "table_name": "x",
          "access_type": "range",
          "possible_keys": [
            "PRIMARY",
            "IDX_AC02_AAC001"
          ],
          "key": "IDX_AC02_AAC001",
          "used_key_parts": [
            "AAC001"
          ],
          "key_length": "9",
          "rows_examined_per_scan": 4,
          "rows_produced_per_join": 0,
          "filtered": "1.25",
          "index_condition": "(`db04`.`x`.`AAC001` in ('7889921','78809900','78089922'))",
          "cost_info": {
            "read_cost": "7.67",
            "eval_cost": "0.01",
            "prefix_cost": "7.68",
            "data_read_per_join": "123"
          },
          "used_columns": [
            "AAZ159",
            "AAC001",
            "AAE140",
            "AAE100"
          ],
          "attached_condition": "((`db04`.`x`.`AAE100` = '1') and (`db04`.`x`.`AAE140` = '180'))"
        }
      },
      {
        "table": {
          "table_name": "a",
          "access_type": "ALL",
          "possible_keys": [
            "IDX_AC97_AAC001",
            "INDEX_AAZ159_AAC001"
          ],
          "rows_examined_per_scan": 9387955,
          "rows_produced_per_join": 0,
          "filtered": "0.00",
          "using_join_buffer": "hash join",
          "cost_info": {
            "read_cost": "0.23",
            "eval_cost": "0.00",
            "prefix_cost": "7.93",
            "data_read_per_join": "11"
          },
          "used_columns": [
            "AAC001",
            "AAZ159"
          ],
          "attached_condition": "((`db04`.`a`.`AAZ159` = `db04`.`x`.`AAZ159`) and (`db04`.`a`.`AAC001` = `db04`.`x`.`AAC001`))"
        }
      }
    ]
  }
}

I think that is why MySQL chose a hash join instead of a ref join.
Forcing the index (FORCE INDEX) shows a larger cost.

I could not reproduce the problem, but I can verify that adding more WHERE conditions drives eval_cost and filtered lower and lower.
At some point, the execution plan may go wrong.

How to repeat:

create table t1 (id serial,data1 varchar(10),data2 varchar(10),data3 varchar(10),data4 varchar (10),data5 varchar(10),data6 varchar(10),data7 varchar(10),data8 varchar(10),data9 varchar(10),data10 varchar(10),key (data1),key (data2));
create table t2 (id serial,data1 varchar(10),data2 varchar(10),data3 varchar(10),data4 varchar (10),data5 varchar(10),data6 varchar(10),data7 varchar(10),data8 varchar(10),data9 varchar(10),data10 varchar(10),key (data2));

insert into t1 select null,'abc','abc','abc','abc','abc','abc','abc','abc','abc','abc';
insert into t1 select null,'abc','abc','abc','abc','abc','abc','abc','abc','abc','abc' from t1;
insert into t1 select null,'abc','abc','abc','abc','abc','abc','abc','abc','abc','abc' from t1;
insert into t1 select null,'abc','abc','abc','abc','abc','abc','abc','abc','abc','abc' from t1;
insert into t1 select null,'abc','abc','abc','abc','abc','abc','abc','abc','abc','abc' from t1;
insert into t1 select null,'abc','abc','abc','abc','abc','abc','abc','abc','abc','abc' from t1;
insert into t1 select null,'abc','abc','abc','abc','abc','abc','abc','abc','abc','abc' from t1;
insert into t1 select null,'abc','abc','abc','abc','abc','abc','abc','abc','abc','abc' from t1;
insert into t1 select null,'abc','abc','abc','abc','abc','abc','abc','abc','abc','abc' from t1;
insert into t1 select null,'abc','abc','abc','abc','abc','abc','abc','abc','abc','abc' from t1;
insert into t1 select null,'abc','abc','abc','abc','abc','abc','abc','abc','abc','abc' from t1;
insert into t2 select null,'abc','abc','abc','abc','abc','abc','abc','abc','abc','abc' from t1;
insert into t2 select null,'abc','abc','abc','abc','abc','abc','abc','abc','abc','abc' from t2;
insert into t2 select null,'abc','abc','abc','abc','abc','abc','abc','abc','abc','abc' from t2;
insert into t2 select null,'abc','abc','abc','abc','abc','abc','abc','abc','abc','abc' from t2;
insert into t2 select null,'abc','abc','abc','abc','abc','abc','abc','abc','abc','abc' from t2;
insert into t2 select null,'abc','abc','abc','abc','abc','abc','abc','abc','abc','abc' from t2;
insert into t2 select null,'abc','abc','abc','abc','abc','abc','abc','abc','abc','abc' from t2;
insert into t2 select null,'abc','abc','abc','abc','abc','abc','abc','abc','abc','abc' from t2;
insert into t2 select null,'abc','abc','abc','abc','abc','abc','abc','abc','abc','abc' from t2;
insert into t1 select null,'1','abc','abc','abc','abc','abc','abc','abc','abc','abc';

explain format=json select count(*) from t1,t2 where t1.data1=t2.data1 and t1.data2=t2.data2 and t1.data1 in ('1','2','3','4','5') and t1.data2='abc' and t1.data3='abc' and t1.data4='abc' and t1.data5='abc' and t1.data6='abc' and t1.data7='abc' and t1.data8='abc' and t1.data9='abc' and t1.data10='abc' \G

explain format=json select count(*) from t1,t2 where t1.data1=t2.data1 and t1.data2=t2.data2 and t1.data1 in ('1','2','3','4','5') and t1.data2='abc' and t1.data3='abc' and t1.data4='abc' and t1.data5='abc' and t1.data6='abc' and t1.data7='abc' and t1.data8='abc' and t1.data9='abc' and t1.data10='abc' and t2.data3='abc' and t2.data4='abc' and t2.data5='abc' and t2.data6='abc' and t2.data7='abc' and t2.data8='abc' and t2.data9='abc' and t2.data10='abc' \G

The first explain shows:
          "filtered": "10.00",
          "index_condition": "(`test`.`t2`.`data2` = `test`.`t1`.`data2`)",
          "cost_info": {
            "read_cost": "44.51",
            "eval_cost": "65.31",
            "prefix_cost": "701.08",
            "data_read_per_join": "270K"

The second explain shows:

          "filtered": "0.00",
          "index_condition": "(`test`.`t2`.`data2` = `test`.`t1`.`data2`)",
          "cost_info": {
            "read_cost": "44.51",
            "eval_cost": "0.00",
            "prefix_cost": "701.08",
            "data_read_per_join": "10"
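
The drop from 10.00 to 0.00 between the two plans is consistent with one selectivity guess being multiplied in for each extra predicate. As a rough sketch, assuming the optimizer falls back to a fixed guess of about 10% for every equality predicate on a column without usable index or histogram statistics, and treats the predicates as independent (both are assumptions, not something this report confirms), the filtered percentage for t2 can be approximated like this:

```sql
-- Assumption: each equality predicate without index or histogram statistics
-- is estimated at 10% selectivity, and the estimates are multiplied together.
SELECT
  POW(0.1, 1) * 100 AS filtered_pct_first_query,   -- only t2.data1 = t1.data1
  POW(0.1, 9) * 100 AS filtered_pct_second_query;  -- join predicate plus 8 constant predicates
```

Under that assumption the first value is about 10, in line with the first plan, and the second is far below 0.01, which would be displayed as 0.00 when rounded to two decimals.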

Suggested fix:
I think that stacking many post-filter WHERE conditions should not lower filtered all the way to zero.
Or, if my guess is wrong, is there a bug that makes hash join ignore read_cost or set it very low? Or does this happen when an index covers both join columns?
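
If the low estimate does come from default selectivity guesses on unindexed columns, one thing that may be worth trying is a histogram, which gives the optimizer measured value distributions for those columns. The following is a sketch against the test tables above; the choice of columns and the bucket count are arbitrary illustrations, not a recommendation from this thread:

```sql
-- Build histograms on a few of the unindexed filter columns of t2.
ANALYZE TABLE t2 UPDATE HISTOGRAM ON data3, data4, data5 WITH 16 BUCKETS;

-- Inspect which histograms are now stored for t2.
SELECT table_name,
       column_name,
       JSON_EXTRACT(histogram, '$."histogram-type"') AS histogram_type
FROM information_schema.column_statistics
WHERE schema_name = DATABASE()
  AND table_name = 't2';
```

Re-running the EXPLAIN statements afterwards shows whether filtered changes. Whether this would change the plan chosen in the production case is not established anywhere in this report.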

Hi Mr. Liu,

Thank you for your bug report.

However, this is not a bug.

This is the best possible manner in which any optimiser could resolve queries like yours.

Here are our results:

*************************** 1. row ***************************
EXPLAIN: -> Aggregate: count(0)  (cost=14.8 rows=1) (actual time=0.348..0.349 rows=1 loops=1)
    -> Nested loop inner join  (cost=13.8 rows=10.2) (actual time=0.346..0.346 rows=0 loops=1)
        -> Filter: ((t1.data2 = 'abc') and (t1.data3 = 'abc') and (t1.data4 = 'abc') and (t1.data5 = 'abc') and (t1.data6 = 'abc') and (t1.data7 = 'abc') and (t1.data8 = 'abc') and (t1.data9 = 'abc') and (t1.data10 = 'abc'))  (cost=3.51 rows=0.05) (actual time=0.345..0.345 rows=0 loops=1)
            -> Index range scan on t1 using data1 over (data1 = '1') OR (data1 = '2') OR (3 more), with index condition: (t1.data1 in ('1','2','3','4','5'))  (cost=3.51 rows=5) (actual time=0.345..0.345 rows=0 loops=1)
        -> Filter: (t2.data1 = t1.data1)  (cost=410 rows=205) (never executed)
            -> Index lookup on t2 using data2 (data2='abc'), with index condition: (t2.data2 = t1.data2)  (cost=410 rows=2048) (never executed)

*************************** 1. row ***************************
EXPLAIN: -> Aggregate: count(0)  (cost=13.8 rows=1) (actual time=0.338..0.338 rows=1 loops=1)
    -> Nested loop inner join  (cost=13.8 rows=0.05) (actual time=0.337..0.337 rows=0 loops=1)
        -> Filter: ((t1.data2 = 'abc') and (t1.data3 = 'abc') and (t1.data4 = 'abc') and (t1.data5 = 'abc') and (t1.data6 = 'abc') and (t1.data7 = 'abc') and (t1.data8 = 'abc') and (t1.data9 = 'abc') and (t1.data10 = 'abc'))  (cost=3.51 rows=0.05) (actual time=0.336..0.336 rows=0 loops=1)
            -> Index range scan on t1 using data1 over (data1 = '1') OR (data1 = '2') OR (3 more), with index condition: (t1.data1 in ('1','2','3','4','5'))  (cost=3.51 rows=5) (actual time=0.336..0.336 rows=0 loops=1)
        -> Filter: ((t2.data1 = t1.data1) and (t2.data3 = 'abc') and (t2.data4 = 'abc') and (t2.data5 = 'abc') and (t2.data6 = 'abc') and (t2.data7 = 'abc') and (t2.data8 = 'abc') and (t2.data9 = 'abc') and (t2.data10 = 'abc'))  (cost=2.75 rows=1) (never executed)
            -> Index lookup on t2 using data2 (data2='abc'), with index condition: (t2.data2 = t1.data2)  (cost=2.75 rows=2048) (never executed)

That shows what an excellent optimiser we have.

Not a bug.

I still cannot reproduce this problem in a test environment, so here is some supplementary information from the production environment, version 8.0.32.

In my opinion, the optimizer thinks table 'a' has a filter that leaves only very few rows to scan.
The condition `a`.aac001 IN ('7889889','78899400','78894917','78894921','78849922') is pushed to table 'x' and is no longer applied to table 'a'.
But in the end a hash join scans table 'a', resulting in a large table scan:

# Query_time: 187.081587  Lock_time: 0.000005 Rows_sent: 1  Rows_examined: 8862344
SET timestamp=1718330279;
SELECT COUNT(*) AS `count(*)` FROM `AC97` AS a,`AC02` AS x WHERE `a`.aac001 = `x`.aac001 AND `a`.aaz159 = `x`.aaz159 AND `x`.aae140 = '180' AND `x`.aae100 = '1' AND `x`.aac008 = '1' AND `a`.aac001 IN ('7889889','78899400','78894917','78894921','78849922');

Here are the execution plan and the optimizer trace result; I hope they provide some useful information.

See the files attached later.
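
For readers who want to collect the same kind of evidence, an optimizer trace can be captured in the session that runs the statement. This is a minimal sketch using the standard optimizer_trace session variables; the memory limit is an arbitrary value picked for illustration:

```sql
SET optimizer_trace = 'enabled=on';
SET optimizer_trace_max_mem_size = 1048576;  -- raise this if the trace comes back truncated

-- Trace the planning of the statement (EXPLAIN avoids executing the slow query).
EXPLAIN FORMAT=JSON
SELECT COUNT(*) AS `count(*)` FROM `AC97` AS a, `AC02` AS x
WHERE `a`.aac001 = `x`.aac001 AND `a`.aaz159 = `x`.aaz159
  AND `x`.aae140 = '180' AND `x`.aae100 = '1' AND `x`.aac008 = '1'
  AND `a`.aac001 IN ('7889889','78899400','78894917','78894921','78849922');

-- Read the trace; a non-zero second column means the trace was cut off.
SELECT TRACE, MISSING_BYTES_BEYOND_MAX_MEM_SIZE
FROM information_schema.OPTIMIZER_TRACE\G

SET optimizer_trace = 'enabled=off';
```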

https://bugs.mysql.com/bug.php?id=97302
I have read this related feature request;
it seems that a hash join can now be chosen even when there is an index.

In my case, there is an index on table 'a', so the read cost is low; that is fine.
The WHERE condition is pushed to the driver table 'x', which is great.
The WHERE condition on table 'a' is eliminated because, for an inner join, it is no longer needed; that is also fine as long as a ref join is used.
But later some mechanism, maybe cost-based, chooses a hash join, which in hindsight makes eliminating the condition on 'a' the wrong move.
Finally, a hash join with a table scan is executed.

This sequence looks very reasonable; I think it is close to what actually happened.

A table scan in a hash join without an index is normal, but here the cost is based on an index range scan, so the two are mismatched somehow.
Alternatively, it would also be fine if the WHERE condition were retained and applied to the driven table.

Please help me. Thank you very much.

In the production environment I also tried:
set session optimizer_switch='hash_join=off';
hint /*+ no_hash_join(a,x)*/

Neither of them stops the optimizer from choosing a hash join for this query.
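
As far as the MySQL 8.0 Reference Manual describes it, the hash_join switch and the HASH_JOIN / NO_HASH_JOIN hints only had an effect in 8.0.18; from 8.0.20 on, hash joins are governed by the block_nested_loop switch and the BNL / NO_BNL hints. On 8.0.32 the following sketch is therefore the variant to try. Whether it leads to the desired ref join for this particular query has not been tested here:

```sql
-- Option 1: disable join buffering (and with it hash join) for the session.
SET SESSION optimizer_switch = 'block_nested_loop=off';

-- Option 2: disable it for this statement only, via a hint on the joined tables.
SELECT /*+ NO_BNL(a, x) */ COUNT(*) AS `count(*)`
FROM `AC97` AS a, `AC02` AS x
WHERE `a`.aac001 = `x`.aac001 AND `a`.aaz159 = `x`.aaz159
  AND `x`.aae140 = '180' AND `x`.aae100 = '1' AND `x`.aac008 = '1'
  AND `a`.aac001 IN ('7889889','78899400','78894917','78894921','78849922');
```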

Hi Mr. Liu,

That feature request is not applicable in your case.

The optimiser plan that is chosen for your query is the best one.

Is it possible to optimise further? Always. However, that logic must be avoided at all costs.

This kind of reasoning has led, and will always lead, to over-optimisation. That has already happened too many times.

Over-optimisation means that a query spends more time in the optimisation phase than in the execution phase.

This is not a road that we shall ever take.

Not a bug.