Bug #115313 | hash join when have index and push condition to driver table lead to table scan | ||
---|---|---|---|
Submitted: | 13 Jun 2024 9:25 | Modified: | 14 Jun 2024 9:25 |
Reporter: | jia liu | Email Updates: | |
Status: | Not a Bug | Impact on me: | |
Category: | MySQL Server: Optimizer | Severity: | S5 (Performance) |
Version: | 8.0.32 | OS: | Any |
Assigned to: | CPU Architecture: | Any | |
Tags: | filtered read_cost eval_cost |
[13 Jun 2024 9:25]
jia liu
[13 Jun 2024 10:36]
MySQL Verification Team
Hi Mr. Liu, Thank you for your bug report. However, this is not a bug. This is the best possible manner in which any optimiser could resolve queries like yours. Here are our results: *************************** 1. row *************************** EXPLAIN: -> Aggregate: count(0) (cost=14.8 rows=1) (actual time=0.348..0.349 rows=1 loops=1) -> Nested loop inner join (cost=13.8 rows=10.2) (actual time=0.346..0.346 rows=0 loops=1) -> Filter: ((t1.data2 = 'abc') and (t1.data3 = 'abc') and (t1.data4 = 'abc') and (t1.data5 = 'abc') and (t1.data6 = 'abc') and (t1.data7 = 'abc') and (t1.data8 = 'abc') and (t1.data9 = 'abc') and (t1.data10 = 'abc')) (cost=3.51 rows=0.05) (actual time=0.345..0.345 rows=0 loops=1) -> Index range scan on t1 using data1 over (data1 = '1') OR (data1 = '2') OR (3 more), with index condition: (t1.data1 in ('1','2','3','4','5')) (cost=3.51 rows=5) (actual time=0.345..0.345 rows=0 loops=1) -> Filter: (t2.data1 = t1.data1) (cost=410 rows=205) (never executed) -> Index lookup on t2 using data2 (data2='abc'), with index condition: (t2.data2 = t1.data2) (cost=410 rows=2048) (never executed) *************************** 1. row *************************** EXPLAIN: -> Aggregate: count(0) (cost=13.8 rows=1) (actual time=0.338..0.338 rows=1 loops=1) -> Nested loop inner join (cost=13.8 rows=0.05) (actual time=0.337..0.337 rows=0 loops=1) -> Filter: ((t1.data2 = 'abc') and (t1.data3 = 'abc') and (t1.data4 = 'abc') and (t1.data5 = 'abc') and (t1.data6 = 'abc') and (t1.data7 = 'abc') and (t1.data8 = 'abc') and (t1.data9 = 'abc') and (t1.data10 = 'abc')) (cost=3.51 rows=0.05) (actual time=0.336..0.336 rows=0 loops=1) -> Index range scan on t1 using data1 over (data1 = '1') OR (data1 = '2') OR (3 more), with index condition: (t1.data1 in ('1','2','3','4','5')) (cost=3.51 rows=5) (actual time=0.336..0.336 rows=0 loops=1) -> Filter: ((t2.data1 = t1.data1) and (t2.data3 = 'abc') and (t2.data4 = 'abc') and (t2.data5 = 'abc') and (t2.data6 = 'abc') and (t2.data7 = 'abc') and (t2.data8 = 'abc') and (t2.data9 = 'abc') and (t2.data10 = 'abc')) (cost=2.75 rows=1) (never executed) -> Index lookup on t2 using data2 (data2='abc'), with index condition: (t2.data2 = t1.data2) (cost=2.75 rows=2048) (never executed) That shows what an excellent optimiser we have. Not a bug.
[14 Jun 2024 3:11]
jia liu
I still cannot reproduce this problem in a test environment, so here is some supplementary information from the production environment, version 8.0.32 From my opinion, while optimizer thinks table 'a' has an filter, makes only very few rows to scan, and this condition `a`.aac001 IN ('7889889','78899400','78894917','78894921','78849922') is pushed to table 'x', and not apply to table 'a' anymore. but finally a hash join scans table a resulting a large table scan: # Query_time: 187.081587 Lock_time: 0.000005 Rows_sent: 1 Rows_examined: 8862344 SET timestamp=1718330279; SELECT COUNT(*) AS `count(*)` FROM `AC97` AS a,`AC02` AS x WHERE `a`.aac001 = `x`.aac001 AND `a`.aaz159 = `x`.aaz159 AND `x`.aae140 = '180' AND `x`.aae100 = '1' AND `x`.aac008 = '1' AND `a`.aac001 IN ('7889889','78899400','78894917','78894921','78849922'); And here is the execution plan and the optimizer trace result, hope that will provide some information. see attached files later.
[14 Jun 2024 9:19]
jia liu
https://bugs.mysql.com/bug.php?id=97302 I have read this related feature request, seems that hash join could been choose when there is index now. In my case, there is index on table 'a', so the read cost is low, that is fine. The where condition is pushed to driver table 'x', which is grate. Where conditions on table 'a' is eliminated, because it is not need, is an inner join, it is also fine if a ref join will be used. But later some mechanism choose to use hash join, may be cost base, which will made the eliminated condition on 'a' become a improper move in return. Finally hash join with a table scan execued. This sequence looks like very reasonable, I think it is really close to what happened. A table scan in hash join without index is normal, but the cost is based on index range scan, it is mismatched somehow. Or the where condition is reserved and take effect on the drivened table, also will be fine. Please help with me, thank you very much.
[14 Jun 2024 9:25]
jia liu
And in production environment set session optimizer_switch='hash_join=off'; hint /*+ no_hash_join(a,x)*/ both not working to stop optimizer to choose hash join for this SQL.
[14 Jun 2024 9:33]
MySQL Verification Team
Hi Mr. liu, That feature request is not applicable in your case. The optimiser plan that is chosen for your query is the best one. Is it possible to optimise further ??? Always. However, that logic must be avoided at all costs. This kind of reasoning has lead and will always lead to the over-optimisation. That has already happened too many times. Over-optimisation means that a query spends more time in the optimisation phase , then in the execution phase. This is not a road that we shall ever take. Not a bug.