Bug #113718 left join + semijoin + materialization finds false matches
Submitted: 23 Jan 2024 4:42 Modified: 23 Jan 2024 7:19
Reporter: Zeng Zihao Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: Optimizer Severity:S2 (Serious)
Version:8.0.35, 8.0.36, 8.3.0 OS:Any
Assigned to: CPU Architecture:Any

[23 Jan 2024 4:42] Zeng Zihao
Description:
left join + semijoin + materialization finds false matches. 8.0.31 try to fix this bug but only consider the derieved table cases.https://github.com/mysql/mysql-server/commit/8e7258b49827cd9c7be6f32bd63375df0637e731. 

The query looks like this:
SELECT FROM t1 LEFT JOIN t2 ON t2.id IN (SELECT t3.id FROM t3 WHERE COND)

yielding a join tree of:

t1 LEFT JOIN (t2 SEMIJOIN t3) ON t2.id=t3.id AND COND

then, semijoin materialization is chosen:
t1 LEFT JOIN (t2 SEMIJOIN mat-tmp) ON t2.id=mat-tmp.id AND COND

then JOIN::attach_join_conditions() tries to find where COND now is,
to push it down into the definition of mat-tmp, as COND should be used
to eliminate some rows of t3 before they are inserted into mat-tmp.
The bug is that it expects to find COND attached to mat-tmp, and it is
not always the case. Here COND is part of 'ON t2.id=mat-tmp.id AND
COND', which is the condition of the LEFT JOIN and is thus attached to
the first table on the right side of the LEFT JOIN: t2.

How to repeat:
Test case:
CREATE TABLE t1 (
  id int NOT NULL
);
INSERT INTO t1 VALUES (1);
CREATE TABLE t2 (
  id int NOT NULL,
  id2 int NOT NULL
);
INSERT INTO t2 VALUES (3,1);
CREATE TABLE t3 (
  id3 int NOT NULL,
  id4 int NOT NULL,
  val int NOT NULL
);
INSERT INTO t3 VALUES (3,100,4);
analyze table t1,t2,t3;
 
SELECT /*+ join_prefix(t1,t2) */ t1.id, t2.id
FROM t1 LEFT JOIN t2 ON t1.id = t2.id2 AND
 t2.id IN (SELECT /*+ semijoin(materialization) */ id3 FROM t3 WHERE id4 = 100  AND val = 2) ;

The result should be '1 NULL' rather than '1 3'.

Suggested fix:
This patch from 8.0.31 only consider the derived table case. Therefore, it results the problem. https://github.com/mysql/mysql-server/commit/8e7258b49827cd9c7be6f32bd63375df0637e731. 

The problem is in sql_optimizer.cc. I think it is unnecessary to check the condition tr->embedding->is_derived() and the patch should be

      Item *join_cond = best_ref[mat_tbl]->join_cond();
      Table_ref *tr = best_ref[mat_tbl]->table_ref;
      while (join_cond == nullptr && tr->embedding != nullptr) {
        // If subquery table(s) come from a derived table
        join_cond = tr->embedding->join_cond();
        tr = tr->embedding;
      }
[23 Jan 2024 7:19] MySQL Verification Team
Hello Zeng Zihao,

Thank you for the report and test case.
Verified as described.

regards,
Umesh