Bug #107607 left join + semijoin + materialization finds false matches
Submitted: 20 Jun 2022 9:34 Modified: 20 Jun 2022 9:51
Reporter: Guilhem Bichot Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: Optimizer Severity:S2 (Serious)
Version:8.0.29 OS:Ubuntu
Assigned to: CPU Architecture:x86

[20 Jun 2022 9:34] Guilhem Bichot
Description:
It could be duplicate of something in:
https://bugs.mysql.com/search.php?search_for=materialization&status=Active&severity=&limit...
but I'm not sure because, contrary to the bugs above, my query has the subquery in the JOIN ON condition.

Verified in 8.0.29 and 8.0.18.

Please, run the mtr test in "how to repeat".
Results:
SELECT /*+ join_prefix(t1,t2) */ t1.id, t2.id
FROM t1 LEFT JOIN t2 ON t1.id = t2.id2 AND
t2.id IN (SELECT /*+ semijoin(materialization) */ id3 FROM t3 WHERE id4 = 100  AND val = 2) ;
id	id
1	3
SELECT /*+ join_prefix(t1,t2) */ t1.id, t2.id
FROM t1 LEFT JOIN t2 ON t1.id = t2.id2 AND
t2.id IN (SELECT /*+ no_semijoin(materialization) */ id3 FROM t3 WHERE id4 = 100  AND val = 2) ;
id	id
1	NULL

the first result is wrong.

How to repeat:
CREATE TABLE t1 (
  id int NOT NULL
);
INSERT INTO t1 VALUES (1);
CREATE TABLE t2 (
  id int NOT NULL,
  id2 int NOT NULL
);
INSERT INTO t2 VALUES (3,1);
CREATE TABLE t3 (
  id3 int NOT NULL,
  id4 int NOT NULL,
  val int NOT NULL
);
INSERT INTO t3 VALUES (3,100,4);
analyze table t1,t2,t3;
SELECT /*+ join_prefix(t1,t2) */ t1.id, t2.id
FROM t1 LEFT JOIN t2 ON t1.id = t2.id2 AND
 t2.id IN (SELECT /*+ semijoin(materialization) */ id3 FROM t3 WHERE id4 = 100  AND val = 2) ;
SELECT /*+ join_prefix(t1,t2) */ t1.id, t2.id
FROM t1 LEFT JOIN t2 ON t1.id = t2.id2 AND
 t2.id IN (SELECT /*+ no_semijoin(materialization) */ id3 FROM t3 WHERE id4 = 100  AND val = 2) ;
drop table t1,t2,t3;

Suggested fix:
Looking at EXPLAIN of the wrong query:
explain format=tree SELECT /*+ join_prefix(t1,t2) */ t1.id, t2.id
FROM t1 LEFT JOIN t2 ON t1.id = t2.id2 AND
t2.id IN (SELECT /*+ semijoin(materialization) */ id3 FROM t3 WHERE id4 = 100  AND val = 2) ;
EXPLAIN
-> Nested loop left join  (cost=0.70 rows=1)
    -> Table scan on t1  (cost=0.35 rows=1)
    -> Nested loop inner join  (cost=0.55 rows=1)
        -> Filter: (t2.id2 = t1.id)  (cost=0.35 rows=1)
            -> Table scan on t2  (cost=0.35 rows=1)
        -> Single-row index lookup on <subquery2> using <auto_distinct_key> (id3=t2.id)
            -> Materialize with deduplication  (cost=0.45..0.45 rows=1)
                -> Filter: (t3.id3 is not null)  (cost=0.35 rows=1)
                    -> Table scan on t3  (cost=0.35 rows=1)

you can see that "val=2" is nowhere in EXPLAIN. Apparently it has been lost, hence the false matches. There is actually no row with val=2 in t3.
[20 Jun 2022 9:37] MySQL Verification Team
Hello Guilhem,

Thank you for the report and test case.

regards,
Umesh
[20 Jun 2022 9:51] Guilhem Bichot
Made queries a bit more minimal:
SELECT t1.id, t2.id FROM t1 LEFT JOIN t2 ON t2.id IN (SELECT /*+ semijoin(materialization) */ id3 FROM t3 WHERE val = 2) ;
SELECT t1.id, t2.id FROM t1 LEFT JOIN t2 ON t2.id IN (SELECT /*+ no_semijoin(materialization) */ id3 FROM t3 WHERE val = 2) ;