Bug #120457 Hypergraph optimizer returns wrong results for LEFT JOIN LATERAL with SEMI JOIN when derived_merge
Submitted: 12 May 15:03 Modified: 17 May 23:17
Reporter: ZHIQIANG SHI Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: Optimizer Severity:S2 (Serious)
Version:9.7 OS:CentOS
Assigned to: CPU Architecture:x86
Tags: hypergraph semijoin

[12 May 15:03] ZHIQIANG SHI
Description:
This appears to be an interaction between:

1. semijoin transformation of the EXISTS subquery inside the lateral derived table, 
2. derived table merge,
3. outer join condition handling in the hypergraph optimizer.

Before derived table merge, the EXISTS subquery is in the WHERE clause of the lateral derived table, so it is eligible for semijoin transformation.

During semijoin transformation, the semijoin condition built from the correlated subquery is not stored as the join condition of the semijoin nest itself for a normal semijoin. MySQL builds sj_cond, but injects it into the parent query block condition: into the parent WHERE condition if the subquery was in WHERE, or into the parent join nest's ON condition if the subquery was in an ON clause. Thus the semijoin nest itself may have no join condition.

After derived_merge, the derived table's WHERE condition is merged into the outer LEFT JOIN condition. The resulting join tree is equivalent to:

t1 LEFT JOIN (t2 SEMIJOIN t3 ON true)  
ON t1.a = t2.a AND t1.b = t3.a

The traditional optimizer can handle the merged shape correctly because it later splits outer join conditions into guarded table conditions.

In the hypergraph optimizer, however, the generated tree has the condition t1.b = t3.a on the outer LEFT_JOIN, while the semijoin side itself has no condition. As a result, the semijoin may be planned as if t3 only needs to produce any row, instead of evaluating the correlated EXISTS predicate for each outer t1 row.

How to repeat:
DROP TABLE IF EXISTS t1;
DROP TABLE IF EXISTS t2;
DROP TABLE IF EXISTS t3;

CREATE TABLE t1 (
  a INT,
  b INT
);

CREATE TABLE t2 (
  a INT,
  b INT
);

CREATE TABLE t3 (
  a INT
);

INSERT INTO t1 VALUES
  (1, 10),
  (2, 20),
  (3, 30),
  (4, NULL),
  (5, 10),
  (6, 20),
  (7, 10),
  (8, 20);

INSERT INTO t2 VALUES
  (1, 100),
  (2, 200),
  (5, 500),
  (6, 600),
  (7, 700),
  (8, 800);

INSERT INTO t3 VALUES
  (10),
  (20);

## Correct result with derived_merge disabled
SET optimizer_switch='hypergraph_optimizer=on,derived_merge=off';

EXPLAIN FORMAT=TREE
SELECT
  t1.a,
  t1.b,
  dt.a,
  dt.b
FROM t1
LEFT JOIN LATERAL (
  SELECT t2.a, t2.b
  FROM t2
  WHERE t1.a = t2.a
    AND EXISTS (
      SELECT 1
      FROM t3
      WHERE t3.a = t1.b
    )
) AS dt ON TRUE
ORDER BY t1.a\G

*************************** 1. row ***************************
EXPLAIN: -> Nested loop left join  (cost=10..35.5 rows=8)
    -> Sort: t1.a  (cost=6.41..6.41 rows=8)
        -> Table scan on t1  (cost=0.359..2.88 rows=8)
    -> Filter: true  (cost=3.63..3.63 rows=0.14)
        -> Stream results  (cost=3.62..3.62 rows=0.14)
            -> Nested loop inner join (FirstMatch)  (cost=3.62..3.62 rows=0.14)
                -> Limit: 1 row(s)  (cost=0.705..0.705 rows=0.2)
                    -> Filter: (t1.b = t3.a)  (cost=0.705..0.705 rows=0.2)
                        -> Table scan on t3  (cost=0.295..0.59 rows=2)
                -> Filter: (t1.a = t2.a)  (cost=2.92..2.92 rows=0.7)
                    -> Table scan on t2  (cost=0.359..2.52 rows=7)

1 row in set, 2 warnings (0.001 sec)

SELECT
  t1.a,
  t1.b,
  dt.a,
  dt.b
FROM t1
LEFT JOIN LATERAL (
  SELECT t2.a, t2.b
  FROM t2
  WHERE t1.a = t2.a
    AND EXISTS (
      SELECT 1
      FROM t3
      WHERE t3.a = t1.b
    )
) AS dt ON TRUE
ORDER BY t1.a;

+------+------+------+------+
| a    | b    | a    | b    |
+------+------+------+------+
|    1 |   10 |    1 |  100 |
|    2 |   20 |    2 |  200 |
|    3 |   30 | NULL | NULL |
|    4 | NULL | NULL | NULL |
|    5 |   10 |    5 |  500 |
|    6 |   20 |    6 |  600 |
|    7 |   10 |    7 |  700 |
|    8 |   20 |    8 |  800 |
+------+------+------+------+
8 rows in set (0.002 sec)

## Wrong result with derived_merge enabled

SET optimizer_switch='hypergraph_optimizer=on,derived_merge=on';

EXPLAIN FORMAT=TREE
SELECT
  t1.a,
  t1.b,
  dt.a,
  dt.b
FROM t1
LEFT JOIN LATERAL (
  SELECT t2.a, t2.b
  FROM t2
  WHERE t1.a = t2.a
    AND EXISTS (
      SELECT 1
      FROM t3
      WHERE t3.a = t1.b
    )
) AS dt ON TRUE
ORDER BY t1.a\G

*************************** 1. row ***************************
EXPLAIN: -> Sort: t1.a  (cost=12.8..12.8 rows=8)
    -> Left hash join (t1.a = t2.a), (t1.b = t3.a)  (cost=5.68..9.3 rows=8)
        -> Table scan on t1  (cost=0.359..2.88 rows=8)
        -> Hash
            -> Nested loop inner join (FirstMatch)  (cost=0.402..2.81 rows=7)
                -> Limit: 1 row(s)  (cost=0.295..0.295 rows=1)
                    -> Table scan on t3  (cost=0.295..0.59 rows=2)
                -> Table scan on t2  (cost=0.359..2.52 rows=7)

1 row in set, 2 warnings (0.001 sec)

SELECT
  t1.a,
  t1.b,
  dt.a,
  dt.b
FROM t1
LEFT JOIN LATERAL (
  SELECT t2.a, t2.b
  FROM t2
  WHERE t1.a = t2.a
    AND EXISTS (
      SELECT 1
      FROM t3
      WHERE t3.a = t1.b
    )
) AS dt ON TRUE
ORDER BY t1.a;

+------+------+------+------+
| a    | b    | a    | b    |
+------+------+------+------+
|    1 |   10 |    1 |  100 |
|    2 |   20 | NULL | NULL |
|    3 |   30 | NULL | NULL |
|    4 | NULL | NULL | NULL |
|    5 |   10 |    5 |  500 |
|    6 |   20 | NULL | NULL |
|    7 |   10 |    7 |  700 |
|    8 |   20 | NULL | NULL |
+------+------+------+------+
8 rows in set (0.001 sec)

Rows with `t1.b = 20` incorrectly lose their lateral match, even though `t3` contains `20` and matching rows exist in `t2`.

Suggested fix:
A conservative fix would be to avoid derived_merge for this unsafe combination under the hypergraph optimizer.
For example, reject merging a derived table when all of the following are true:
- the hypergraph optimizer is used,
- the derived table is on the inner side of an outer join,
- the derived table has lateral dependencies or references outer tables,
- the derived query block contains semijoin nests produced by subquery flattening.
- the semijoin nest depends on the outer tables.

However, within merge_derived(), it does not appear easy to determine "which semijoin nest depends on the outer table."
[17 May 23:17] Roy Lyseng
Thank you for the bug report.
Verified as described.