Bug #120296 Materialized Correlated EXISTS Loses Implicit VARCHAR-to-YEAR Coercion,Causing Plan-Dependent Result Divergence
Submitted: 20 Apr 7:50 Modified: 20 Apr 9:24
Reporter: Y F Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: Optimizer Severity:S2 (Serious)
Version:8.0.45 OS:Ubuntu (Ubuntu 20.04.4 LTS)
Assigned to: CPU Architecture:x86 (Intel(R) Xeon(R) Platinum 8358P @ 2.60GHz (64 Cores, 96MB L3 Cache))

[20 Apr 7:50] Y F
Description:
This issue can be described using two execution states over the same tables
and the same data.

In the baseline execution state S1, after creating the tables, inserting a
small amount of data, and forcing the FirstMatch semijoin path, the query
returns 4 rows:

------------------------------------
mysql> SELECT 'S1_count' AS stage, COUNT(*) AS cnt
    -> FROM t_main m
    -> WHERE m.c5 IS NULL
    ->   AND m.c4 = 8
    ->   AND LENGTH(COALESCE(m.c3, '')) >= 7
    ->   AND EXISTS (
    ->       SELECT 1
    ->       FROM t_ref r
    ->       WHERE (r.c13 IS NULL OR r.c11 IS NOT NULL)
    ->         AND r.c5 IS NOT NULL
    ->         AND m.c2 = r.c4
    ->   );
+----------+-----+
| stage    | cnt |
+----------+-----+
| S1_count | 4   |
+----------+-----+
1 row in set
------------------------------------

The 4 returned rows are:

------------------------------------
mysql> SELECT 'S1_rows' AS stage, m.c1, m.c2, m.c4
    -> FROM t_main m
    -> WHERE m.c5 IS NULL
    ->   AND m.c4 = 8
    ->   AND LENGTH(COALESCE(m.c3, '')) >= 7
    ->   AND EXISTS (
    ->       SELECT 1
    ->       FROM t_ref r
    ->       WHERE (r.c13 IS NULL OR r.c11 IS NOT NULL)
    ->         AND r.c5 IS NOT NULL
    ->         AND m.c2 = r.c4
    ->   )
    -> ORDER BY m.c1;
+---------+----+-------------------+----+
| stage   | c1 | c2                | c4 |
+---------+----+-------------------+----+
| S1_rows | 1  | not-a-date        | 8  |
| S1_rows | 2  | hv_820            | 8  |
| S1_rows | 3  | k9zs-031qxa86z954 | 8  |
| S1_rows | 4  | dldk4w-7o6        | 8  |
+---------+----+-------------------+----+
4 rows in set
------------------------------------

Then I keep the table contents unchanged, but switch the optimizer to disable
FirstMatch and force the materialized semijoin path. In this execution state
S2, the same query returns 0 rows:

------------------------------------
mysql> SELECT 'S2_count' AS stage, COUNT(*) AS cnt
    -> FROM t_main m
    -> WHERE m.c5 IS NULL
    ->   AND m.c4 = 8
    ->   AND LENGTH(COALESCE(m.c3, '')) >= 7
    ->   AND EXISTS (
    ->       SELECT 1
    ->       FROM t_ref r
    ->       WHERE (r.c13 IS NULL OR r.c11 IS NOT NULL)
    ->         AND r.c5 IS NOT NULL
    ->         AND m.c2 = r.c4
    ->   );
+----------+-----+
| stage    | cnt |
+----------+-----+
| S2_count | 0   |
+----------+-----+
1 row in set
------------------------------------

Because S1 and S2 use the same tables and the same data, they are logically
equivalent execution states. Therefore the result sets must be identical.
Instead, the observed results are:

------------------------------------
COUNT(S1) = 4
COUNT(S2) = 0
------------------------------------

This violates the expected S1-to-S2 consistency relationship.

The root cause is an optimizer bug in implicit coercion for correlated
`EXISTS`. The outer operand is `VARCHAR`, while the correlated subquery
compares it to a `YEAR` column. Under the FirstMatch semijoin path in S1,
MySQL applies the coercion correctly, so strings such as `'not-a-date'`,
`'hv_820'`, and other non-numeric values compare equal to `YEAR 0000`.
However, after switching to the materialized semijoin path in S2, those same
rows disappear entirely, showing that the implicit VARCHAR-to-YEAR coercion is
not handled correctly in that execution path.

This is confirmed by the control query:

------------------------------------
mysql> SELECT 'control' AS stage, c1, c4, c5,
    ->        ('not-a-date' = c4) AS eq_bad_str,
    ->        ('0' = c4) AS eq_zero_str,
    ->        (CAST('not-a-date' AS SIGNED) = c4) AS eq_cast_bad_str
    -> FROM t_ref
    -> ORDER BY c1;
+---------+----+------+----------+------------+-------------+-----------------+
| stage   | c1 | c4   | c5       | eq_bad_str | eq_zero_str | eq_cast_bad_str |
+---------+----+------+----------+------------+-------------+-----------------+
| control | 1  | 0    | 15:32:26 | 1          | 0           | 1               |
| control | 2  | 2008 | 2:57:30  | 0          | 0           | 0               |
| control | 3  | 2017 | 14:12:33 | 0          | 0           | 0               |
+---------+----+------+----------+------------+-------------+-----------------+
3 rows in set
------------------------------------

And under the same S2 setting, making the coercion explicit restores all 4
rows:

------------------------------------
mysql> SELECT 'S2_cast_count' AS stage, COUNT(*) AS cnt
    -> FROM t_main m
    -> WHERE m.c5 IS NULL
    ->   AND m.c4 = 8
    ->   AND LENGTH(COALESCE(m.c3, '')) >= 7
    ->   AND EXISTS (
    ->       SELECT 1
    ->       FROM t_ref r
    ->       WHERE (r.c13 IS NULL OR r.c11 IS NOT NULL)
    ->         AND r.c5 IS NOT NULL
    ->         AND CAST(m.c2 AS SIGNED) = r.c4
    ->   );
+---------------+-----+
| stage         | cnt |
+---------------+-----+
| S2_cast_count | 4   |
+---------------+-----+
1 row in set
------------------------------------

So the wrong result is specific to the implicit `VARCHAR = YEAR` comparison in
the materialized correlated `EXISTS` path, not to the underlying data.

How to repeat:
SET SESSION sql_mode = '';

DROP DATABASE IF EXISTS repro_exists_year_materialized;
CREATE DATABASE repro_exists_year_materialized;
USE repro_exists_year_materialized;

CREATE TABLE t_main (
    c1 INT NOT NULL PRIMARY KEY,
    c2 VARCHAR(64) NOT NULL,
    c3 VARCHAR(64) NULL,
    c4 INT NULL,
    c5 DATE NULL
);

CREATE TABLE t_ref (
    c1 INT NOT NULL PRIMARY KEY,
    c4 YEAR NOT NULL,
    c5 TIME NULL,
    c11 GEOMETRY NULL,
    c13 TINYBLOB NULL
);

CREATE INDEX idx_main_c2 ON t_main (c2);
CREATE INDEX idx_main_c4 ON t_main (c4);

INSERT INTO t_main (c1, c2, c3, c4, c5) VALUES
(1, 'not-a-date',        'hv_4144',     8, NULL),
(2, 'hv_820',            'not-a-date',  8, NULL),
(3, 'k9zs-031qxa86z954', 'hv_4144',     8, NULL),
(4, 'dldk4w-7o6',        'hv_4144',     8, NULL),
(5, '123',               'hv_4144',     8, NULL),
(6, 'not-a-date',        'short',       8, NULL),
(7, 'not-a-date',        'hv_4144',    12, NULL),
(8, 'other',             'hv_4144',     8, '2023-01-01');

-- INSERT IGNORE is important: NULL becomes YEAR 0000
INSERT IGNORE INTO t_ref (c1, c4, c5, c11, c13) VALUES
(1, NULL, '15:32:26', NULL, NULL),
(2, 2008, '02:57:30', NULL, X'01'),
(3, 2017, '14:12:33', NULL, X'02');

ANALYZE TABLE t_main, t_ref;

-- Optional control: coercion semantics are valid
SELECT 'control' AS stage, c1, c4, c5,
       ('not-a-date' = c4) AS eq_bad_str,
       ('0' = c4) AS eq_zero_str,
       (CAST('not-a-date' AS SIGNED) = c4) AS eq_cast_bad_str
FROM t_ref
ORDER BY c1;

-- S1: FirstMatch semijoin path
SET SESSION optimizer_switch =
  'semijoin=on,firstmatch=on,materialization=off,loosescan=off,duplicateweedout=off';

EXPLAIN FORMAT=TRADITIONAL
SELECT m.c1, m.c2
FROM t_main m
WHERE m.c5 IS NULL
  AND m.c4 = 8
  AND LENGTH(COALESCE(m.c3, '')) >= 7
  AND EXISTS (
      SELECT 1
      FROM t_ref r
      WHERE (r.c13 IS NULL OR r.c11 IS NOT NULL)
        AND r.c5 IS NOT NULL
        AND m.c2 = r.c4
  );

SELECT 'S1_count' AS stage, COUNT(*) AS cnt
FROM t_main m
WHERE m.c5 IS NULL
  AND m.c4 = 8
  AND LENGTH(COALESCE(m.c3, '')) >= 7
  AND EXISTS (
      SELECT 1
      FROM t_ref r
      WHERE (r.c13 IS NULL OR r.c11 IS NOT NULL)
        AND r.c5 IS NOT NULL
        AND m.c2 = r.c4
  );

SELECT 'S1_rows' AS stage, m.c1, m.c2, m.c4
FROM t_main m
WHERE m.c5 IS NULL
  AND m.c4 = 8
  AND LENGTH(COALESCE(m.c3, '')) >= 7
  AND EXISTS (
      SELECT 1
      FROM t_ref r
      WHERE (r.c13 IS NULL OR r.c11 IS NOT NULL)
        AND r.c5 IS NOT NULL
        AND m.c2 = r.c4
  )
ORDER BY m.c1;

-- S2: same data, but force the materialized semijoin path
SET SESSION optimizer_switch =
  'semijoin=on,firstmatch=off,materialization=on,loosescan=off,duplicateweedout=off';

EXPLAIN FORMAT=TRADITIONAL
SELECT m.c1, m.c2
FROM t_main m
WHERE m.c5 IS NULL
  AND m.c4 = 8
  AND LENGTH(COALESCE(m.c3, '')) >= 7
  AND EXISTS (
      SELECT 1
      FROM t_ref r
      WHERE (r.c13 IS NULL OR r.c11 IS NOT NULL)
        AND r.c5 IS NOT NULL
        AND m.c2 = r.c4
  );

SELECT 'S2_count' AS stage, COUNT(*) AS cnt
FROM t_main m
WHERE m.c5 IS NULL
  AND m.c4 = 8
  AND LENGTH(COALESCE(m.c3, '')) >= 7
  AND EXISTS (
      SELECT 1
      FROM t_ref r
      WHERE (r.c13 IS NULL OR r.c11 IS NOT NULL)
        AND r.c5 IS NOT NULL
        AND m.c2 = r.c4
  );

SELECT 'S2_rows' AS stage, m.c1, m.c2, m.c4
FROM t_main m
WHERE m.c5 IS NULL
  AND m.c4 = 8
  AND LENGTH(COALESCE(m.c3, '')) >= 7
  AND EXISTS (
      SELECT 1
      FROM t_ref r
      WHERE (r.c13 IS NULL OR r.c11 IS NOT NULL)
        AND r.c5 IS NOT NULL
        AND m.c2 = r.c4
  )
ORDER BY m.c1;

-- Cross-check: explicit coercion restores the rows under S2
SELECT 'S2_cast_count' AS stage, COUNT(*) AS cnt
FROM t_main m
WHERE m.c5 IS NULL
  AND m.c4 = 8
  AND LENGTH(COALESCE(m.c3, '')) >= 7
  AND EXISTS (
      SELECT 1
      FROM t_ref r
      WHERE (r.c13 IS NULL OR r.c11 IS NOT NULL)
        AND r.c5 IS NOT NULL
        AND CAST(m.c2 AS SIGNED) = r.c4
  );

SELECT 'S2_cast_rows' AS stage, m.c1, m.c2, m.c4
FROM t_main m
WHERE m.c5 IS NULL
  AND m.c4 = 8
  AND LENGTH(COALESCE(m.c3, '')) >= 7
  AND EXISTS (
      SELECT 1
      FROM t_ref r
      WHERE (r.c13 IS NULL OR r.c11 IS NOT NULL)
        AND r.c5 IS NOT NULL
        AND CAST(m.c2 AS SIGNED) = r.c4
  )
ORDER BY m.c1;
[20 Apr 9:24] Roy Lyseng
Thank you for the bug report.
Verified as described.