Bug #120284 Materialized `IN (subquery)` Loses VARCHAR-to-YEAR Coercion,Causing Plan-Dependent Result Divergence
Submitted: 18 Apr 15:46 Modified: 20 Apr 7:33
Reporter: Y F Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: Optimizer Severity:S2 (Serious)
Version:8.0.45 OS:Ubuntu (Ubuntu 20.04.4 LTS)
Assigned to: CPU Architecture:x86 (Intel(R) Xeon(R) Platinum 8358P @ 2.60GHz (64 Cores, 96MB L3 Cache))

[18 Apr 15:46] Y F
Description:
This issue can be described using two execution states over the same tables and the same data.

In the baseline execution state S1, after creating the tables, inserting a small amount of data, and enabling the normal semijoin path, the query returns 2 rows:

------------------------------------
mysql> SELECT 'S1' AS stage, COUNT(*) AS cnt
    -> FROM t_main m
    -> WHERE m.c2 = 'hv_7504'
    ->   AND m.c6 IN (
    ->       SELECT r.c4
    ->       FROM t_ref r
    ->       WHERE r.c7 >= 7 OR r.c5 IS NULL
    ->   );
+-------+-----+
| stage | cnt |
+-------+-----+
| S1    | 2   |
+-------+-----+
1 row in set
------------------------------------

The two returned rows are:

------------------------------------
mysql> SELECT 'S1_rows' AS stage, m.c1, m.c2, m.c6
    -> FROM t_main m
    -> WHERE m.c2 = 'hv_7504'
    ->   AND m.c6 IN (
    ->       SELECT r.c4
    ->       FROM t_ref r
    ->       WHERE r.c7 >= 7 OR r.c5 IS NULL
    ->   )
    -> ORDER BY m.c1;
+---------+----+---------+------------+
| stage   | c1 | c2      | c6         |
+---------+----+---------+------------+
| S1_rows | 1  | hv_7504 | not-a-date |
| S1_rows | 3  | hv_7504 | 0          |
+---------+----+---------+------------+
2 rows in set
------------------------------------

Then I keep the table contents unchanged, but switch the optimizer to disable
semijoin and force the non-semijoin `IN (subquery)` path. In this execution
state S2, the same query returns 0 rows:

------------------------------------
mysql> SELECT 'S2' AS stage, COUNT(*) AS cnt
    -> FROM t_main m
    -> WHERE m.c2 = 'hv_7504'
    ->   AND m.c6 IN (
    ->       SELECT r.c4
    ->       FROM t_ref r
    ->       WHERE r.c7 >= 7 OR r.c5 IS NULL
    ->   );
+-------+-----+
| stage | cnt |
+-------+-----+
| S2    | 0   |
+-------+-----+
1 row in set
------------------------------------

Because S1 and S2 use the same tables and the same data, they are logically
equivalent execution states. Therefore the result sets must be identical.
Instead, the observed results are:

------------------------------------
COUNT(S1) = 2
COUNT(S2) = 0
------------------------------------

This violates the expected S1-to-S2 consistency relationship.

The root cause is an optimizer bug in implicit coercion for `IN (subquery)`.
The outer operand is `VARCHAR`, while the subquery returns `YEAR`. Under the
normal semijoin / `FirstMatch` plan in S1, MySQL applies the coercion
correctly: both `'not-a-date'` and `'0'` compare equal to `YEAR 0000`.
However, after switching to the non-semijoin subquery path in S2, those same
rows disappear entirely, showing that the implicit VARCHAR-to-YEAR coercion is
not handled correctly in that execution path.

This is confirmed by a control query:

------------------------------------
mysql> SELECT 'control' AS stage, c1, c4,
    ->        ('not-a-date' = c4) AS eq_bad_str,
    ->        ('0' = c4) AS eq_zero_str
    -> FROM t_ref
    -> WHERE c7 >= 7 OR c5 IS NULL
    -> ORDER BY c1;
+---------+----+------+------------+-------------+
| stage   | c1 | c4   | eq_bad_str | eq_zero_str |
+---------+----+------+------------+-------------+
| control | 1  | 0    | 1          | 1           |
| control | 3  | 2005 | 0          | 0           |
| control | 4  | 2006 | 0          | 0           |
+---------+----+------+------------+-------------+
3 rows in set
------------------------------------

So the comparison semantics themselves are valid. The wrong result appears only
after the optimizer switches to the non-semijoin `IN (subquery)` execution
path.

How to repeat:
SET SESSION sql_mode = '';

DROP DATABASE IF EXISTS repro_year_in_subset;
CREATE DATABASE repro_year_in_subset;
USE repro_year_in_subset;

CREATE TABLE t_main (
    c1 INT PRIMARY KEY,
    c2 VARCHAR(20) NOT NULL,
    c6 VARCHAR(10) NOT NULL
);
CREATE INDEX idx_c2 ON t_main (c2);

CREATE TABLE t_ref (
    c1 INT PRIMARY KEY,
    c4 YEAR NOT NULL,
    c5 TIME NULL,
    c7 SMALLINT NULL
);

-- S1: baseline state
INSERT INTO t_main VALUES
(1, 'hv_7504', 'not-a-date'),
(2, 'hv_7504', '2023-01-01'),
(3, 'hv_7504', '0'),
(4, 'other',   'not-a-date');

-- INSERT IGNORE is important: NULL becomes YEAR 0000
INSERT IGNORE INTO t_ref VALUES
(1, NULL, NULL, NULL),
(2, 2011, '00:00:00', 1),
(3, 2005, '00:00:01', 9),
(4, 2006, '00:00:02', 9);

ANALYZE TABLE t_main, t_ref;

-- Optional control: coercion semantics are correct
SELECT 'control' AS stage, c1, c4,
       ('not-a-date' = c4) AS eq_bad_str,
       ('0' = c4) AS eq_zero_str
FROM t_ref
WHERE c7 >= 7 OR c5 IS NULL
ORDER BY c1;

-- S1: semijoin / FirstMatch path
SET SESSION optimizer_switch =
  'semijoin=on,materialization=on,subquery_materialization_cost_based=on';

EXPLAIN FORMAT=TRADITIONAL
SELECT 'S1' AS stage, COUNT(*) AS cnt
FROM t_main m
WHERE m.c2 = 'hv_7504'
  AND m.c6 IN (
      SELECT r.c4
      FROM t_ref r
      WHERE r.c7 >= 7 OR r.c5 IS NULL
  );

SELECT 'S1' AS stage, COUNT(*) AS cnt
FROM t_main m
WHERE m.c2 = 'hv_7504'
  AND m.c6 IN (
      SELECT r.c4
      FROM t_ref r
      WHERE r.c7 >= 7 OR r.c5 IS NULL
  );

SELECT 'S1_rows' AS stage, m.c1, m.c2, m.c6
FROM t_main m
WHERE m.c2 = 'hv_7504'
  AND m.c6 IN (
      SELECT r.c4
      FROM t_ref r
      WHERE r.c7 >= 7 OR r.c5 IS NULL
  )
ORDER BY m.c1;

-- S2: same data, but force the non-semijoin IN-subquery path
SET SESSION optimizer_switch =
  'semijoin=off,materialization=on,subquery_materialization_cost_based=off';

EXPLAIN FORMAT=TRADITIONAL
SELECT 'S2' AS stage, COUNT(*) AS cnt
FROM t_main m
WHERE m.c2 = 'hv_7504'
  AND m.c6 IN (
      SELECT r.c4
      FROM t_ref r
      WHERE r.c7 >= 7 OR r.c5 IS NULL
  );

SELECT 'S2' AS stage, COUNT(*) AS cnt
FROM t_main m
WHERE m.c2 = 'hv_7504'
  AND m.c6 IN (
      SELECT r.c4
      FROM t_ref r
      WHERE r.c7 >= 7 OR r.c5 IS NULL
  );

SELECT 'S2_rows' AS stage, m.c1, m.c2, m.c6
FROM t_main m
WHERE m.c2 = 'hv_7504'
  AND m.c6 IN (
      SELECT r.c4
      FROM t_ref r
      WHERE r.c7 >= 7 OR r.c5 IS NULL
  )
ORDER BY m.c1;

-- Cross-check: EXISTS remains correct under the forced S2 setting
EXPLAIN FORMAT=TRADITIONAL
SELECT 'S2_exists' AS stage, COUNT(*) AS cnt
FROM t_main m
WHERE m.c2 = 'hv_7504'
  AND EXISTS (
      SELECT 1
      FROM t_ref r
      WHERE (r.c7 >= 7 OR r.c5 IS NULL)
        AND m.c6 = r.c4
  );

SELECT 'S2_exists' AS stage, COUNT(*) AS cnt
FROM t_main m
WHERE m.c2 = 'hv_7504'
  AND EXISTS (
      SELECT 1
      FROM t_ref r
      WHERE (r.c7 >= 7 OR r.c5 IS NULL)
        AND m.c6 = r.c4
  );
[20 Apr 7:33] Roy Lyseng
Thank you for the bug report.
Verified as described.