Description:
This issue can be described using two execution states over the same tables and the same data.
In the baseline execution state S1, after creating the tables, inserting a small amount of data, and enabling the normal semijoin path, the query returns 2 rows:
------------------------------------
mysql> SELECT 'S1' AS stage, COUNT(*) AS cnt
-> FROM t_main m
-> WHERE m.c2 = 'hv_7504'
-> AND m.c6 IN (
-> SELECT r.c4
-> FROM t_ref r
-> WHERE r.c7 >= 7 OR r.c5 IS NULL
-> );
+-------+-----+
| stage | cnt |
+-------+-----+
| S1 | 2 |
+-------+-----+
1 row in set
------------------------------------
The two returned rows are:
------------------------------------
mysql> SELECT 'S1_rows' AS stage, m.c1, m.c2, m.c6
-> FROM t_main m
-> WHERE m.c2 = 'hv_7504'
-> AND m.c6 IN (
-> SELECT r.c4
-> FROM t_ref r
-> WHERE r.c7 >= 7 OR r.c5 IS NULL
-> )
-> ORDER BY m.c1;
+---------+----+---------+------------+
| stage | c1 | c2 | c6 |
+---------+----+---------+------------+
| S1_rows | 1 | hv_7504 | not-a-date |
| S1_rows | 3 | hv_7504 | 0 |
+---------+----+---------+------------+
2 rows in set
------------------------------------
Then I keep the table contents unchanged, but switch the optimizer to disable
semijoin and force the non-semijoin `IN (subquery)` path. In this execution
state S2, the same query returns 0 rows:
------------------------------------
mysql> SELECT 'S2' AS stage, COUNT(*) AS cnt
-> FROM t_main m
-> WHERE m.c2 = 'hv_7504'
-> AND m.c6 IN (
-> SELECT r.c4
-> FROM t_ref r
-> WHERE r.c7 >= 7 OR r.c5 IS NULL
-> );
+-------+-----+
| stage | cnt |
+-------+-----+
| S2 | 0 |
+-------+-----+
1 row in set
------------------------------------
Because S1 and S2 use the same tables and the same data, they are logically
equivalent execution states. Therefore the result sets must be identical.
Instead, the observed results are:
------------------------------------
COUNT(S1) = 2
COUNT(S2) = 0
------------------------------------
This violates the expected S1-to-S2 consistency relationship.
The root cause is an optimizer bug in implicit coercion for `IN (subquery)`.
The outer operand is `VARCHAR`, while the subquery returns `YEAR`. Under the
normal semijoin / `FirstMatch` plan in S1, MySQL applies the coercion
correctly: both `'not-a-date'` and `'0'` compare equal to `YEAR 0000`.
However, after switching to the non-semijoin subquery path in S2, those same
rows disappear entirely, showing that the implicit VARCHAR-to-YEAR coercion is
not handled correctly in that execution path.
This is confirmed by a control query:
------------------------------------
mysql> SELECT 'control' AS stage, c1, c4,
-> ('not-a-date' = c4) AS eq_bad_str,
-> ('0' = c4) AS eq_zero_str
-> FROM t_ref
-> WHERE c7 >= 7 OR c5 IS NULL
-> ORDER BY c1;
+---------+----+------+------------+-------------+
| stage | c1 | c4 | eq_bad_str | eq_zero_str |
+---------+----+------+------------+-------------+
| control | 1 | 0 | 1 | 1 |
| control | 3 | 2005 | 0 | 0 |
| control | 4 | 2006 | 0 | 0 |
+---------+----+------+------------+-------------+
3 rows in set
------------------------------------
So the comparison semantics themselves are valid. The wrong result appears only
after the optimizer switches to the non-semijoin `IN (subquery)` execution
path.
How to repeat:
SET SESSION sql_mode = '';
DROP DATABASE IF EXISTS repro_year_in_subset;
CREATE DATABASE repro_year_in_subset;
USE repro_year_in_subset;
CREATE TABLE t_main (
c1 INT PRIMARY KEY,
c2 VARCHAR(20) NOT NULL,
c6 VARCHAR(10) NOT NULL
);
CREATE INDEX idx_c2 ON t_main (c2);
CREATE TABLE t_ref (
c1 INT PRIMARY KEY,
c4 YEAR NOT NULL,
c5 TIME NULL,
c7 SMALLINT NULL
);
-- S1: baseline state
INSERT INTO t_main VALUES
(1, 'hv_7504', 'not-a-date'),
(2, 'hv_7504', '2023-01-01'),
(3, 'hv_7504', '0'),
(4, 'other', 'not-a-date');
-- INSERT IGNORE is important: NULL becomes YEAR 0000
INSERT IGNORE INTO t_ref VALUES
(1, NULL, NULL, NULL),
(2, 2011, '00:00:00', 1),
(3, 2005, '00:00:01', 9),
(4, 2006, '00:00:02', 9);
ANALYZE TABLE t_main, t_ref;
-- Optional control: coercion semantics are correct
SELECT 'control' AS stage, c1, c4,
('not-a-date' = c4) AS eq_bad_str,
('0' = c4) AS eq_zero_str
FROM t_ref
WHERE c7 >= 7 OR c5 IS NULL
ORDER BY c1;
-- S1: semijoin / FirstMatch path
SET SESSION optimizer_switch =
'semijoin=on,materialization=on,subquery_materialization_cost_based=on';
EXPLAIN FORMAT=TRADITIONAL
SELECT 'S1' AS stage, COUNT(*) AS cnt
FROM t_main m
WHERE m.c2 = 'hv_7504'
AND m.c6 IN (
SELECT r.c4
FROM t_ref r
WHERE r.c7 >= 7 OR r.c5 IS NULL
);
SELECT 'S1' AS stage, COUNT(*) AS cnt
FROM t_main m
WHERE m.c2 = 'hv_7504'
AND m.c6 IN (
SELECT r.c4
FROM t_ref r
WHERE r.c7 >= 7 OR r.c5 IS NULL
);
SELECT 'S1_rows' AS stage, m.c1, m.c2, m.c6
FROM t_main m
WHERE m.c2 = 'hv_7504'
AND m.c6 IN (
SELECT r.c4
FROM t_ref r
WHERE r.c7 >= 7 OR r.c5 IS NULL
)
ORDER BY m.c1;
-- S2: same data, but force the non-semijoin IN-subquery path
SET SESSION optimizer_switch =
'semijoin=off,materialization=on,subquery_materialization_cost_based=off';
EXPLAIN FORMAT=TRADITIONAL
SELECT 'S2' AS stage, COUNT(*) AS cnt
FROM t_main m
WHERE m.c2 = 'hv_7504'
AND m.c6 IN (
SELECT r.c4
FROM t_ref r
WHERE r.c7 >= 7 OR r.c5 IS NULL
);
SELECT 'S2' AS stage, COUNT(*) AS cnt
FROM t_main m
WHERE m.c2 = 'hv_7504'
AND m.c6 IN (
SELECT r.c4
FROM t_ref r
WHERE r.c7 >= 7 OR r.c5 IS NULL
);
SELECT 'S2_rows' AS stage, m.c1, m.c2, m.c6
FROM t_main m
WHERE m.c2 = 'hv_7504'
AND m.c6 IN (
SELECT r.c4
FROM t_ref r
WHERE r.c7 >= 7 OR r.c5 IS NULL
)
ORDER BY m.c1;
-- Cross-check: EXISTS remains correct under the forced S2 setting
EXPLAIN FORMAT=TRADITIONAL
SELECT 'S2_exists' AS stage, COUNT(*) AS cnt
FROM t_main m
WHERE m.c2 = 'hv_7504'
AND EXISTS (
SELECT 1
FROM t_ref r
WHERE (r.c7 >= 7 OR r.c5 IS NULL)
AND m.c6 = r.c4
);
SELECT 'S2_exists' AS stage, COUNT(*) AS cnt
FROM t_main m
WHERE m.c2 = 'hv_7504'
AND EXISTS (
SELECT 1
FROM t_ref r
WHERE (r.c7 >= 7 OR r.c5 IS NULL)
AND m.c6 = r.c4
);
Description: This issue can be described using two execution states over the same tables and the same data. In the baseline execution state S1, after creating the tables, inserting a small amount of data, and enabling the normal semijoin path, the query returns 2 rows: ------------------------------------ mysql> SELECT 'S1' AS stage, COUNT(*) AS cnt -> FROM t_main m -> WHERE m.c2 = 'hv_7504' -> AND m.c6 IN ( -> SELECT r.c4 -> FROM t_ref r -> WHERE r.c7 >= 7 OR r.c5 IS NULL -> ); +-------+-----+ | stage | cnt | +-------+-----+ | S1 | 2 | +-------+-----+ 1 row in set ------------------------------------ The two returned rows are: ------------------------------------ mysql> SELECT 'S1_rows' AS stage, m.c1, m.c2, m.c6 -> FROM t_main m -> WHERE m.c2 = 'hv_7504' -> AND m.c6 IN ( -> SELECT r.c4 -> FROM t_ref r -> WHERE r.c7 >= 7 OR r.c5 IS NULL -> ) -> ORDER BY m.c1; +---------+----+---------+------------+ | stage | c1 | c2 | c6 | +---------+----+---------+------------+ | S1_rows | 1 | hv_7504 | not-a-date | | S1_rows | 3 | hv_7504 | 0 | +---------+----+---------+------------+ 2 rows in set ------------------------------------ Then I keep the table contents unchanged, but switch the optimizer to disable semijoin and force the non-semijoin `IN (subquery)` path. In this execution state S2, the same query returns 0 rows: ------------------------------------ mysql> SELECT 'S2' AS stage, COUNT(*) AS cnt -> FROM t_main m -> WHERE m.c2 = 'hv_7504' -> AND m.c6 IN ( -> SELECT r.c4 -> FROM t_ref r -> WHERE r.c7 >= 7 OR r.c5 IS NULL -> ); +-------+-----+ | stage | cnt | +-------+-----+ | S2 | 0 | +-------+-----+ 1 row in set ------------------------------------ Because S1 and S2 use the same tables and the same data, they are logically equivalent execution states. Therefore the result sets must be identical. Instead, the observed results are: ------------------------------------ COUNT(S1) = 2 COUNT(S2) = 0 ------------------------------------ This violates the expected S1-to-S2 consistency relationship. The root cause is an optimizer bug in implicit coercion for `IN (subquery)`. The outer operand is `VARCHAR`, while the subquery returns `YEAR`. Under the normal semijoin / `FirstMatch` plan in S1, MySQL applies the coercion correctly: both `'not-a-date'` and `'0'` compare equal to `YEAR 0000`. However, after switching to the non-semijoin subquery path in S2, those same rows disappear entirely, showing that the implicit VARCHAR-to-YEAR coercion is not handled correctly in that execution path. This is confirmed by a control query: ------------------------------------ mysql> SELECT 'control' AS stage, c1, c4, -> ('not-a-date' = c4) AS eq_bad_str, -> ('0' = c4) AS eq_zero_str -> FROM t_ref -> WHERE c7 >= 7 OR c5 IS NULL -> ORDER BY c1; +---------+----+------+------------+-------------+ | stage | c1 | c4 | eq_bad_str | eq_zero_str | +---------+----+------+------------+-------------+ | control | 1 | 0 | 1 | 1 | | control | 3 | 2005 | 0 | 0 | | control | 4 | 2006 | 0 | 0 | +---------+----+------+------------+-------------+ 3 rows in set ------------------------------------ So the comparison semantics themselves are valid. The wrong result appears only after the optimizer switches to the non-semijoin `IN (subquery)` execution path. How to repeat: SET SESSION sql_mode = ''; DROP DATABASE IF EXISTS repro_year_in_subset; CREATE DATABASE repro_year_in_subset; USE repro_year_in_subset; CREATE TABLE t_main ( c1 INT PRIMARY KEY, c2 VARCHAR(20) NOT NULL, c6 VARCHAR(10) NOT NULL ); CREATE INDEX idx_c2 ON t_main (c2); CREATE TABLE t_ref ( c1 INT PRIMARY KEY, c4 YEAR NOT NULL, c5 TIME NULL, c7 SMALLINT NULL ); -- S1: baseline state INSERT INTO t_main VALUES (1, 'hv_7504', 'not-a-date'), (2, 'hv_7504', '2023-01-01'), (3, 'hv_7504', '0'), (4, 'other', 'not-a-date'); -- INSERT IGNORE is important: NULL becomes YEAR 0000 INSERT IGNORE INTO t_ref VALUES (1, NULL, NULL, NULL), (2, 2011, '00:00:00', 1), (3, 2005, '00:00:01', 9), (4, 2006, '00:00:02', 9); ANALYZE TABLE t_main, t_ref; -- Optional control: coercion semantics are correct SELECT 'control' AS stage, c1, c4, ('not-a-date' = c4) AS eq_bad_str, ('0' = c4) AS eq_zero_str FROM t_ref WHERE c7 >= 7 OR c5 IS NULL ORDER BY c1; -- S1: semijoin / FirstMatch path SET SESSION optimizer_switch = 'semijoin=on,materialization=on,subquery_materialization_cost_based=on'; EXPLAIN FORMAT=TRADITIONAL SELECT 'S1' AS stage, COUNT(*) AS cnt FROM t_main m WHERE m.c2 = 'hv_7504' AND m.c6 IN ( SELECT r.c4 FROM t_ref r WHERE r.c7 >= 7 OR r.c5 IS NULL ); SELECT 'S1' AS stage, COUNT(*) AS cnt FROM t_main m WHERE m.c2 = 'hv_7504' AND m.c6 IN ( SELECT r.c4 FROM t_ref r WHERE r.c7 >= 7 OR r.c5 IS NULL ); SELECT 'S1_rows' AS stage, m.c1, m.c2, m.c6 FROM t_main m WHERE m.c2 = 'hv_7504' AND m.c6 IN ( SELECT r.c4 FROM t_ref r WHERE r.c7 >= 7 OR r.c5 IS NULL ) ORDER BY m.c1; -- S2: same data, but force the non-semijoin IN-subquery path SET SESSION optimizer_switch = 'semijoin=off,materialization=on,subquery_materialization_cost_based=off'; EXPLAIN FORMAT=TRADITIONAL SELECT 'S2' AS stage, COUNT(*) AS cnt FROM t_main m WHERE m.c2 = 'hv_7504' AND m.c6 IN ( SELECT r.c4 FROM t_ref r WHERE r.c7 >= 7 OR r.c5 IS NULL ); SELECT 'S2' AS stage, COUNT(*) AS cnt FROM t_main m WHERE m.c2 = 'hv_7504' AND m.c6 IN ( SELECT r.c4 FROM t_ref r WHERE r.c7 >= 7 OR r.c5 IS NULL ); SELECT 'S2_rows' AS stage, m.c1, m.c2, m.c6 FROM t_main m WHERE m.c2 = 'hv_7504' AND m.c6 IN ( SELECT r.c4 FROM t_ref r WHERE r.c7 >= 7 OR r.c5 IS NULL ) ORDER BY m.c1; -- Cross-check: EXISTS remains correct under the forced S2 setting EXPLAIN FORMAT=TRADITIONAL SELECT 'S2_exists' AS stage, COUNT(*) AS cnt FROM t_main m WHERE m.c2 = 'hv_7504' AND EXISTS ( SELECT 1 FROM t_ref r WHERE (r.c7 >= 7 OR r.c5 IS NULL) AND m.c6 = r.c4 ); SELECT 'S2_exists' AS stage, COUNT(*) AS cnt FROM t_main m WHERE m.c2 = 'hv_7504' AND EXISTS ( SELECT 1 FROM t_ref r WHERE (r.c7 >= 7 OR r.c5 IS NULL) AND m.c6 = r.c4 );