Description:
When a query containing DISTINCT (or DISTINCTROW), GROUP BY, and SUM(DISTINCT EXISTS(subquery)) is rewritten as a derived table (or CTE), the optimizer, with derived_merge=on (default), merges the derived table into the outer query and adds a Sort with duplicate removal step that includes the aggregate column c (the result of SUM(DISTINCT EXISTS(...)) in the duplicate‑removal key.
Because this aggregate is constant per group (e.g., 1), but the grouping columns (a and b) may differ (NULL vs 0), the duplicate removal incorrectly treats these rows as duplicates and keeps only one, leading to missing rows.
Disabling derived_merge restores the correct result, while disabling hash_join does not, confirming the issue lies in query rewriting rather than join algorithms.
Actual vs Expected Results
Case Query Type Result Comparison (Actual → Expected)
1 Original single‑table 2 rows: (NULL,NULL,1), (NULL,0,1) → OK
1 Derived table 1 row: (NULL,NULL,1) → FAIL (missing (NULL,0,1))
2 Original single‑table 2 rows: (0,NULL,0), (0,1,0) → OK
2 CTE 1 row: (0,NULL,0) → FAIL (missing (0,1,0))
EXPLAIN Analysis
Case 1 (derived table version) key output:
text
Sort with duplicate removal: a, b, c
-> Stream results
-> Group aggregate: sum(distinct exists(select #2))
-> Sort: a, ((l.c = l.id) and (0 <> ifnull(l.c,1767027735)) and ...)
-> Inner hash join ...
Top‑level Sort with duplicate removal: a, b, c shows that the aggregate column c is included in the duplicate‑removal key.
Normally, DISTINCT should only deduplicate on non‑aggregated SELECT columns (here a and b). Including the aggregate column c is incorrect because aggregates do not determine row uniqueness at the group level.
Case 2 (CTE version) similar output:
text
Sort with duplicate removal: a, b, c
-> Group aggregate: sum(distinct exists(select #2))
-> Sort: a, (((NULL in (NULL,l.id)) is true) or (0 <> greatest(l.c,...)))
Again, c is erroneously part of the duplicate‑removal key.
Cause Analysis
derived_merge mechanism: When the outer query references a derived table (or CTE), the optimizer attempts to merge it into the outer query to eliminate the temporary table. After merging, it must re‑handle DISTINCT, GROUP BY, and aggregate functions.
Flawed rewrite logic: During merging, the optimizer incorrectly decides that duplicate removal should apply to all SELECT items, including the aggregate column c, thus adding Sort with duplicate removal: a, b, c at the top level.
Semantic conflict: SUM(DISTINCT EXISTS(...)) is an aggregate that produces a single value per group (here constant 1 or 0). This value should not be part of the duplicate‑removal key, because deduplication is performed after grouping, where each group is already unique by the GROUP BY columns. However, the optimizer includes the aggregate column in the key, causing rows that differ only in b to be considered duplicates when a and c match.
Why disabling derived_merge fixes it: When derived_merge=off, the derived table is materialized as a temporary table. The outer query treats it as a plain table, and the optimizer does not attempt to combine the outer DISTINCT with the inner grouping and aggregation. Thus, the erroneous duplicate‑removal key is not constructed; the standard semantics of GROUP BY and DISTINCT are preserved.
How to repeat:
Case 1 (Derived Table)
DROP DATABASE IF EXISTS vp_min12;
CREATE DATABASE vp_min12;
USE vp_min12;
CREATE TABLE s (
id BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
c FLOAT NULL
) ENGINE=InnoDB;
INSERT INTO s(c) VALUES (NULL), (0.524686);
CREATE TABLE l (
id BIGINT NOT NULL PRIMARY KEY,
c FLOAT NULL
) ENGINE=InnoDB;
CREATE TABLE r (
id BIGINT NOT NULL PRIMARY KEY
) ENGINE=InnoDB;
INSERT INTO l SELECT id, c FROM s;
INSERT INTO r SELECT id FROM s;
-- Original single‑table query: returns 2 rows
SELECT DISTINCT
LEAST((! s.c), s.id NOT IN (NULL, NULL, s.c)) AS a,
(((s.c) IN (s.id)) AND (IFNULL(s.c, 1767027735))) AND (0.07523026689631984) AS b,
SUM(DISTINCT EXISTS (SELECT 1)) AS c
FROM s
GROUP BY
LEAST((NOT s.c), s.id NOT IN (NULL, NULL, s.c)),
(((s.c) IN (s.id)) AND (IFNULL(s.c, 1767027735))) AND (0.07523026689631984);
-- Derived‑table rewrite (split): returns only 1 row (incorrect)
SELECT DISTINCT
LEAST((! x.c), x.id NOT IN (NULL, NULL, x.c)) AS a,
(((x.c) IN (x.id)) AND (IFNULL(x.c, 1767027735))) AND (0.07523026689631984) AS b,
SUM(DISTINCT EXISTS (SELECT 1)) AS c
FROM (
SELECT l.id, l.c
FROM (SELECT * FROM l) l
JOIN (SELECT * FROM r) r ON l.id = r.id
) x
GROUP BY
LEAST((NOT x.c), x.id NOT IN (NULL, NULL, x.c)),
(((x.c) IN (x.id)) AND (IFNULL(x.c, 1767027735))) AND (0.07523026689631984);
Case 2 (CTE)
DROP DATABASE IF EXISTS vp_min4;
CREATE DATABASE vp_min4;
USE vp_min4;
CREATE TABLE s (
id BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
c DECIMAL(65,30) NULL
) ENGINE=InnoDB;
INSERT INTO s(c) VALUES (NULL), (1);
CREATE TABLE lt (
id BIGINT NOT NULL PRIMARY KEY,
c DECIMAL(65,30) NULL
) ENGINE=InnoDB;
CREATE TABLE rt (
id BIGINT NOT NULL PRIMARY KEY
) ENGINE=InnoDB;
INSERT INTO lt SELECT id, c FROM s;
INSERT INTO rt SELECT id FROM s;
-- Original query: returns 2 rows
SELECT DISTINCTROW
(s.id) IS NOT TRUE AS a,
(((NULL) IN (NULL, s.id)) IS TRUE) || ((+ (GREATEST(s.c, 0.4379614263980073)))) AS b,
SUM(DISTINCT EXISTS (SELECT 1 WHERE FALSE)) AS c
FROM s
GROUP BY
(s.id) IS NOT TRUE,
(((NULL) IN (NULL, s.id)) IS TRUE) || ((+ (GREATEST(s.c, 0.4379614263980073))));
-- CTE rewrite: returns only 1 row (incorrect)
WITH
l AS (SELECT * FROM lt),
r AS (SELECT * FROM rt)
SELECT DISTINCTROW
(l.id) IS NOT TRUE AS a,
(((NULL) IN (NULL, l.id)) IS TRUE) || ((+ (GREATEST(l.c, 0.4379614263980073)))) AS b,
SUM(DISTINCT EXISTS (SELECT 1 WHERE FALSE)) AS c
FROM l JOIN r ON l.id = r.id
GROUP BY
(l.id) IS NOT TRUE,
(((NULL) IN (NULL, l.id)) IS TRUE) || ((+ (GREATEST(l.c, 0.4379614263980073))));
Description: When a query containing DISTINCT (or DISTINCTROW), GROUP BY, and SUM(DISTINCT EXISTS(subquery)) is rewritten as a derived table (or CTE), the optimizer, with derived_merge=on (default), merges the derived table into the outer query and adds a Sort with duplicate removal step that includes the aggregate column c (the result of SUM(DISTINCT EXISTS(...)) in the duplicate‑removal key. Because this aggregate is constant per group (e.g., 1), but the grouping columns (a and b) may differ (NULL vs 0), the duplicate removal incorrectly treats these rows as duplicates and keeps only one, leading to missing rows. Disabling derived_merge restores the correct result, while disabling hash_join does not, confirming the issue lies in query rewriting rather than join algorithms. Actual vs Expected Results Case Query Type Result Comparison (Actual → Expected) 1 Original single‑table 2 rows: (NULL,NULL,1), (NULL,0,1) → OK 1 Derived table 1 row: (NULL,NULL,1) → FAIL (missing (NULL,0,1)) 2 Original single‑table 2 rows: (0,NULL,0), (0,1,0) → OK 2 CTE 1 row: (0,NULL,0) → FAIL (missing (0,1,0)) EXPLAIN Analysis Case 1 (derived table version) key output: text Sort with duplicate removal: a, b, c -> Stream results -> Group aggregate: sum(distinct exists(select #2)) -> Sort: a, ((l.c = l.id) and (0 <> ifnull(l.c,1767027735)) and ...) -> Inner hash join ... Top‑level Sort with duplicate removal: a, b, c shows that the aggregate column c is included in the duplicate‑removal key. Normally, DISTINCT should only deduplicate on non‑aggregated SELECT columns (here a and b). Including the aggregate column c is incorrect because aggregates do not determine row uniqueness at the group level. Case 2 (CTE version) similar output: text Sort with duplicate removal: a, b, c -> Group aggregate: sum(distinct exists(select #2)) -> Sort: a, (((NULL in (NULL,l.id)) is true) or (0 <> greatest(l.c,...))) Again, c is erroneously part of the duplicate‑removal key. Cause Analysis derived_merge mechanism: When the outer query references a derived table (or CTE), the optimizer attempts to merge it into the outer query to eliminate the temporary table. After merging, it must re‑handle DISTINCT, GROUP BY, and aggregate functions. Flawed rewrite logic: During merging, the optimizer incorrectly decides that duplicate removal should apply to all SELECT items, including the aggregate column c, thus adding Sort with duplicate removal: a, b, c at the top level. Semantic conflict: SUM(DISTINCT EXISTS(...)) is an aggregate that produces a single value per group (here constant 1 or 0). This value should not be part of the duplicate‑removal key, because deduplication is performed after grouping, where each group is already unique by the GROUP BY columns. However, the optimizer includes the aggregate column in the key, causing rows that differ only in b to be considered duplicates when a and c match. Why disabling derived_merge fixes it: When derived_merge=off, the derived table is materialized as a temporary table. The outer query treats it as a plain table, and the optimizer does not attempt to combine the outer DISTINCT with the inner grouping and aggregation. Thus, the erroneous duplicate‑removal key is not constructed; the standard semantics of GROUP BY and DISTINCT are preserved. How to repeat: Case 1 (Derived Table) DROP DATABASE IF EXISTS vp_min12; CREATE DATABASE vp_min12; USE vp_min12; CREATE TABLE s ( id BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY, c FLOAT NULL ) ENGINE=InnoDB; INSERT INTO s(c) VALUES (NULL), (0.524686); CREATE TABLE l ( id BIGINT NOT NULL PRIMARY KEY, c FLOAT NULL ) ENGINE=InnoDB; CREATE TABLE r ( id BIGINT NOT NULL PRIMARY KEY ) ENGINE=InnoDB; INSERT INTO l SELECT id, c FROM s; INSERT INTO r SELECT id FROM s; -- Original single‑table query: returns 2 rows SELECT DISTINCT LEAST((! s.c), s.id NOT IN (NULL, NULL, s.c)) AS a, (((s.c) IN (s.id)) AND (IFNULL(s.c, 1767027735))) AND (0.07523026689631984) AS b, SUM(DISTINCT EXISTS (SELECT 1)) AS c FROM s GROUP BY LEAST((NOT s.c), s.id NOT IN (NULL, NULL, s.c)), (((s.c) IN (s.id)) AND (IFNULL(s.c, 1767027735))) AND (0.07523026689631984); -- Derived‑table rewrite (split): returns only 1 row (incorrect) SELECT DISTINCT LEAST((! x.c), x.id NOT IN (NULL, NULL, x.c)) AS a, (((x.c) IN (x.id)) AND (IFNULL(x.c, 1767027735))) AND (0.07523026689631984) AS b, SUM(DISTINCT EXISTS (SELECT 1)) AS c FROM ( SELECT l.id, l.c FROM (SELECT * FROM l) l JOIN (SELECT * FROM r) r ON l.id = r.id ) x GROUP BY LEAST((NOT x.c), x.id NOT IN (NULL, NULL, x.c)), (((x.c) IN (x.id)) AND (IFNULL(x.c, 1767027735))) AND (0.07523026689631984); Case 2 (CTE) DROP DATABASE IF EXISTS vp_min4; CREATE DATABASE vp_min4; USE vp_min4; CREATE TABLE s ( id BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY, c DECIMAL(65,30) NULL ) ENGINE=InnoDB; INSERT INTO s(c) VALUES (NULL), (1); CREATE TABLE lt ( id BIGINT NOT NULL PRIMARY KEY, c DECIMAL(65,30) NULL ) ENGINE=InnoDB; CREATE TABLE rt ( id BIGINT NOT NULL PRIMARY KEY ) ENGINE=InnoDB; INSERT INTO lt SELECT id, c FROM s; INSERT INTO rt SELECT id FROM s; -- Original query: returns 2 rows SELECT DISTINCTROW (s.id) IS NOT TRUE AS a, (((NULL) IN (NULL, s.id)) IS TRUE) || ((+ (GREATEST(s.c, 0.4379614263980073)))) AS b, SUM(DISTINCT EXISTS (SELECT 1 WHERE FALSE)) AS c FROM s GROUP BY (s.id) IS NOT TRUE, (((NULL) IN (NULL, s.id)) IS TRUE) || ((+ (GREATEST(s.c, 0.4379614263980073)))); -- CTE rewrite: returns only 1 row (incorrect) WITH l AS (SELECT * FROM lt), r AS (SELECT * FROM rt) SELECT DISTINCTROW (l.id) IS NOT TRUE AS a, (((NULL) IN (NULL, l.id)) IS TRUE) || ((+ (GREATEST(l.c, 0.4379614263980073)))) AS b, SUM(DISTINCT EXISTS (SELECT 1 WHERE FALSE)) AS c FROM l JOIN r ON l.id = r.id GROUP BY (l.id) IS NOT TRUE, (((NULL) IN (NULL, l.id)) IS TRUE) || ((+ (GREATEST(l.c, 0.4379614263980073))));