Bug #120732 Derived Table Merging Causes Incorrect Duplicate Removal with DISTINCT + GROUP BY + SUM(DISTINCT EXISTS(subquery))
Submitted: 20 Jun 13:22
Reporter: Annie liu Email Updates:
Status: Open Impact on me:
None 
Category:MySQL Server: Optimizer Severity:S3 (Non-critical)
Version:MySQL 9.6 OS:Any (Storage Engine: InnoDB)
Assigned to: CPU Architecture:Any

[20 Jun 13:22] Annie liu
Description:
When a query containing DISTINCT (or DISTINCTROW), GROUP BY, and SUM(DISTINCT EXISTS(subquery)) is rewritten as a derived table (or CTE), the optimizer, with derived_merge=on (default), merges the derived table into the outer query and adds a Sort with duplicate removal step that includes the aggregate column c (the result of SUM(DISTINCT EXISTS(...)) in the duplicate‑removal key.

Because this aggregate is constant per group (e.g., 1), but the grouping columns (a and b) may differ (NULL vs 0), the duplicate removal incorrectly treats these rows as duplicates and keeps only one, leading to missing rows.

Disabling derived_merge restores the correct result, while disabling hash_join does not, confirming the issue lies in query rewriting rather than join algorithms.

Actual vs Expected Results

Case	Query Type	         Result Comparison (Actual → Expected)
1	Original single‑table	 2 rows: (NULL,NULL,1), (NULL,0,1) → OK
1	Derived table	         1 row: (NULL,NULL,1) → FAIL (missing (NULL,0,1))
2	Original single‑table	 2 rows: (0,NULL,0), (0,1,0) → OK
2	CTE	                 1 row: (0,NULL,0) → FAIL (missing (0,1,0))

EXPLAIN Analysis

Case 1 (derived table version) key output:
text
Sort with duplicate removal: a, b, c
  -> Stream results
      -> Group aggregate: sum(distinct exists(select #2))
          -> Sort: a, ((l.c = l.id) and (0 <> ifnull(l.c,1767027735)) and ...)
              -> Inner hash join ...
Top‑level Sort with duplicate removal: a, b, c shows that the aggregate column c is included in the duplicate‑removal key.

Normally, DISTINCT should only deduplicate on non‑aggregated SELECT columns (here a and b). Including the aggregate column c is incorrect because aggregates do not determine row uniqueness at the group level.

Case 2 (CTE version) similar output:
text
Sort with duplicate removal: a, b, c
  -> Group aggregate: sum(distinct exists(select #2))
      -> Sort: a, (((NULL in (NULL,l.id)) is true) or (0 <> greatest(l.c,...)))
Again, c is erroneously part of the duplicate‑removal key.

Cause Analysis

derived_merge mechanism: When the outer query references a derived table (or CTE), the optimizer attempts to merge it into the outer query to eliminate the temporary table. After merging, it must re‑handle DISTINCT, GROUP BY, and aggregate functions.

Flawed rewrite logic: During merging, the optimizer incorrectly decides that duplicate removal should apply to all SELECT items, including the aggregate column c, thus adding Sort with duplicate removal: a, b, c at the top level.

Semantic conflict: SUM(DISTINCT EXISTS(...)) is an aggregate that produces a single value per group (here constant 1 or 0). This value should not be part of the duplicate‑removal key, because deduplication is performed after grouping, where each group is already unique by the GROUP BY columns. However, the optimizer includes the aggregate column in the key, causing rows that differ only in b to be considered duplicates when a and c match.

Why disabling derived_merge fixes it: When derived_merge=off, the derived table is materialized as a temporary table. The outer query treats it as a plain table, and the optimizer does not attempt to combine the outer DISTINCT with the inner grouping and aggregation. Thus, the erroneous duplicate‑removal key is not constructed; the standard semantics of GROUP BY and DISTINCT are preserved.

How to repeat:
Case 1 (Derived Table)

DROP DATABASE IF EXISTS vp_min12;
CREATE DATABASE vp_min12;
USE vp_min12;

CREATE TABLE s (
  id BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  c FLOAT NULL
) ENGINE=InnoDB;

INSERT INTO s(c) VALUES (NULL), (0.524686);

CREATE TABLE l (
  id BIGINT NOT NULL PRIMARY KEY,
  c FLOAT NULL
) ENGINE=InnoDB;

CREATE TABLE r (
  id BIGINT NOT NULL PRIMARY KEY
) ENGINE=InnoDB;

INSERT INTO l SELECT id, c FROM s;
INSERT INTO r SELECT id FROM s;

-- Original single‑table query: returns 2 rows
SELECT DISTINCT
  LEAST((! s.c), s.id NOT IN (NULL, NULL, s.c)) AS a,
  (((s.c) IN (s.id)) AND (IFNULL(s.c, 1767027735))) AND (0.07523026689631984) AS b,
  SUM(DISTINCT EXISTS (SELECT 1)) AS c
FROM s
GROUP BY
  LEAST((NOT s.c), s.id NOT IN (NULL, NULL, s.c)),
  (((s.c) IN (s.id)) AND (IFNULL(s.c, 1767027735))) AND (0.07523026689631984);

-- Derived‑table rewrite (split): returns only 1 row (incorrect)
SELECT DISTINCT
  LEAST((! x.c), x.id NOT IN (NULL, NULL, x.c)) AS a,
  (((x.c) IN (x.id)) AND (IFNULL(x.c, 1767027735))) AND (0.07523026689631984) AS b,
  SUM(DISTINCT EXISTS (SELECT 1)) AS c
FROM (
  SELECT l.id, l.c
  FROM (SELECT * FROM l) l
  JOIN (SELECT * FROM r) r ON l.id = r.id
) x
GROUP BY
  LEAST((NOT x.c), x.id NOT IN (NULL, NULL, x.c)),
  (((x.c) IN (x.id)) AND (IFNULL(x.c, 1767027735))) AND (0.07523026689631984);

Case 2 (CTE)

DROP DATABASE IF EXISTS vp_min4;
CREATE DATABASE vp_min4;
USE vp_min4;

CREATE TABLE s (
  id BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  c DECIMAL(65,30) NULL
) ENGINE=InnoDB;

INSERT INTO s(c) VALUES (NULL), (1);

CREATE TABLE lt (
  id BIGINT NOT NULL PRIMARY KEY,
  c DECIMAL(65,30) NULL
) ENGINE=InnoDB;

CREATE TABLE rt (
  id BIGINT NOT NULL PRIMARY KEY
) ENGINE=InnoDB;

INSERT INTO lt SELECT id, c FROM s;
INSERT INTO rt SELECT id FROM s;

-- Original query: returns 2 rows
SELECT DISTINCTROW
  (s.id) IS NOT TRUE AS a,
  (((NULL) IN (NULL, s.id)) IS TRUE) || ((+ (GREATEST(s.c, 0.4379614263980073)))) AS b,
  SUM(DISTINCT EXISTS (SELECT 1 WHERE FALSE)) AS c
FROM s
GROUP BY
  (s.id) IS NOT TRUE,
  (((NULL) IN (NULL, s.id)) IS TRUE) || ((+ (GREATEST(s.c, 0.4379614263980073))));

-- CTE rewrite: returns only 1 row (incorrect)
WITH
  l AS (SELECT * FROM lt),
  r AS (SELECT * FROM rt)
SELECT DISTINCTROW
  (l.id) IS NOT TRUE AS a,
  (((NULL) IN (NULL, l.id)) IS TRUE) || ((+ (GREATEST(l.c, 0.4379614263980073)))) AS b,
  SUM(DISTINCT EXISTS (SELECT 1 WHERE FALSE)) AS c
FROM l JOIN r ON l.id = r.id
GROUP BY
  (l.id) IS NOT TRUE,
  (((NULL) IN (NULL, l.id)) IS TRUE) || ((+ (GREATEST(l.c, 0.4379614263980073))));