Bug #120861 SELECT DISTINCT inner join expands duplicate-insensitive fanout instead of using semijoin/existence semantics
Submitted: 3 Jul 12:43 Modified: 4 Jul 13:48
Reporter: Zack Morgan Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: Optimizer Severity:S5 (Performance)
Version:9.7.1 OS:Ubuntu (22.04.4 LTS)
Assigned to: CPU Architecture:x86 (x86_64, Intel(R) Xeon(R) Gold 5220)
Tags: join-order, Optimizer, performance, semijoin

[3 Jul 12:43] Zack Morgan
Description:
I found a SELECT DISTINCT query where one inner-joined table contributes no columns to the final DISTINCT projection and is only used to test existence of matching rows.

The query shape is:

  SELECT DISTINCT t0.c1 AS ref0, t3.c1, t0.c0
  FROM t3
  JOIN t0 ON t3.c0 = t0.c0
  JOIN t1 ON t1.c0 = t0.c0
  ORDER BY t0.c0
  LIMIT 8;

In this query, t1 contributes no projected columns. It is only used to check whether a matching t1.c0 exists. Since the result is SELECT DISTINCT over columns from t0 and t3, duplicate matches from t1 do not change the final result. Therefore, the t1 side is duplicate-insensitive and could be handled like an existence / semijoin check.

However, MySQL expands the t1 fanout in the inner-join form, produces many duplicate intermediate rows, and only removes them later with DISTINCT.

Data shape from the reproducer:

  t0: 90 rows, 8 distinct c0 values
  t1: 203 rows, 32 distinct c0 values
  t3: 117 rows, 8 distinct c0 values

There is significant fanout on t1. For example:

  c0 = 1: t0_rows = 12, t1_rows = 48, t3_rows = 15, raw_inner_join_rows = 8640
  c0 = 2: t0_rows = 11, t1_rows = 19, t3_rows = 15, raw_inner_join_rows = 3135
  c0 = 3: t0_rows = 11, t1_rows = 15, t3_rows = 15, raw_inner_join_rows = 2475

The default DISTINCT inner join, the JOIN_ORDER hinted query, and the EXISTS rewrite all return the same result:

  default_rows = 7
  hinted_rows  = 7
  exists_rows  = 7

The set differences are all zero:

  default_minus_hinted = 0
  hinted_minus_default = 0
  default_minus_exists = 0
  exists_minus_default = 0

Observed behavior for the default DISTINCT inner join:

  -> Limit: 8 row(s)  (cost=422..422 rows=8) (actual time=18.7..18.7 rows=7 loops=1)
      -> Table scan on <temporary>  (cost=422..433 rows=742) (actual time=18.7..18.7 rows=7 loops=1)
          -> Temporary table with deduplication  (cost=422..422 rows=742) (actual time=18.7..18.7 rows=7 loops=1)
              -> Limit table size: 8 unique row(s)
                  -> Nested loop inner join  (cost=251 rows=742) (actual time=0.043..13 rows=20223 loops=1)
                      -> Nested loop inner join  (cost=27.9 rows=50.8) (actual time=0.0324..0.87 rows=1368 loops=1)
                          -> Filter: (t0.c0 is not null)  (cost=0.0933 rows=8) (actual time=0.00834..0.0603 rows=90 loops=1)
                              -> Covering index scan on t0 using i1  (cost=0.0933 rows=8) (actual time=0.00762..0.0528 rows=90 loops=1)
                          -> Covering index lookup on t1 using i2 (c0 = t0.c0)  (cost=0.259 rows=6.34) (actual time=973e-6..0.00801 rows=15.2 loops=90)
                      -> Covering index lookup on t3 using i7 (c0 = t0.c0)  (cost=0.263 rows=14.6) (actual time=920e-6..0.00791 rows=14.8 loops=1368)

This expands t1 first, creates 1368 intermediate rows before joining t3, then produces 20223 rows before DISTINCT reduces the final result to only 7 rows.

A JOIN_ORDER(t3,t0,t1) hinted plan is much faster:

  -> Limit: 8 row(s)  (actual time=2.92..2.92 rows=7 loops=1)
      -> Sort: t0.c0, limit input to 8 row(s) per chunk  (actual time=2.92..2.92 rows=7 loops=1)
          -> Table scan on <temporary>  (cost=941..960 rows=1316) (actual time=2.91..2.91 rows=7 loops=1)
              -> Temporary table with deduplication  (cost=941..941 rows=1316) (actual time=2.91..2.91 rows=7 loops=1)
                  -> Nested loop inner join  (cost=638 rows=1316) (actual time=0.323..2.53 rows=1137 loops=1)
                      -> Nested loop inner join  (cost=174 rows=1316) (actual time=0.0132..0.948 rows=1317 loops=1)
                          -> Filter: (t3.c0 is not null)  (cost=12 rows=117) (actual time=0.0083..0.0744 rows=117 loops=1)
                              -> Covering index scan on t3 using i7  (cost=12 rows=117) (actual time=0.00765..0.0649 rows=117 loops=1)
                          -> Covering index lookup on t0 using i1 (c0 = t3.c0)  (cost=0.267 rows=11.2) (actual time=962e-6..0.00672 rows=11.3 loops=117)
                      -> Limit: 1 row(s)  (cost=0.253 rows=1) (actual time=992e-6..0.00102 rows=0.863 loops=1317)
                          -> Covering index lookup on t1 using i2 (c0 = t3.c0)  (cost=0.253 rows=6.34) (actual time=860e-6..860e-6 rows=0.863 loops=1317)

This plan effectively treats the t1 side as an existence check by using Limit: 1 row(s), avoiding expansion of all duplicate t1 rows.

An equivalent EXISTS rewrite is even faster:

  -> Limit: 8 row(s)  (cost=8043..8043 rows=8) (actual time=1.31..1.31 rows=7 loops=1)
      -> Table scan on <temporary>  (cost=8043..8342 rows=23751) (actual time=1.3..1.31 rows=7 loops=1)
          -> Temporary table with deduplication  (cost=8043..8043 rows=23751) (actual time=1.3..1.3 rows=7 loops=1)
              -> Limit table size: 8 unique row(s)
                  -> Nested loop inner join  (cost=2570 rows=23751) (actual time=0.156..0.961 rows=1137 loops=1)
                      -> Nested loop inner join  (cost=171 rows=1624) (actual time=0.152..0.24 rows=78 loops=1)
                          -> Filter: (t0.c0 is not null)  (cost=0.0933 rows=8) (actual time=0.00619..0.058 rows=90 loops=1)
                              -> Covering index scan on t0 using i1  (cost=0.0933 rows=8) (actual time=0.00573..0.0506 rows=90 loops=1)
                          -> Single-row index lookup on <subquery2> using <auto_distinct_key> (c0 = t0.c0)  (cost=67.8..67.8 rows=1) (actual time=0.00178..0.00185 rows=0.867 loops=90)
                              -> Materialize with deduplication  (cost=67.8..67.8 rows=203) (actual time=0.131..0.131 rows=32 loops=1)
                                  -> Filter: (t1.c0 is not null)  (cost=21.1 rows=203) (actual time=0.00183..0.0963 rows=203 loops=1)
                                      -> Covering index scan on t1 using i2  (cost=21.1 rows=203) (actual time=0.00167..0.0842 rows=203 loops=1)
                      -> Covering index lookup on t3 using i7 (c0 = t0.c0)  (cost=0.276 rows=14.6) (actual time=0.001..0.00821 rows=14.6 loops=78)

The EXISTS rewrite materializes/deduplicates t1.c0 and uses a single-row lookup, avoiding duplicate fanout.

The same issue also appears without ORDER BY/LIMIT. The DISTINCT inner join still expands 20223 rows and takes about 20 ms, while the hinted form takes about 3.5 ms and the EXISTS rewrite takes about 1.3 ms.

So this does not appear to be only an ORDER BY/LIMIT early-out issue. The more fundamental issue is that a duplicate-insensitive inner join under SELECT DISTINCT is not optimized/costed like an existence or semijoin check.

How to repeat:
Run the attached SQL file:

  mysql_distinct_semijoin_missed_repro.sql

For example:

  mysql -h127.0.0.1 -P34971 -uroot -pxxx --comments --table \
    < mysql_distinct_semijoin_missed_repro.sql \
    > mysql_distinct_semijoin_missed_repro_result.txt 2>&1

The test case creates four InnoDB tables, inserts deterministic data, creates indexes, and runs ANALYZE TABLE.

The important query is:

  SELECT DISTINCT t0.c1 AS ref0, t3.c1, t0.c0
  FROM t3
  JOIN t0 ON t3.c0 = t0.c0
  JOIN t1 ON t1.c0 = t0.c0
  ORDER BY t0.c0
  LIMIT 8;

The query is compared with:

1. A JOIN_ORDER(t3,t0,t1) hinted variant.
2. An equivalent EXISTS rewrite:

     SELECT DISTINCT t0.c1 AS ref0, t3.c1, t0.c0
     FROM t3
     JOIN t0 ON t3.c0 = t0.c0
     WHERE EXISTS (
       SELECT 1
       FROM t1
       WHERE t1.c0 = t0.c0
     )
     ORDER BY t0.c0
     LIMIT 8;

The script also checks result equivalence:

  default_rows = 7
  hinted_rows  = 7
  exists_rows  = 7

And all set differences are zero:

  default_minus_hinted = 0
  hinted_minus_default = 0
  default_minus_exists = 0
  exists_minus_default = 0

Expected observation:

The default SELECT DISTINCT inner join expands duplicate t1 fanout and produces many duplicate intermediate rows before deduplication.

Observed on MySQL 9.7.1:

  default DISTINCT inner join: around 18.7-21.3 ms
  JOIN_ORDER(t3,t0,t1):       around 2.9-3.6 ms
  EXISTS rewrite:             around 1.3 ms

The attached output file contains full EXPLAIN FORMAT=JSON and EXPLAIN ANALYZE output.

Suggested fix:
Consider detecting cases where an inner-joined table contributes no columns to the SELECT DISTINCT projection and is only used to test existence of matching rows.

In this test case, t1 is only used through:

  t1.c0 = t0.c0

and no t1 column is projected. Because the query is SELECT DISTINCT over columns from t0 and t3, duplicate matches from t1 do not affect the final result.

The optimizer could treat this as a duplicate-insensitive semijoin/existence check, or at least cost join orders that place such a table last with early-out / first-match behavior.

An equivalent EXISTS rewrite already allows MySQL to use materialization with deduplication and single-row lookup on the deduplicated t1.c0 values. A similar transformation or costing rule for the SELECT DISTINCT inner-join form could avoid expanding duplicate fanout before DISTINCT.
[4 Jul 13:48] Chaithra Marsur Gopala Reddy
Hi Zack Morgan,

Thank you for the test case. Verified as described.