Bug #112362 WHERE NOT IN with subquery is much slower on 8.1 than 5.7
Submitted: 15 Sep 2023 20:19 Modified: 19 Dec 2024 21:44
Reporter: Bradley Grainger (OCA) Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Optimizer Severity:S5 (Performance)
Version:8.1.0, 8.0.34 OS:Ubuntu (22.04)
Assigned to: CPU Architecture:x86
Tags: regression

[15 Sep 2023 20:19] Bradley Grainger
Description:
A query that was executing very quickly in MySQL 5.7.42 now executes much more slowly in MySQL 8.0.32 and 8.1.0. Instead of performing an index range scan, MySQL executes a full index scan; on a large table this may examine millions of rows, greatly impacting performance.

How to repeat:
Create the following two tables:

CREATE TABLE a
(
  a_id INTEGER NOT NULL PRIMARY KEY AUTO_INCREMENT,
  value VARCHAR(100) NOT NULL COLLATE utf8mb4_bin,
  UNIQUE KEY ix_value_unique (value)
);

CREATE TABLE b
(
  b_id INTEGER NOT NULL PRIMARY KEY AUTO_INCREMENT,
  value VARCHAR(100) NOT NULL COLLATE utf8mb4_bin,
  KEY ix_value (VALUE)
);

`a` will store a small number (thousands) of unique keys; `b` will store a large number (millions) of rows whose values may or may not exist in `a`.

A sample mysqldump file that creates 10,000 `a` values and 10,000,000 `b` values is available at https://drive.google.com/file/d/1KMRWIgMhejPzzBo9BQdJxTVt9tOXTQsh/view?usp=sharing

Run the following query to find all rows in `b` that don't have a matching value in `a`:

SELECT DISTINCT value FROM b
WHERE value NOT IN (SELECT value FROM a);

In MySQL 5.7.42 this returns "Empty set (0.07 sec)"
In MySQL 8.1.0 this returns "Empty set (2.31 sec)"

This is 33 times slower. In a production application that this was taken from, the performance was 95 times slower.

A workaround in MySQL 8.1.0 is to execute the following query instead:

WITH
  b2 AS (SELECT DISTINCT value FROM b),
  a2 AS (SELECT VALUE value FROM a)
SELECT value FROM b2
WHERE value NOT IN (SELECT value FROM a2);

It doesn't seem necessary that these CTEs should be necessary to force the query planner to pick a more optimal plan. (Additionally, this syntax is not valid on MySQL 5.7 so the application has to change its query based on which server version it's running against.)

The plan for 5.7 is:

+----+-------------+-------+------------+-------+-----------------+-----------------+---------+------+-------+----------+---------------------------------------+
| id | select_type | table | partitions | type  | possible_keys   | key             | key_len | ref  | rows  | filtered | Extra                                 |
+----+-------------+-------+------------+-------+-----------------+-----------------+---------+------+-------+----------+---------------------------------------+
|  1 | PRIMARY     | b     | NULL       | range | ix_value        | ix_value        | 402     | NULL | 10001 |   100.00 | Using where; Using index for group-by |
|  2 | SUBQUERY    | a     | NULL       | index | ix_value_unique | ix_value_unique | 402     | NULL | 10000 |   100.00 | Using index                           |
+----+-------------+-------+------------+-------+-----------------+-----------------+---------+------+-------+----------+---------------------------------------+

The plan for 8.1 is:

+----+--------------+-------------+------------+--------+---------------------+---------------------+---------+--------------+---------+----------+-----------------------------------+
| id | select_type  | table       | partitions | type   | possible_keys       | key                 | key_len | ref          | rows    | filtered | Extra                             |
+----+--------------+-------------+------------+--------+---------------------+---------------------+---------+--------------+---------+----------+-----------------------------------+
|  1 | SIMPLE       | b           | NULL       | index  | ix_value            | ix_value            | 402     | NULL         | 9111855 |   100.00 | Using index; Using temporary      |
|  1 | SIMPLE       | <subquery2> | NULL       | eq_ref | <auto_distinct_key> | <auto_distinct_key> | 403     | test.b.value |       1 |   100.00 | Using where; Not exists; Distinct |
|  2 | MATERIALIZED | a           | NULL       | index  | ix_value_unique     | ix_value_unique     | 402     | NULL         |   10000 |   100.00 | Using index                       |
+----+--------------+-------------+------------+--------+---------------------+---------------------+---------+--------------+---------+----------+-----------------------------------+

You can see that many more rows (almost 10 million) are examined.
[19 Sep 2023 11:27] MySQL Verification Team
Hi Mr. Grainger,

Thank you for your bug report.

However, there are literally hundreds of bug reports , which show significant slowdown for the read or write DMLs between 5.7 and 8.0/8.1.

This is a widely known fact.

We have found these bugs in this forum that show that behaviour:

 https://bugs.mysql.com/bug.php?id=94387, https://bugs.mysql.com/bug.php?id=93734,
https://bugs.mysql.com/bug.php?id=94283

However, we have found one bug, with exactly WHERE NOT IN, but this is an internally reported bug, which is not visible to you.

Hence, the performance degradation from 5.7 to 8.0 / 8.1 is very well known fact.

Duplicate.
[25 Sep 2023 14:12] MySQL Verification Team
Hello Bradley Grainger,

Thank you for the report and test case.
Verified as described.

regards,
Umesh
[25 Sep 2023 14:13] MySQL Verification Team
Test results - 8.0.11+,  5.7.43

Attachment: Bug112362_5.7.43_8.0.11_8.0.34_8.1.0.txt (text/plain), 18.09 KiB.

[26 Sep 2023 10:34] Roy Lyseng
Posted by developer:
 
Pretty strange nobody have noticed this regression before.
The "problem" with semi-join and anti-join is that it converts a single-table
query into a multi-table query, by dragging in the subquery tables, and
the index range scan algorithm is only implemented for single-table queries.

Until this has been fixed, there is a simple workaround that you can try:
disable semi-join (and anti-join) conversion by this command:

  set optimizer_switch='semijoin=off';

or use the equivalent query hint.
[19 Dec 2024 21:44] Jon Stephens
Documented fix as follows in the MySQL 8.0.41, 8.4.4, and 9.2.0 changelogs:

    In MySQL 8.0 and later, queries of the form SELECT DISTINCT ...
    FROM t1 WHERE NOT IN(SELECT ...) were transformed into an
    antijoin if possible, causing the optimizer not to choose a
    group skip scan for table t1 whereas it would have been chosen
    in MySQL 5.7. This resulted in a performance degradation for
    such queries. Group skip scan is not chosen, since the query is
    now no longer a single-table query following the antijoin
    transformation, and this access method is enabled only for
    single table queries. The same behaviour can be seen for queries
    which are transformed into semijoins as well. In such cases,
    group skip scan access method can still be used if the access
    method is used only for duplicate removal (that is, DISTINCT or
    GROUP BY without aggregate functions).

    To fix this, we enable group skip scan when there is only one
    table in the original query, irrespective of the number of
    semijoin tables present after internal transformations as long
    as the query contains no aggregate functions.

Closed.
[20 Dec 2024 10:45] MySQL Verification Team
Thank you, Jon.