Bug #100536 Regression on 8.0 when using DISTINCT and LEFT JOIN
Submitted: 15 Aug 2020 0:48 Modified: 17 Feb 13:47
Reporter: Juan Arruti Email Updates:
Status: Won't fix Impact on me:
None 
Category:MySQL Server: Optimizer Severity:S5 (Performance)
Version:8.0.21, 8.0.23 OS:Any
Assigned to: CPU Architecture:Any
Tags: regression

[15 Aug 2020 0:48] Juan Arruti
Description:
Query performance when using DISTINCT and LEFT JOIN on large tables has changed from 5.6 and 5.7 to 8.0. 

It does not seem to be related with TempTable storage engine since when changing to MEMORY I've got same performance.

I'll upload the test case to reproduce this issue.

How to repeat:
Below you can see an example where the query drastically increases elapsed time on 8.0:

# 5.7
mysql [localhost:5727] {msandbox} (test) > source ./test_case.sql
...

mysql [localhost:5727] {msandbox} (test) > explain SELECT    distinct t1.c1
    -> FROM      t1
    -> LEFT JOIN t2
    -> ON        t1.c1 = t2.c1
    -> LEFT JOIN t3
    -> ON        t1.c1 = t3.c1
    -> WHERE     t1.c2 = 15308 \g
+----+-------------+-------+------------+------+---------------+---------+---------+------------+--------+----------+------------------------------+
| id | select_type | table | partitions | type | possible_keys | key     | key_len | ref        | rows   | filtered | Extra                        |
+----+-------------+-------+------------+------+---------------+---------+---------+------------+--------+----------+------------------------------+
|  1 | SIMPLE      | t1    | NULL       | ALL  | PRIMARY       | NULL    | NULL    | NULL       | 194304 |    10.00 | Using where; Using temporary |
|  1 | SIMPLE      | t2    | NULL       | ref  | PRIMARY       | PRIMARY | 4       | test.t1.c1 |     14 |   100.00 | Using index; Distinct        |
|  1 | SIMPLE      | t3    | NULL       | ref  | PRIMARY       | PRIMARY | 4       | test.t1.c1 |     65 |   100.00 | Using index; Distinct        |
+----+-------------+-------+------------+------+---------------+---------+---------+------------+--------+----------+------------------------------+
3 rows in set, 1 warning (0.00 sec)

mysql [localhost:5727] {msandbox} (test) > pager md5sum ;
PAGER set to 'md5sum'

mysql [localhost:5727] {msandbox} (test) > SELECT    distinct t1.c1
    -> FROM      t1
    -> LEFT JOIN t2
    -> ON        t1.c1 = t2.c1
    -> LEFT JOIN t3
    -> ON        t1.c1 = t3.c1
    -> WHERE     t1.c2 = 15308 \g
ee30c16e7b8af0c7e0403c19d86e477f  -
1605 rows in set (0.05 sec)

# 8.0
mysql [localhost:4581] {msandbox} (test) > select @@version ;
+-----------+
| @@version |
+-----------+
| 8.0.21    |
+-----------+
1 row in set (0.00 sec)

mysql [localhost:4581] {msandbox} (test) > source ./test_case.sql
...

mysql [localhost:4581] {msandbox} (test) > explain SELECT    distinct t1.c1
    -> FROM      t1
    -> LEFT JOIN t2
    -> ON        t1.c1 = t2.c1
    -> LEFT JOIN t3
    -> ON        t1.c1 = t3.c1
    -> WHERE     t1.c2 = 15308 \g 
+----+-------------+-------+------------+------+---------------+---------+---------+------------+--------+----------+------------------------------+
| id | select_type | table | partitions | type | possible_keys | key     | key_len | ref        | rows   | filtered | Extra                        |
+----+-------------+-------+------------+------+---------------+---------+---------+------------+--------+----------+------------------------------+
|  1 | SIMPLE      | t1    | NULL       | ALL  | PRIMARY       | NULL    | NULL    | NULL       | 194304 |    10.00 | Using where; Using temporary |
|  1 | SIMPLE      | t2    | NULL       | ref  | PRIMARY       | PRIMARY | 4       | test.t1.c1 |     10 |   100.00 | Using index; Distinct        |
|  1 | SIMPLE      | t3    | NULL       | ref  | PRIMARY       | PRIMARY | 4       | test.t1.c1 |     72 |   100.00 | Using index; Distinct        |
+----+-------------+-------+------------+------+---------------+---------+---------+------------+--------+----------+------------------------------+
3 rows in set, 1 warning (0.00 sec)

mysql [localhost:4581] {msandbox} (test) > pager md5sum ;
PAGER set to 'md5sum'

mysql [localhost:4581] {msandbox} (test) > SELECT    distinct t1.c1
    -> FROM      t1
    -> LEFT JOIN t2
    -> ON        t1.c1 = t2.c1
    -> LEFT JOIN t3
    -> ON        t1.c1 = t3.c1
    -> WHERE     t1.c2 = 15308 \g
ee30c16e7b8af0c7e0403c19d86e477f  -
1605 rows in set (1 min 37.81 sec)
[17 Aug 2020 12:55] MySQL Verification Team
Hello Juan Arruti,

Thank you for the report and test case.
Verified as described.

regards,
Umesh
[17 Feb 13:47] Erlend Dahl
[3 Feb 2021 2:15] Steinar Gunderson

5.7 and older have a special optimization for DISTINCT plus unused tables; it
was traditionally a strange way of writing semijoins before we had semijoins.
For 8.0, we only carried this optimization through for the specific case of
inner joins against a single unused table; all others should use semijoin
instead.

In fact, this query uses an outer join instead of an inner join, which makes
it even stranger (t2 and t3 do nothing at all). Just remove the two joins,
and will execute quickly:

mysql> SELECT DISTINCT t1.c1 FROM t1;

Closing as won't fix.
[...]
194100 rows in set (0,04 sec)