MySQL Bugs: #116612: Performance degradation of distinct operator from 5.7 to 8.0

Bug #116612	Performance degradation of distinct operator from 5.7 to 8.0
Submitted:	11 Nov 2024 11:46	Modified:	11 Nov 2024 14:07
Reporter:	Jingqi Tian (OCA)	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: Optimizer	Severity:	S5 (Performance)
Version:	8.0.40, 8.0.41, 8.4.4	OS:	Any
Assigned to:		CPU Architecture:	Any
Tags:	regression

Description:
Performance degradation of distinct operator from 5.7 to 8.0.

How to repeat:
1. Create table

CREATE TABLE t1 (
	id int PRIMARY KEY,
	col1 int
);

CREATE TABLE t2 (
	id int PRIMARY KEY,
	col1 int,
	col2 varchar(1000),
	col3 varchar(1000),
	col4 varchar(1000),
	KEY index_col1 (col1)
);

CREATE TABLE t3 (
	id int PRIMARY KEY,
	col1 int,
	col2 varchar(1000),
	col3 varchar(1000),
	col4 varchar(1000),
	KEY index_col1 (col1)
);

CREATE TABLE t4 (
	id int PRIMARY KEY,
	col1 int,
	col2 varchar(1000),
	col3 varchar(1000),
	col4 varchar(1000),
	KEY index_col1 (col1)
);

2. Create procedure

CREATE PROCEDURE insert_data ()
BEGIN
	DECLARE num int;
	SET num = 1;
	WHILE num < 10000 DO
	INSERT INTO t1
	VALUES (num, num / 100);
	IF num < 1000 THEN
		INSERT INTO t2
		VALUES (num, num / 10, repeat('a', 1000), repeat('a', 1000), repeat('a', 1000));
		INSERT INTO t3
		VALUES (num, num / 10, repeat('a', 1000), repeat('a', 1000), repeat('a', 1000));
        INSERT INTO t4
		VALUES (num, num / 10, repeat('a', 1000), repeat('a', 1000), repeat('a', 1000));
	END IF;
	SET num = num + 1;
	END WHILE;
END;

3. Call procedure

call insert_data();

4. Execute the same sql on MySQL 5.7 and MySQL 8.0

select distinct t1.* from t1 left join t2 on t1.col1 = t2.col1 left join t3 on t1.col1 = t3.col1 left join t4 on t1.col1 = t4.col1;

On MySQL 5.7, it takes 0.04s, while on MySQL 8.0, it takes 4.36s.

Hello Jingqi Tian,

Thank you for the report and test case.

regards,
Umesh

Hello Jingqi Tian,

Internally development proposed a workaround, I'm quoting here for your reference.

Posted By Developer
==================

Given the query:

select distinct t1.* from t1 left join t2 on t1.col1 = t2.col1 left join t3 on t1.col1 = t3.col1 left join t4 on t1.col1 = t4.col1;

The only table referred to in the SELECT list and WHERE clause is t1. Since t1 is also the only table on the left side of a LEFT JOIN, we know the join will return all rows in t1, regardless of whether there are any matches in the other tables.  Since DISTINCT is specified, we don't need to know how many rows match. So we can remove all the different tables and reduce the query to this:

For example:

select distinct t1.* from t1 left join t2 on t1.col1 = t2.col1 left join t3 on t1.col1 = t3.col1 left join t4 on t1.col1 = t4.col1;

The result is the same as the result of

select distinct t1.* from t1;

Hence rest of the joins and table access are not needed in the given test case.

regards,
Umesh