MySQL Bugs: #98249: Distinct in sub-query causes infinite run time with large table

Bug #98249	Distinct in sub-query causes infinite run time with large table
Submitted:	16 Jan 2020 10:31	Modified:	16 Jul 2020 14:11
Reporter:	Andras Janurik	Email Updates:
Status:	Duplicate	Impact on me:	None
Category:	MySQL Server	Severity:	S5 (Performance)
Version:	8.0.18	OS:	Windows (Server 2012)
Assigned to:		CPU Architecture:	x86

Description:
The following query does not return when executing on a large table (above a certain size):

SELECT count(*) FROM 
(SELECT distinct Key3 FROM test) t;

Without the distinct it runs properly.

IMPORTANT: this was not a problem with MySQL 8.0.16.
It was not tested with 8.0.17.

How to repeat:
Test table:
CREATE TABLE `test` (
  `Key1` smallint(6) NOT NULL,
  `Key2` date NOT NULL,
  `Key3` varchar(20) NOT NULL,
  `Value` float DEFAULT NULL,
  PRIMARY KEY (`Key1`,`Key2`,`Key3`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

Query:
SELECT count(*) FROM 
(SELECT distinct Key3 FROM test) t;

The query returns properly within 12 seconds when the table contains 10 million rows.
The query runs for hours and does not return when the table contains 15 million rows.

The problem: the distinct in the sub-query.

Workaround: if the distinct is moved from the sub-query to the main query it returns properly:

SELECT count(distinct Key3) FROM 
(SELECT Key3 FROM test) t;

NOTE: it is not always reproducible.
The original table was 30M rows. 

Test #1
I created a test table and copied over 10M rows. The query worked.
(insert into test select * from original limit 10000000)
I copied over another 5M rows. The query did not work.
(insert into test select * from original limit 10000000, 5000000)

Test #2
I started over, truncated the test table and copied over 10M rows. The query worked.
I copied over another 1M rows. The query worked.
I copied over another 1M rows. The query worked.
I copied over another 1M rows. The query worked.
I copied over another 1M rows. The query worked.
I copied over another 1M rows. The query worked.
I copied over another 1M rows. The query worked. (16M rows)

So at first the query did not work with 15M rows but the second time it worked with even 16M rows.

Somehow truncating the table made a difference.
Maybe it is a key/index generation problem?

Hi Mr. Janurik,

Thank you for your bug report.

If I understand you correctly, performance on a table with 10 M rows is fast, but with 15 M rows is slow.

I do hope that you understand that each of your rows is approximately 100 bytes wide. I also suppose that you know that DiSTINCT is a VERY slow operation.

However, if you would send us a script that would generate 10 M and 15 M rows for that table, we would test and see whether you are correct.

RE: Sinisa Milivojevic

Hi, thank you for your message.

> If I understand you correctly, performance on a table with 10 M rows is fast, but with 15 M rows is slow.

Well, performance with 10M rows is fine, with 15M rows it is not just slow but the query never really returns.

> I do hope that you understand that each of your rows is approximately 100 bytes wide. I also suppose that you know that DiSTINCT is a VERY slow operation.

Yes, I understand that. The same query works fine with 30M rows on MySQL 8.0.16

> However, if you would send us a script that would generate 10 M and 15 M rows for that table, we would test and see whether you are correct.

Ok, here is a script that proves the point on my machine:

===== script start =====
CREATE TABLE `test` (
  `Key1` int NOT NULL AUTO_INCREMENT,
  `Key2` date NOT NULL,
  `Key3` varchar(20) NOT NULL,
  `Value` float DEFAULT NULL,
  PRIMARY KEY (`Key1`,`Key2`,`Key3`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

INSERT INTO `test` (`Key2`, `Key3`, `Value`) VALUES ('2000-01-01', 'key3', '1.1');

-- Generate 16777216 rows
INSERT INTO `test` (`Key2`, `Key3`, `Value`) select Key2, Key3, Value FROM temp.test;
INSERT INTO `test` (`Key2`, `Key3`, `Value`) select Key2, Key3, Value FROM temp.test;
INSERT INTO `test` (`Key2`, `Key3`, `Value`) select Key2, Key3, Value FROM temp.test;
INSERT INTO `test` (`Key2`, `Key3`, `Value`) select Key2, Key3, Value FROM temp.test;
INSERT INTO `test` (`Key2`, `Key3`, `Value`) select Key2, Key3, Value FROM temp.test;
INSERT INTO `test` (`Key2`, `Key3`, `Value`) select Key2, Key3, Value FROM temp.test;
INSERT INTO `test` (`Key2`, `Key3`, `Value`) select Key2, Key3, Value FROM temp.test;
INSERT INTO `test` (`Key2`, `Key3`, `Value`) select Key2, Key3, Value FROM temp.test;
INSERT INTO `test` (`Key2`, `Key3`, `Value`) select Key2, Key3, Value FROM temp.test;
INSERT INTO `test` (`Key2`, `Key3`, `Value`) select Key2, Key3, Value FROM temp.test;
INSERT INTO `test` (`Key2`, `Key3`, `Value`) select Key2, Key3, Value FROM temp.test;
INSERT INTO `test` (`Key2`, `Key3`, `Value`) select Key2, Key3, Value FROM temp.test;
INSERT INTO `test` (`Key2`, `Key3`, `Value`) select Key2, Key3, Value FROM temp.test;
INSERT INTO `test` (`Key2`, `Key3`, `Value`) select Key2, Key3, Value FROM temp.test;
INSERT INTO `test` (`Key2`, `Key3`, `Value`) select Key2, Key3, Value FROM temp.test;
INSERT INTO `test` (`Key2`, `Key3`, `Value`) select Key2, Key3, Value FROM temp.test;
INSERT INTO `test` (`Key2`, `Key3`, `Value`) select Key2, Key3, Value FROM temp.test;
INSERT INTO `test` (`Key2`, `Key3`, `Value`) select Key2, Key3, Value FROM temp.test;
INSERT INTO `test` (`Key2`, `Key3`, `Value`) select Key2, Key3, Value FROM temp.test;
INSERT INTO `test` (`Key2`, `Key3`, `Value`) select Key2, Key3, Value FROM temp.test;
INSERT INTO `test` (`Key2`, `Key3`, `Value`) select Key2, Key3, Value FROM temp.test;
INSERT INTO `test` (`Key2`, `Key3`, `Value`) select Key2, Key3, Value FROM temp.test;
INSERT INTO `test` (`Key2`, `Key3`, `Value`) select Key2, Key3, Value FROM temp.test;
INSERT INTO `test` (`Key2`, `Key3`, `Value`) select Key2, Key3, Value FROM temp.test;

-- Test
-- The following two queries are identical in result but different in method

-- This query completes in 9 seconds on a test machine
SELECT count(*) FROM (SELECT distinct Key3 FROM test) t;

-- This query completes in 110 seconds on the same test machine
-- With larger data set it never completes (or more than several hours)
SELECT count(distinct Key3) FROM (SELECT Key3 FROM test) t;

-- The two queries complete approx. the same time on MySQL 8.0.16
-- The problem still exists with MySQL 8.0.19
===== script end =====

I am sorry, I mixed up the last two test queries in my previous comment, here is the test section properly:

-- Test
-- The following two queries are identical in result but different in method

-- This query completes in 9 seconds on a test machine
SELECT count(distinct Key3) FROM (SELECT Key3 FROM test) t;

-- This query completes in 110 seconds on the same test machine
-- With larger data set it never completes (or more than several hours)
SELECT count(*) FROM (SELECT distinct Key3 FROM test) t;

-- The two queries complete approx. the same time on MySQL 8.0.16
-- The problem still exists with MySQL 8.0.19
===== script end =====

Hello Mr. Janurik,

Since you are using InnoDB SE, I am afraid that this is a very well known problem and not a bug.

Simply, InnoDB is a MVCC SE, which means that it has to scan the entire table for each SELECT count(*) .... command. That is why the performance is slow. Scanning a huge table (temporary or not), takes lot's of time, sometimes hours.

That is a reason why most programmers query the I_S table for the approximate number of rows. In your case, it would be a correct value, since two connections can not use a single temporary table.

Difference between older and newer 8.0 release is due to the changes in the temporary tables engine.

Not a bug.

Mr. Milivojevic,

I understand what you say but I don't think count(*) is the real problem here. Actually in our original query we did not use count(*), I just made it up for the test.

Let me show you the original query.
Another table is needed for the test:

CREATE TABLE `key3_list` (
  `Key3` VARCHAR(20) NOT NULL,
  PRIMARY KEY (`Key3`)
) ENGINE=INNODB DEFAULT CHARSET=UTF8MB4;

INSERT INTO `key3_list` (`Key3`) VALUES ('key3');

Test query #1:

SELECT t.Key3 FROM 
(SELECT distinct Key3 FROM test) t
LEFT JOIN key3_list k USING (Key3) WHERE k.Key3 IS NULL;

Test query #2:

SELECT distinct t.Key3 FROM 
(SELECT Key3 FROM test) t
LEFT JOIN key3_list k USING (Key3) WHERE k.Key3 IS NULL;

The only difference between the two queries is that the distinct is in the main query or the sub-query. (This is the main point of my bug report.)

Run times:

8.0.16
Query #1: 9 sec
Query #2: 21 sec

This is logical because the join is made on a shorter list in #1.

8.0.19
Query #1: 107 sec
Query #2: 21 sec

This is not logical because the join is made on a shorter list in #1.
Why is #1 so much slower on 8.0.19?

IMPORTANT
Run time escalates exponentially with larger data set.

I added another 4M rows to the table:
INSERT INTO `temp`.`test` (`Key2`, `Key3`, `Value`) select Key2, Key3, Value FROM temp.test limit 4000000;

8.0.19
Query #1: 225 sec
Query #2: 29 sec

So data size is 25% larger but #1 run time is more than double.

Can you please help me understand this?

Hi Mr. Janurik,

First of all, you are coming now with a completely different proposition. Your queries are changed so much that the causes of the execution speed differences are totally different. In short, this is completely different bug report.

There are two different possibilities for this regression. First one is that there are three internal bugs fixed between 8.0.16 and 8.0.18 that might have influenced the performance regression. Second, 8.0.18 has seen the introduction of the hash joins.

Hence, we need the explanation for the execution plan for the first query in both 8.0.16 and 8.0.18.

In order to get that info, we need extended EXPLAIN outputs for that query with both releases. Please, use the default TABLE format output, since there is a bug in other outputs, when dealing with hash joins.

We shall wait on your feedback.

Mr. Milivojevic,

> First of all, you are coming now with a completely different proposition. Your queries are changed so much that the causes of the execution speed differences are totally different. In short, this is completely different bug report.

I do not agree with this. My point was from the very beginning that distinct breaks the query when in the sub-query. It is true regardless of what I use in the main query, whether it is a "select *" or a "select count(*)". In my opinion it is the same bug report.

> In order to get that info, we need extended EXPLAIN outputs for that query with both releases. Please, use the default TABLE format output, since there is a bug in other outputs, when dealing with hash joins.

Can you please show me how to produce this output?

There is a chapter in our 8.0 Reference Manual only on EXPLAIN. 

Entire syntax is explained there.

Extended explain output MySQL Workbench csv export

EXPLAIN
SELECT t.Key3 FROM 
(SELECT DISTINCT Key3 FROM test) t
LEFT JOIN key3_list k USING (Key3) WHERE k.Key3 IS NULL;

SHOW WARNINGS;

=== MySQL 8.0.16 ===

EXPLAIN
id,select_type,table,partitions,type,possible_keys,key,key_len,ref,rows,filtered,Extra
1,PRIMARY,<derived2>,NULL,ALL,NULL,NULL,NULL,NULL,15636644,100.00,NULL
1,PRIMARY,k,NULL,eq_ref,PRIMARY,PRIMARY,82,t.Key3,1,100.00,"Using where; Not exists; Using index"
2,DERIVED,test,NULL,index,PRIMARY,PRIMARY,89,NULL,15636644,100.00,"Using index; Using temporary" 

WARNINGS
Level,Code,Message
Note,1003,"/* select#1 */ select `t`.`Key3` AS `Key3` from (/* select#2 */ select distinct `temp`.`test`.`Key3` AS `Key3` from `temp`.`test`) `t` left join `temp`.`key3_list` `k` on((`temp`.`k`.`Key3` = `t`.`Key3`)) where isnull(`temp`.`k`.`Key3`)" 

=== MySQL 8.0.19 ===

EXPLAIN
id,select_type,table,partitions,type,possible_keys,key,key_len,ref,rows,filtered,Extra
1,PRIMARY,<derived2>,NULL,ALL,NULL,NULL,NULL,NULL,19980734,100.00,NULL
1,PRIMARY,k,NULL,eq_ref,PRIMARY,PRIMARY,82,t.Key3,1,100.00,"Using where; Not exists; Using index"
2,DERIVED,test,NULL,index,PRIMARY,PRIMARY,89,NULL,19980734,100.00,"Using index; Using temporary" 

WARNINGS
Level,Code,Message
Note,1003,"/* select#1 */ select `t`.`Key3` AS `Key3` from (/* select#2 */ select distinct `temp`.`test`.`Key3` AS `Key3` from `temp`.`test`) `t` left join `temp`.`key3_list` `k` on((`temp`.`k`.`Key3` = `t`.`Key3`)) where (`temp`.`k`.`Key3` is null)"

Hi!

Thank you for the output.

Do you have any idea of why 8.0.16 sees 16 million rows, while 8.0.19 sees almost 20 million rows ????

Hi,

You are welcome for the output.

Regarding the 20 million rows: in my previous comment [5 Feb 9:31] I wrote that I added another 4M rows to the 16M row table to show that run time escalates exponentially with larger data set.

I can remove it if it is a problem.

No, it is not a problem. Don't remove ......

Hi,

There is a possibility that this report is a duplicate of an internally reported bug, that is not visible to you.

If there are less then 50.000 rows in the output of the query, can you run this command and then run your first query again and measure the time. 

First use this command on 8.0.19 server, all from the same connection:

SET SESSION internal_tmp_mem_storage_engine=MEMORY;

Then run your first query and measure time. After this, you can close that session.

Just report the time back to us, please ....

Hi,

This fixed it.
The query run time changed from 135 sec to 18 sec.

How can I know when this fix is released?

In the mean time shall I use this workaround? Is it safe to use (under 50k)?

Thanks

So, based on all test cases, this bug is a duplicate of an internal bug.

You can use this workaround for small result sets. That depends on the RAM that you have.

When the bug is fixed, this page will be updated.

Hi, 

Can you please let me know if this bug was fixed?
8.0.20 and 8.0.21 was released in the mean time but I cannot tell based on the change logs whether it was fixed.

Thanks

I haven't tested this testcase myself, but this seems relevant:

https://dev.mysql.com/doc/relnotes/mysql/8.0/en/news-8-0-21.html

"InnoDB: Chunk size was not calculated correctly when deallocating memory from the TempTable storage engine, causing a regression in SELECT DISTINCT query performance. (Bug #30562964)"

Hi,

Yes, it seems like the problem is gone, query execution times look good.

Interestingly, insert execution times are much higher. The test script that inserts the 16 million row test data now takes twice as long. But that is a different issue I guess.

Thank you, Mr. Janurik.