Bug #98249 Distinct in sub-query causes infinite run time with large table
Submitted: 16 Jan 2020 10:31 Modified: 16 Jul 2020 14:11
Reporter: Andras Janurik Email Updates:
Status: Duplicate Impact on me:
None 
Category:MySQL Server Severity:S5 (Performance)
Version:8.0.18 OS:Windows (Server 2012)
Assigned to: CPU Architecture:x86

[16 Jan 2020 10:31] Andras Janurik
Description:
The following query does not return when executing on a large table (above a certain size):

SELECT count(*) FROM 
(SELECT distinct Key3 FROM test) t;

Without the distinct it runs properly.

IMPORTANT: this was not a problem with MySQL 8.0.16.
It was not tested with 8.0.17.

How to repeat:
Test table:
CREATE TABLE `test` (
  `Key1` smallint(6) NOT NULL,
  `Key2` date NOT NULL,
  `Key3` varchar(20) NOT NULL,
  `Value` float DEFAULT NULL,
  PRIMARY KEY (`Key1`,`Key2`,`Key3`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

Query:
SELECT count(*) FROM 
(SELECT distinct Key3 FROM test) t;

The query returns properly within 12 seconds when the table contains 10 million rows.
The query runs for hours and does not return when the table contains 15 million rows.

The problem: the distinct in the sub-query.

Workaround: if the distinct is moved from the sub-query to the main query it returns properly:

SELECT count(distinct Key3) FROM 
(SELECT Key3 FROM test) t;

NOTE: it is not always reproducible.
The original table was 30M rows. 

Test #1
I created a test table and copied over 10M rows. The query worked.
(insert into test select * from original limit 10000000)
I copied over another 5M rows. The query did not work.
(insert into test select * from original limit 10000000, 5000000)

Test #2
I started over, truncated the test table and copied over 10M rows. The query worked.
I copied over another 1M rows. The query worked.
I copied over another 1M rows. The query worked.
I copied over another 1M rows. The query worked.
I copied over another 1M rows. The query worked.
I copied over another 1M rows. The query worked.
I copied over another 1M rows. The query worked. (16M rows)

So at first the query did not work with 15M rows but the second time it worked with even 16M rows.

Somehow truncating the table made a difference.
Maybe it is a key/index generation problem?
[16 Jan 2020 14:59] MySQL Verification Team
Hi Mr. Janurik,

Thank you for your bug report.

If I understand you correctly, performance on a table with 10 M rows is fast, but with 15 M rows is slow.

I do hope that you understand that each of your rows is approximately 100 bytes wide. I also suppose that you know that DiSTINCT is a VERY slow operation.

However, if you would send us a script that would generate 10 M and 15 M rows for that table, we would test and see whether you are correct.
[4 Feb 2020 16:24] Andras Janurik
RE: Sinisa Milivojevic

Hi, thank you for your message.

> If I understand you correctly, performance on a table with 10 M rows is fast, but with 15 M rows is slow.

Well, performance with 10M rows is fine, with 15M rows it is not just slow but the query never really returns.

> I do hope that you understand that each of your rows is approximately 100 bytes wide. I also suppose that you know that DiSTINCT is a VERY slow operation.

Yes, I understand that. The same query works fine with 30M rows on MySQL 8.0.16

> However, if you would send us a script that would generate 10 M and 15 M rows for that table, we would test and see whether you are correct.

Ok, here is a script that proves the point on my machine:

===== script start =====
CREATE TABLE `test` (
  `Key1` int NOT NULL AUTO_INCREMENT,
  `Key2` date NOT NULL,
  `Key3` varchar(20) NOT NULL,
  `Value` float DEFAULT NULL,
  PRIMARY KEY (`Key1`,`Key2`,`Key3`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

INSERT INTO `test` (`Key2`, `Key3`, `Value`) VALUES ('2000-01-01', 'key3', '1.1');

-- Generate 16777216 rows
INSERT INTO `test` (`Key2`, `Key3`, `Value`) select Key2, Key3, Value FROM temp.test;
INSERT INTO `test` (`Key2`, `Key3`, `Value`) select Key2, Key3, Value FROM temp.test;
INSERT INTO `test` (`Key2`, `Key3`, `Value`) select Key2, Key3, Value FROM temp.test;
INSERT INTO `test` (`Key2`, `Key3`, `Value`) select Key2, Key3, Value FROM temp.test;
INSERT INTO `test` (`Key2`, `Key3`, `Value`) select Key2, Key3, Value FROM temp.test;
INSERT INTO `test` (`Key2`, `Key3`, `Value`) select Key2, Key3, Value FROM temp.test;
INSERT INTO `test` (`Key2`, `Key3`, `Value`) select Key2, Key3, Value FROM temp.test;
INSERT INTO `test` (`Key2`, `Key3`, `Value`) select Key2, Key3, Value FROM temp.test;
INSERT INTO `test` (`Key2`, `Key3`, `Value`) select Key2, Key3, Value FROM temp.test;
INSERT INTO `test` (`Key2`, `Key3`, `Value`) select Key2, Key3, Value FROM temp.test;
INSERT INTO `test` (`Key2`, `Key3`, `Value`) select Key2, Key3, Value FROM temp.test;
INSERT INTO `test` (`Key2`, `Key3`, `Value`) select Key2, Key3, Value FROM temp.test;
INSERT INTO `test` (`Key2`, `Key3`, `Value`) select Key2, Key3, Value FROM temp.test;
INSERT INTO `test` (`Key2`, `Key3`, `Value`) select Key2, Key3, Value FROM temp.test;
INSERT INTO `test` (`Key2`, `Key3`, `Value`) select Key2, Key3, Value FROM temp.test;
INSERT INTO `test` (`Key2`, `Key3`, `Value`) select Key2, Key3, Value FROM temp.test;
INSERT INTO `test` (`Key2`, `Key3`, `Value`) select Key2, Key3, Value FROM temp.test;
INSERT INTO `test` (`Key2`, `Key3`, `Value`) select Key2, Key3, Value FROM temp.test;
INSERT INTO `test` (`Key2`, `Key3`, `Value`) select Key2, Key3, Value FROM temp.test;
INSERT INTO `test` (`Key2`, `Key3`, `Value`) select Key2, Key3, Value FROM temp.test;
INSERT INTO `test` (`Key2`, `Key3`, `Value`) select Key2, Key3, Value FROM temp.test;
INSERT INTO `test` (`Key2`, `Key3`, `Value`) select Key2, Key3, Value FROM temp.test;
INSERT INTO `test` (`Key2`, `Key3`, `Value`) select Key2, Key3, Value FROM temp.test;
INSERT INTO `test` (`Key2`, `Key3`, `Value`) select Key2, Key3, Value FROM temp.test;

-- Test
-- The following two queries are identical in result but different in method

-- This query completes in 9 seconds on a test machine
SELECT count(*) FROM (SELECT distinct Key3 FROM test) t;

-- This query completes in 110 seconds on the same test machine
-- With larger data set it never completes (or more than several hours)
SELECT count(distinct Key3) FROM (SELECT Key3 FROM test) t;

-- The two queries complete approx. the same time on MySQL 8.0.16
-- The problem still exists with MySQL 8.0.19
===== script end =====
[4 Feb 2020 16:34] Andras Janurik
I am sorry, I mixed up the last two test queries in my previous comment, here is the test section properly:

-- Test
-- The following two queries are identical in result but different in method

-- This query completes in 9 seconds on a test machine
SELECT count(distinct Key3) FROM (SELECT Key3 FROM test) t;

-- This query completes in 110 seconds on the same test machine
-- With larger data set it never completes (or more than several hours)
SELECT count(*) FROM (SELECT distinct Key3 FROM test) t;

-- The two queries complete approx. the same time on MySQL 8.0.16
-- The problem still exists with MySQL 8.0.19
===== script end =====
[4 Feb 2020 17:21] MySQL Verification Team
Hello Mr. Janurik,

Since you are using InnoDB SE, I am afraid that this is a very well known problem and not a bug.

Simply, InnoDB is a MVCC SE, which means that it has to scan the entire table for each SELECT count(*) .... command. That is why the performance is slow. Scanning a huge table (temporary or not), takes lot's of time, sometimes hours.

That is a reason why most programmers query the I_S table for the approximate number of rows. In your case, it would be a correct value, since two connections can not use a single temporary table.

Difference between older and newer 8.0 release is due to the changes in the temporary tables engine.

Not a bug.
[5 Feb 2020 9:31] Andras Janurik
Mr. Milivojevic,

I understand what you say but I don't think count(*) is the real problem here. Actually in our original query we did not use count(*), I just made it up for the test.

Let me show you the original query.
Another table is needed for the test:

CREATE TABLE `key3_list` (
  `Key3` VARCHAR(20) NOT NULL,
  PRIMARY KEY (`Key3`)
) ENGINE=INNODB DEFAULT CHARSET=UTF8MB4;

INSERT INTO `key3_list` (`Key3`) VALUES ('key3');

Test query #1:

SELECT t.Key3 FROM 
(SELECT distinct Key3 FROM test) t
LEFT JOIN key3_list k USING (Key3) WHERE k.Key3 IS NULL;

Test query #2:

SELECT distinct t.Key3 FROM 
(SELECT Key3 FROM test) t
LEFT JOIN key3_list k USING (Key3) WHERE k.Key3 IS NULL;

The only difference between the two queries is that the distinct is in the main query or the sub-query. (This is the main point of my bug report.)

Run times:

8.0.16
Query #1: 9 sec
Query #2: 21 sec

This is logical because the join is made on a shorter list in #1.

8.0.19
Query #1: 107 sec
Query #2: 21 sec

This is not logical because the join is made on a shorter list in #1.
Why is #1 so much slower on 8.0.19?

IMPORTANT
Run time escalates exponentially with larger data set.

I added another 4M rows to the table:
INSERT INTO `temp`.`test` (`Key2`, `Key3`, `Value`) select Key2, Key3, Value FROM temp.test limit 4000000;

8.0.19
Query #1: 225 sec
Query #2: 29 sec

So data size is 25% larger but #1 run time is more than double.

Can you please help me understand this?
[5 Feb 2020 13:09] MySQL Verification Team
Hi Mr. Janurik,

First of all, you are coming now with a completely different proposition. Your queries are changed so much that the causes of the execution speed differences are totally different. In short, this is completely different bug report.

There are two different possibilities for this regression. First one is that there are three internal bugs fixed between 8.0.16 and 8.0.18 that might have influenced the performance regression. Second, 8.0.18 has seen the introduction of the hash joins.

Hence, we need the explanation for the execution plan for the first query in both 8.0.16 and 8.0.18.

In order to get that info, we need extended EXPLAIN outputs for that query with both releases. Please, use the default TABLE format output, since there is a bug in other outputs, when dealing with hash joins.

We shall wait on your feedback.
[5 Feb 2020 13:50] Andras Janurik
Mr. Milivojevic,

> First of all, you are coming now with a completely different proposition. Your queries are changed so much that the causes of the execution speed differences are totally different. In short, this is completely different bug report.

I do not agree with this. My point was from the very beginning that distinct breaks the query when in the sub-query. It is true regardless of what I use in the main query, whether it is a "select *" or a "select count(*)". In my opinion it is the same bug report.

> In order to get that info, we need extended EXPLAIN outputs for that query with both releases. Please, use the default TABLE format output, since there is a bug in other outputs, when dealing with hash joins.

Can you please show me how to produce this output?
[5 Feb 2020 13:53] MySQL Verification Team
There is a chapter in our 8.0 Reference Manual only on EXPLAIN. 

Entire syntax is explained there.
[5 Feb 2020 14:14] Andras Janurik
Extended explain output MySQL Workbench csv export

EXPLAIN
SELECT t.Key3 FROM 
(SELECT DISTINCT Key3 FROM test) t
LEFT JOIN key3_list k USING (Key3) WHERE k.Key3 IS NULL;

SHOW WARNINGS;

=== MySQL 8.0.16 ===

EXPLAIN
id,select_type,table,partitions,type,possible_keys,key,key_len,ref,rows,filtered,Extra
1,PRIMARY,<derived2>,NULL,ALL,NULL,NULL,NULL,NULL,15636644,100.00,NULL
1,PRIMARY,k,NULL,eq_ref,PRIMARY,PRIMARY,82,t.Key3,1,100.00,"Using where; Not exists; Using index"
2,DERIVED,test,NULL,index,PRIMARY,PRIMARY,89,NULL,15636644,100.00,"Using index; Using temporary" 

WARNINGS
Level,Code,Message
Note,1003,"/* select#1 */ select `t`.`Key3` AS `Key3` from (/* select#2 */ select distinct `temp`.`test`.`Key3` AS `Key3` from `temp`.`test`) `t` left join `temp`.`key3_list` `k` on((`temp`.`k`.`Key3` = `t`.`Key3`)) where isnull(`temp`.`k`.`Key3`)" 

=== MySQL 8.0.19 ===

EXPLAIN
id,select_type,table,partitions,type,possible_keys,key,key_len,ref,rows,filtered,Extra
1,PRIMARY,<derived2>,NULL,ALL,NULL,NULL,NULL,NULL,19980734,100.00,NULL
1,PRIMARY,k,NULL,eq_ref,PRIMARY,PRIMARY,82,t.Key3,1,100.00,"Using where; Not exists; Using index"
2,DERIVED,test,NULL,index,PRIMARY,PRIMARY,89,NULL,19980734,100.00,"Using index; Using temporary" 

WARNINGS
Level,Code,Message
Note,1003,"/* select#1 */ select `t`.`Key3` AS `Key3` from (/* select#2 */ select distinct `temp`.`test`.`Key3` AS `Key3` from `temp`.`test`) `t` left join `temp`.`key3_list` `k` on((`temp`.`k`.`Key3` = `t`.`Key3`)) where (`temp`.`k`.`Key3` is null)"
[5 Feb 2020 14:39] MySQL Verification Team
Hi!

Thank you for the output.

Do you have any idea of why 8.0.16 sees 16 million rows, while 8.0.19 sees almost 20 million rows ????
[5 Feb 2020 14:44] Andras Janurik
Hi,

You are welcome for the output.

Regarding the 20 million rows: in my previous comment [5 Feb 9:31] I wrote that I added another 4M rows to the 16M row table to show that run time escalates exponentially with larger data set.

I can remove it if it is a problem.
[5 Feb 2020 14:48] MySQL Verification Team
No, it is not a problem. Don't remove ......
[5 Feb 2020 15:05] MySQL Verification Team
Hi,

There is a possibility that this report is a duplicate of an internally reported bug, that is not visible to you.

If there are less then 50.000 rows in the output of the query, can you run this command and then run your first query again and measure the time. 

First use this command on 8.0.19 server, all from the same connection:

SET SESSION internal_tmp_mem_storage_engine=MEMORY;

Then run your first query and measure time. After this, you can close that session.

Just report the time back to us, please ....
[5 Feb 2020 15:50] Andras Janurik
Hi,

This fixed it.
The query run time changed from 135 sec to 18 sec.

How can I know when this fix is released?

In the mean time shall I use this workaround? Is it safe to use (under 50k)?

Thanks
[5 Feb 2020 16:18] MySQL Verification Team
So, based on all test cases, this bug is a duplicate of an internal bug.

You can use this workaround for small result sets. That depends on the RAM that you have.

When the bug is fixed, this page will be updated.
[15 Jul 2020 15:49] Andras Janurik
Hi, 

Can you please let me know if this bug was fixed?
8.0.20 and 8.0.21 was released in the mean time but I cannot tell based on the change logs whether it was fixed.

Thanks
[15 Jul 2020 17:07] MySQL Verification Team
I haven't tested this testcase myself, but this seems relevant:

https://dev.mysql.com/doc/relnotes/mysql/8.0/en/news-8-0-21.html

"InnoDB: Chunk size was not calculated correctly when deallocating memory from the TempTable storage engine, causing a regression in SELECT DISTINCT query performance. (Bug #30562964)"
[16 Jul 2020 14:11] Andras Janurik
Hi,

Yes, it seems like the problem is gone, query execution times look good.

Interestingly, insert execution times are much higher. The test script that inserts the 16 million row test data now takes twice as long. But that is a different issue I guess.
[16 Jul 2020 14:14] MySQL Verification Team
Thank you, Mr. Janurik.