Bug #38720 SELECT DISTINCT on multiple attributes containing spaces in UTF8 are slow.
Submitted: 11 Aug 2008 14:07 Modified: 18 Jan 2018 13:11
Reporter: Roalt Zijlstra Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Charsets Severity:S2 (Serious)
Version:5.0, 5.1, 6.0 bzr OS:Linux (Etch 4.0(am64 and x86))
Assigned to: Assigned Account CPU Architecture:Any
Tags: distinct, regression, SELECT, utf8

[11 Aug 2008 14:07] Roalt Zijlstra
Description:
Hi there,

We came across a bug in an query we are using for a longer time on 'latin1' tables. The bug comes around when using the same query on 'UTF8' or 'gb2312' tables. 

We narrowed it down to this:
We have a table with an attribute containing  empty strings or a strings with one or more spaces and a second attribute with a lot of different (email) entries. When doing a distinct select on this utf8 table the query starts a very cpu intensive calculation and the query takes a very long time to finish. 
With 10.000 entries (4312 are unique) it takes 11 seconds instead of 0.07 seconds

The query we use to test and produce the problem is the following one:

select DISTINCT email,name  from tmp_result;
4312 rows in set (23.42 sec)

In our case the name attribute could contain an empty string and cause the delay. We solved it with a work around by adding an if stamement.

select DISTINCT email, if(trim(name) = '', null, name)  from tmp_result;
4312 rows in set (0.07 sec)

I have tested this also  with the static binary of 5.0.51a and 5.1.26. It doesn't happen in 'latin1' tables.

How to repeat:
Create a small table like this:

CREATE TABLE `tmp_result` (
  `email` varchar(255) NOT NULL default '',
  `name` varchar(255) default NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

Feed it with 10000 records where email is random data and name is '' or ' '. IN the end I used random md5 records of the email addresses to exclude eventual broken email addresses with invalid characters.

Then issue:
select DISTINCT email,name  from tmp_result;

Suggested fix:
I don't have a clue how to fix it. We did use the workaround :

if(trim(name) = '', null, name)

But this returns null instead of ''. This however speeds up the query to the level what I expected.
[11 Aug 2008 14:20] Roalt Zijlstra
Hmm I mentioned 11 seconds... but that was in fact 23 seconds as stated in the SQL result.
[11 Aug 2008 17:33] Valeriy Kravchuk
Thank you for a problem report. Please, try to repeat with a newer version, 5.0.67, and inform about the results.
[12 Aug 2008 8:03] Roalt Zijlstra
I now also tested it with the 5.0.67 release glib23, but that one is not faster or slower then the earlier reported MySQL version.

I recreated the testcase on a test server and there the Debian 5.0.32 MySQL takes 1 min 35.84 sec. The 5.0.67 takes 1 min 36.62 sec. 

The if conditioned query takes only 0.23 seconds. 

Extra info: if I do 'SELECT DISTICT name,email from tmp_result;' Then the query is fast, but that still doesn't resolve my problem, because users can create the 'SQL query' themselves.
[12 Aug 2008 10:37] Sveta Smirnova
Thank you for the report.

Verified as described using test case below (to get more significant results add more rows). Version 4.1 is not affectd.

Test case:

 CREATE TABLE `tmp_result` (
  `email` varchar(255) NOT NULL default '',
  `name` varchar(255) default NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

insert into tmp_result values(md5(rand()), ' ');
insert into tmp_result select md5(rand()), name from tmp_result;
insert into tmp_result select md5(rand()), name from tmp_result;
insert into tmp_result select md5(rand()), name from tmp_result;
insert into tmp_result select md5(rand()), name from tmp_result;
insert into tmp_result select md5(rand()), name from tmp_result;
insert into tmp_result select md5(rand()), name from tmp_result;
insert into tmp_result select md5(rand()), name from tmp_result;
insert into tmp_result select md5(rand()), name from tmp_result;
insert into tmp_result select md5(rand()), name from tmp_result;
insert into tmp_result select md5(rand()), name from tmp_result;
insert into tmp_result select md5(rand()), name from tmp_result;
insert into tmp_result select md5(rand()), name from tmp_result;

select now();
--disable_result_log
select DISTINCT email,name  from tmp_result;
--enable_result_log
select now();
[18 Jan 2018 13:11] Erlend Dahl
This was fixed in 5.7.8.