Bug #78383 Bug with group by & group_concat
Submitted: 9 Sep 2015 14:51 Modified: 10 Sep 2015 14:28
Reporter: T. M. Email Updates:
Status: Not a Bug Impact on me:
None 
Category:MySQL Server: General Severity:S3 (Non-critical)
Version:5.5 / 5.6 OS:Any
Assigned to: CPU Architecture:Any
Tags: GROUP BY, group_concat

[9 Sep 2015 14:51] T. M.
Description:
Hello,

We spot a bug with group by & group_concat

How to repeat:
-- Step 1 create a temp table with 2 different rows of 1025 char (datatype is text)
drop temporary table if exists _test;
create temporary table _test as
select lpad('a', 1025, '0') as chaine
union
select lpad('b', 1025, '0');

-- This result is OK
select *
from _test
group by chaine;

-- In this query, all is grouped, it should not !
select group_concat(chaine)
from _test
group by chaine;

-- we found an ugly workaround...
select group_concat(chaine)
from _test
group by crc32(chaine);
[9 Sep 2015 15:36] MySQL Verification Team
I do not understand several statements in your report.

First of all, it is your firm intention to group by the entire LPAD(......) ??? What is your intent with grouping by a constant ???

Second of all, what do you mean by "All is grouped" ??????

Third, change your temporary table creation , which is grouped by a constant, with this small change:

create temporary table _test as select lpad('a', 1025, '0') as chaine union all select lpad('b', 1025, '0');
[9 Sep 2015 15:54] Hartmut Holzgraefe
Actual table creation doesn't matter, what matters is that TEXT entries longer than 1024 characters (or bytes?) that only differ at position 1025 or beyond are treated as being equal ...

Use RPAD instead of LPAD to generate the test table, or use a LPAD length parameter of <=1024 and you get a two row result set just fine. Only when using a pad length of >1024, which causes the first 1024 characters to be the same in both generated strings and only the last character to differ, GROUPY BY seems to miss the column content difference and so creates a single group instead of two ...
[9 Sep 2015 16:10] Hartmut Holzgraefe
It also only happens with TEXT, when changing the chaine column to be VARCHAR(2000) instead of TEXT results are two GROUP BY rows as expected.

So there seems to be a hard coded 1024 char/byte limit in the combination of GROUP_CONCAT()/GROUP BY. At first this looks like it might be group_concat_max_len, but the wrong grouping behavior persists when changing group_concat_max_len to a higher value like e.g. 4096

Other aggregate functions by themselves are not affected, e.g. with MAX(chaine) or COUNT(*) the GROUP BY query returns two rows as expected. 

Only when adding GROUP_CONCAT() to the mix any differences beyond the 1024th character get ignored ...
[10 Sep 2015 9:14] Hartmut Holzgraefe
Test case below. Expected result: all queries on t2 should return the same results as on t1. Actual result: on t2, where TEXT data type was used instead of VARCHAR, the queries containing GROUP_CONCAT() only return three group rows instead of four ...

-----8<-----

set session group_concat_max_len = 4096;

use test;
drop table if exists t1;
drop table if exists t2;

create table t1(id int primary key auto_increment, msg varchar(2000));
create table t2(id int primary key auto_increment, msg text);

insert into t1 select null, lpad('a', 1025, '0');
insert into t1 select null, lpad('b', 1025, '0');
insert into t1 select null, lpad('c', 1024, '0');
insert into t1 select null, lpad('d', 1024, '0');

insert into t2 select * from t1;

SELECT COUNT(*) FROM t1 GROUP BY msg;
SELECT MAX(msg) FROM t1 GROUP BY msg;
SELECT GROUP_CONCAT(msg) FROM t1 GROUP BY msg;
SELECT COUNT(*), MAX(msg), GROUP_CONCAT(msg) FROM t1 GROUP BY msg;

SELECT COUNT(*) FROM t2 GROUP BY msg;
SELECT MAX(msg) FROM t2 GROUP BY msg;
SELECT GROUP_CONCAT(msg) FROM t2 GROUP BY msg;
SELECT COUNT(*), MAX(msg), GROUP_CONCAT(msg) FROM t2 GROUP BY msg;
[10 Sep 2015 13:19] Hartmut Holzgraefe
It turned out the limiting factor is max_sort_length, which is documented behavior after all, but it's strange that this documented limitation only kicks in in the TEXT/GROUP_CONCAT()/GROUP BY scenario while in the other cases comparison was not limited by this ...
[10 Sep 2015 14:28] MySQL Verification Team
Harmut, my dear friend,

First of all, thank you for confirming that it is not a bug. Problem indeed comes from the combination of group_concat() with GROUP BY. As expected, resolved by max_sort_length, which is the only cure available. 

So, thank you very much for your participation.