Bug #23451 GROUP_CONCAT truncates a multibyte utf8 character
Submitted: 19 Oct 2006 2:50 Modified: 17 Nov 2006 18:19
Reporter: Steven Wong Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server Severity:S2 (Serious)
Version:5.0.27-BK, 5.0.22 OS:Linux (Linux)
Assigned to: Alexander Barkov CPU Architecture:Any
Tags: GROUP_CONCAT multibyte utf8

[19 Oct 2006 2:50] Steven Wong
Description:
GROUP_CONCAT of a utf8 column truncates in the middle of a multibyte character, if the character is at the boundary where GROUP_CONCAT cuts off the concatenated string according to the setting group_concat_max_len. This produces an invalid utf8 character in the result string.

How to repeat:
The following simple script reproduces the problem.

-- Prerequisite: MySQL server's group_concat_max_len must not be configured, so
-- that the default value (1024) is in use.

SET NAMES utf8;

DROP TABLE IF EXISTS TEST_GROUP_CONCAT;

CREATE TABLE TEST_GROUP_CONCAT (
X VARCHAR(2048) CHARACTER SET utf8 NOT NULL,
Y INTEGER NOT NULL
)
ENGINE=InnoDB;

INSERT INTO TEST_GROUP_CONCAT VALUES
(REPEAT('a', 1022), 0),
(_utf8 x'c3b7', 0);

SELECT X, CHAR_LENGTH(X), LENGTH(X), CHARSET(X)
FROM TEST_GROUP_CONCAT;

SET group_concat_max_len = 1024; -- also try 1023 and 1025 for comparison

SELECT GROUP_CONCAT(X), CHAR_LENGTH(GROUP_CONCAT(X)), LENGTH(GROUP_CONCAT(X)),
CHARSET(GROUP_CONCAT(X)), RIGHT(HEX(GROUP_CONCAT(X)), 10)
FROM TEST_GROUP_CONCAT GROUP BY Y;

DROP TABLE TEST_GROUP_CONCAT;

Suggested fix:
GROUP_CONCAT does not truncate any multibyte character in any character encoding. Also, group_concat_max_len should be interpreted as having units of characters (not bytes), if it is not already.
[19 Oct 2006 7:33] Valeriy Kravchuk
Thank you for a bug report. Verified just as described with 5.0.27-BK on Linux.
[20 Oct 2006 11:49] Alexander Barkov
mysql-4.1 is affected as well.
[23 Oct 2006 7:37] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/14151

ChangeSet@1.2562, 2006-10-23 12:38:18+05:00, bar@mysql.com +4 -0
  Bug#23451 GROUP_CONCAT truncates a multibyte utf8 character
  
  Problem: GROUP_CONCAT on a multi-byte column can truncate
  in the middle of a multibyte character when applying
  group_concat_max_len limit. It produces an invalid
  multi-byte character in the result string.
  
  Fix: allow to truncate only on character boundaries.
[7 Nov 2006 8:44] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/14930

ChangeSet@1.2536, 2006-11-07 12:45:48+04:00, bar@mysql.com +3 -0
  Bug#23451 GROUP_CONCAT truncates a multibyte utf8 character
    
    Problem: GROUP_CONCAT on a multi-byte column can truncate
    in the middle of a multibyte character when applying
    group_concat_max_len limit. It produces an invalid
    multi-byte character in the result string.
    
  The second, easier version - reusing old "warning_for_row" flag,
  instead of introducing of "result_is_full" - which was
  added in the previous commit.
[16 Nov 2006 16:48] Alexander Barkov
Appeared in 4.1.13 common
Appeared in 5.0.32 common
Appeared in 5.1.14 common
[17 Nov 2006 18:19] Paul DuBois
Noted in 4.1.13, 5.0.32, 5.1.14 changelogs.

When applying the group_concat_max_len limit, GROUP_CONCAT() could
truncate multi-byte characters in the middle.