Bug #119908 GROUP_CONCAT returns corrupted characters when column charset is utf8mb4 and connection charset is gbk
Submitted: 12 Feb 12:49
Modified: 13 Feb 7:26
Reporter: huanlong wang
Status: Verified
Category: MySQL Server: Optimizer
Severity: S3 (Non-critical)
Version: 8.0.34
OS: Any
Assigned to:
CPU Architecture: Any

[12 Feb 12:49] huanlong wang
Description:
When a column is defined with utf8mb4 character set and the client uses a different character set (e.g., gbk), GROUP_CONCAT constructs the result string by concatenating raw bytes from each row's value without performing any character set conversion. The resulting string is then sent to the client assuming it is already encoded in the client's character set (gbk). This causes the client to decode the UTF-8 bytes as GBK, producing mojibake (garbled text).

The bug is triggered regardless of whether ORDER BY is used. The attached test case shows that the expected string "张三,李四,王五" becomes "寮犱笁,鏉鍥,鐜嬩簲" when the client character set is gbk. The corruption is not symmetrical: the second entry "李四" is corrupted differently, indicating that the concatenation buffer may have been mis-handled.

The stack trace (from a debug build) points directly to Item_func_group_concat::val_str(), which calls dump_leaf_key() and eventually String::append(). At that point, the source String (from the Item_field) has charset=utf8mb4, while the destination String (the GROUP_CONCAT accumulator) has charset=gbk. String::append() simply copies the bytes with memcpy() and does no transcoding, which is incorrect when the two character sets differ.

How to repeat:
1. Start a MySQL client with its default character set set to gbk (or execute SET NAMES gbk).
2. Run the following SQL statements exactly:

SET NAMES gbk;

DROP TABLE IF EXISTS t_utf8;
CREATE TABLE t_utf8 (
    id INT PRIMARY KEY,
    name VARCHAR(50) CHARACTER SET utf8mb4
) DEFAULT CHARSET=utf8mb4;

INSERT INTO t_utf8 VALUES 
    (1, _utf8 X'E5BCA0E4B889'),  -- '张三'
    (2, _utf8 X'E69D8EE59B9B'),  -- '李四'
    (3, _utf8 X'E78E8BE4BA94');  -- '王五'

SELECT GROUP_CONCAT(name ORDER BY id SEPARATOR ',') AS result FROM t_utf8;

Expected result:
+----------------------+
| result               |
+----------------------+
| 张三,李四,王五      |
+----------------------+

Actual result (observed on MySQL 8.0.32, Ubuntu 20.04):
+----------------------+
| result               |
+----------------------+
| 寮犱笁,鏉鍥,鐜嬩簲 |
+----------------------+

Note: The exact garbled characters may vary depending on the terminal encoding, but the underlying bytes sent over the protocol are the raw UTF-8 bytes (E5BCA0E4B889...) instead of the expected GBK bytes (D5C5C8FD...).
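
The byte-level mismatch described in the note can be reproduced outside the server. The following Python sketch is illustrative only: it models the two encodings of the test strings and what a gbk client would display if handed the UTF-8 bytes unchanged. It does not touch MySQL or its wire protocol, and the variable names are made up for this example.

```python
# Illustrative only: models the byte-level mismatch, not MySQL's protocol code.
names = ["张三", "李四", "王五"]
expected_text = ",".join(names)

# Bytes as stored in the utf8mb4 column, joined with the ',' separator.
utf8_bytes = expected_text.encode("utf-8")

# Bytes a gbk client expects to receive for the same text.
gbk_bytes = expected_text.encode("gbk")

# What a gbk client displays when it is handed the UTF-8 bytes unchanged.
# errors="replace" keeps the sketch from raising if a byte pair has no GBK mapping.
garbled = utf8_bytes.decode("gbk", errors="replace")

print(utf8_bytes.hex().upper())  # raw UTF-8 bytes, as in the INSERT statement
print(gbk_bytes.hex().upper())   # the GBK encoding of the same text
print(garbled)                   # mojibake, differs from the expected text
```

The first hex string begins with E5BCA0E4B889 and the second with D5C5C8FD, matching the byte sequences quoted in the note.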

Diagnosis:

In debug builds, an assertion can be added in String::append() to check that the two Strings have the same character set. This assertion will fail, confirming the bug.

Stack trace (simplified):
#0  String::append (this=0x..., s=...) at sql_string.cpp:462
    this->m_charset = &my_charset_gbk_chinese_ci
    s.m_charset     = &my_charset_utf8mb4_bin
#1  dump_leaf_key (key_arg=..., ...) at item_sum.cpp:4738
#2  tree_walk_left_root_right ...
#3  Item_func_group_concat::val_str (this=..., str=...) at item_sum.cpp:5459

Suggested fix:
Item_func_group_concat should convert each input value to the aggregation character set before appending. The aggregation result should have the character set of the current session's collation_connection or a proper default. Currently, the result String is created with the connection character set (gbk in this case), but the raw input bytes are not converted. A correct implementation would use String::append() only when both Strings have the same charset; otherwise, it would perform a character set conversion via my_convert() or use Item::val_str() with a buffer already in the target charset.
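
As a rough model of the suggestion above, the Python sketch below appends to an accumulator and transcodes only when the source and destination character sets differ. It is a toy illustration, not MySQL code: append_converted and its parameters are hypothetical names, and Python codec names stand in for MySQL character sets.

```python
def append_converted(acc: bytearray, acc_charset: str,
                     value: bytes, value_charset: str) -> None:
    """Append value to acc, transcoding first if the charsets differ.

    Hypothetical helper for illustration; it is not a MySQL function.
    """
    if value_charset != acc_charset:
        # Decode from the source charset, then re-encode in the target one.
        value = value.decode(value_charset).encode(acc_charset)
    acc.extend(value)


# The three utf8mb4 values from the test case, as raw bytes.
rows = [
    bytes.fromhex("E5BCA0E4B889"),  # 张三
    bytes.fromhex("E69D8EE59B9B"),  # 李四
    bytes.fromhex("E78E8BE4BA94"),  # 王五
]

acc = bytearray()  # accumulator modeled as holding gbk-encoded bytes
for i, row in enumerate(rows):
    if i > 0:
        append_converted(acc, "gbk", b",", "gbk")  # separator, no conversion
    append_converted(acc, "gbk", row, "utf-8")     # value, converted per row

result = acc.decode("gbk")
print(result)
```

Because every value is converted before it is appended, the accumulator only ever holds bytes in one character set, which is the invariant the proposed assertion in String::append() would check.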

Workaround:

Avoid using GROUP_CONCAT when the column character set differs from the connection character set. Alternatively, convert the column explicitly before aggregation:

SELECT GROUP_CONCAT(CONVERT(name USING gbk) ORDER BY id SEPARATOR ',') FROM t_utf8;

This forces the conversion to happen per row and produces correct results.

Additional notes:

This bug also affects XMLAGG (as reported in earlier discussions) and likely other aggregate functions that concatenate strings. The root cause is the same: improper handling of character set conversion in the aggregation accumulator.

[12 Feb 20:16] Roy Lyseng
Thank you for the bug report.
It is indeed wrong to mix strings with different character sets in one operation, without conversion into a common character set.

Your explanation is slightly wrong, though. The character set for the result of GROUP_CONCAT is supposed to be derived from the character set of the function's argument(s); in this case, that should be utf8mb4. The result string should, however, be converted to gbk before being sent to the client.
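
The ordering described here (aggregate in the argument's character set, then convert the finished result for the client) can be modeled in a few lines of Python. This is only a sketch under that reading of the comment: group_concat and send_to_client are hypothetical names, not MySQL functions, and Python codec names stand in for MySQL character sets.

```python
def group_concat(values: list[bytes], separator: bytes = b",") -> bytes:
    """Join values that already share one charset (utf8mb4 in this model)."""
    return separator.join(values)


def send_to_client(result: bytes, result_charset: str,
                   client_charset: str) -> bytes:
    """Convert the finished result once, just before it leaves the server."""
    return result.decode(result_charset).encode(client_charset)


# The three utf8mb4 values from the test case, as raw bytes.
rows = [
    bytes.fromhex("E5BCA0E4B889"),  # 张三
    bytes.fromhex("E69D8EE59B9B"),  # 李四
    bytes.fromhex("E78E8BE4BA94"),  # 王五
]

aggregated = group_concat(rows)                       # still UTF-8 bytes
wire = send_to_client(aggregated, "utf-8", "gbk")     # gbk bytes for the client

print(wire.decode("gbk"))
```

In this model the aggregation never mixes character sets, and a single conversion at the end yields the bytes a gbk client expects.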

[13 Feb 1:50] huanlong wang
Hi Roy,

You're right – I've revisited this and confirmed that the issue is not in GROUP_CONCAT itself.
It was a misinterpretation on my side. Please close this bug as "Not a Bug".

Sorry for the noise.

Best regards,
huanlong wang

[13 Feb 7:26] Roy Lyseng
No problem...

However, there is a small issue regarding the use of character sets in the implementation of GROUP_CONCAT, so I will keep this report open.

Thank you for your feedback.