Description:
When a column is defined with the utf8mb4 character set and the client uses a different character set (e.g., gbk), GROUP_CONCAT constructs the result string by concatenating the raw bytes of each row's value without performing any character set conversion. The resulting string is then sent to the client as if it were already encoded in the client's character set (gbk). The client therefore decodes the UTF-8 bytes as GBK, producing mojibake (garbled text).
The bug is triggered regardless of whether ORDER BY is used. The attached test case shows that the expected string "张三,李四,王五" becomes "寮犱笁,鏉鍥,鐜嬩簲" when the client character set is gbk. The corruption is not symmetrical: the second entry "李四" is corrupted differently, indicating that the concatenation buffer may have been mis-handled.
The stack trace from a debug build points directly to Item_func_group_concat::val_str(), which calls dump_leaf_key() and eventually String::append(). At that point, the source String (from the Item_field) has charset=utf8mb4, while the destination String (the GROUP_CONCAT accumulator) has charset=gbk. String::append() simply copies the bytes with memcpy() and does not transcode them, which is incorrect when the two character sets differ.
How to repeat:
1. Start a MySQL client whose default character set is gbk (or execute SET NAMES gbk).
2. Run the following SQL statements exactly:
SET NAMES gbk;
DROP TABLE IF EXISTS t_utf8;
CREATE TABLE t_utf8 (
  id INT PRIMARY KEY,
  name VARCHAR(50) CHARACTER SET utf8mb4
) DEFAULT CHARSET=utf8mb4;
INSERT INTO t_utf8 VALUES
  (1, _utf8 X'E5BCA0E4B889'), -- '张三'
  (2, _utf8 X'E69D8EE59B9B'), -- '李四'
  (3, _utf8 X'E78E8BE4BA94'); -- '王五'
SELECT GROUP_CONCAT(name ORDER BY id SEPARATOR ',') AS result FROM t_utf8;
Expected result:
+----------------------+
| result |
+----------------------+
| 张三,李四,王五 |
+----------------------+
Actual result (observed on MySQL 8.0.32, Ubuntu 20.04):
+----------------------+
| result |
+----------------------+
| 寮犱笁,鏉鍥,鐜嬩簲 |
+----------------------+
Note: The exact garbled characters may vary depending on the terminal encoding, but the underlying bytes sent over the protocol are the raw UTF-8 bytes (E5BCA0E4B889...) instead of the expected GBK bytes (D5C5C8FD...).
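The byte-level effect described in the note can be illustrated with a small C++ sketch. It assumes the usual GBK double-byte layout (lead byte 0x81-0xFE, trail byte 0x40-0xFE excluding 0x7F) and splits a raw byte string into the units a GBK decoder would see. The helper name gbk_units is hypothetical and is not part of MySQL.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helper (not MySQL code): split raw bytes into the units a GBK
// decoder would see, returning each unit as upper-case hex.
// Assumes the usual GBK layout: lead 0x81-0xFE, trail 0x40-0xFE except 0x7F.
std::vector<std::string> gbk_units(const std::string &bytes) {
  std::vector<std::string> units;
  for (std::size_t i = 0; i < bytes.size();) {
    const unsigned char lead = static_cast<unsigned char>(bytes[i]);
    std::size_t len = 1;  // ASCII, or a byte that cannot start a pair here
    if (lead >= 0x81 && lead <= 0xFE && i + 1 < bytes.size()) {
      const unsigned char trail = static_cast<unsigned char>(bytes[i + 1]);
      if (trail >= 0x40 && trail <= 0xFE && trail != 0x7F) len = 2;
    }
    std::string hex;
    for (std::size_t j = 0; j < len; ++j) {
      char buf[3];
      std::snprintf(buf, sizeof buf, "%02X",
                    static_cast<unsigned>(
                        static_cast<unsigned char>(bytes[i + j])));
      hex += buf;
    }
    units.push_back(hex);
    i += len;
  }
  return units;
}
```

Given the six UTF-8 bytes of '张三' (E5 BC A0 E4 B8 89), the function returns three two-byte units: E5BC, A0E4, and B889. This is why two UTF-8 characters appear as three GBK characters on the client.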
Diagnosis:
In debug builds, an assertion can be added in String::append() to check that the two Strings have the same character set. This assertion will fail, confirming the bug.
Stack trace (simplified):
#0 String::append (this=0x..., s=...) at sql_string.cpp:462
     this->m_charset = &my_charset_gbk_chinese_ci
     s.m_charset = &my_charset_utf8mb4_bin
#1 dump_leaf_key (key_arg=..., ...) at item_sum.cpp:4738
#2 tree_walk_left_root_right ...
#3 Item_func_group_concat::val_str (this=..., str=...) at item_sum.cpp:5459
Suggested fix:
Item_func_group_concat should convert each input value to the aggregation character set before appending it. The aggregation result should use the character set of the session's collation_connection, or another well-defined default. Currently, the result String is created with the connection character set (gbk in this case), but the raw input bytes are appended unconverted. A correct implementation would call String::append() directly only when both Strings have the same charset; otherwise it would convert the value first, either via my_convert() or by calling Item::val_str() with a buffer that is already in the target charset.
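The control flow of the suggested fix can be sketched as follows. This is not MySQL source code: TaggedString is a toy stand-in for String, and Converter and append_converted are hypothetical names. A real patch would use the server's own charset handles and my_convert() rather than charset names and a callback.

```cpp
#include <functional>
#include <string>

// Toy stand-in for MySQL's String (assumption): raw bytes plus a charset name.
struct TaggedString {
  std::string charset;
  std::string bytes;
};

// Hypothetical converter signature: (bytes, from_charset, to_charset) -> bytes.
// In the server this role would be played by my_convert() or an equivalent.
using Converter = std::function<std::string(
    const std::string &, const std::string &, const std::string &)>;

// Append src to dst, transcoding first when the character sets differ.
void append_converted(TaggedString &dst, const TaggedString &src,
                      const Converter &convert) {
  if (src.charset == dst.charset) {
    dst.bytes += src.bytes;  // same charset: a plain byte copy is safe
  } else {
    // different charset: convert into the accumulator's charset, then append
    dst.bytes += convert(src.bytes, src.charset, dst.charset);
  }
}
```

Passing the converter as a parameter keeps the sketch independent of any particular conversion library. The only point it makes is that the byte copy is guarded by a charset comparison.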
Workaround:
Avoid using GROUP_CONCAT when the column character set differs from the connection character set. Alternatively, convert the column explicitly before aggregation:
SELECT GROUP_CONCAT(CONVERT(name USING gbk) ORDER BY id SEPARATOR ',') FROM t_utf8;
This forces the conversion to happen per row and produces correct results.
Additional notes:
This bug also affects XMLAGG (as reported in earlier discussions) and likely other aggregate functions that concatenate strings. The root cause is the same: improper handling of character set conversion in the aggregation accumulator.