| Bug #119463 | some character may be truncated use utf8mb4_general_ci collation_connection | ||
|---|---|---|---|
| Submitted: | 26 Nov 2025 6:23 | ||
| Reporter: | zhang xiaojian | Email Updates: | |
| Status: | Open | Impact on me: | |
| Category: | MySQL Server: Charsets | Severity: | S2 (Serious) |
| Version: | 9.5.0 | OS: | Any |
| Assigned to: | | CPU Architecture: | Any |
| Tags: | utf8mb4_0900_ai_ci | ||
[25 Dec 2025 7:33]
kai zhang
Hello MySQL Team,
I have identified and analyzed this bug, and I would like to share my findings and proposed solution.
# Root Cause Analysis
The string truncation issue occurs when using `utf8mb4_0900_ai_ci` collation with characters whose case conversion changes the UTF-8 byte length.
Specifically:
Forward conversion:
- `Ⱦ` (U+023E, 2 bytes: 0xC8 0xBE) → `ⱦ` (U+2C66, 3 bytes: 0xE2 0xB1 0xA6)
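The byte lengths quoted for this pair can be checked with a small stand-alone encoder. This is a minimal sketch; `utf8_encode` is a hypothetical helper written for this comment, not a MySQL function.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Encode one Unicode code point as UTF-8.
// Illustration only: surrogates are not rejected.
std::string utf8_encode(uint32_t cp) {
  assert(cp <= 0x10FFFF);
  std::string out;
  if (cp < 0x80) {            // 1 byte:  0xxxxxxx
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {    // 2 bytes: 110xxxxx 10xxxxxx
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp < 0x10000) {  // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {                    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
  return out;
}
```

`utf8_encode(0x023E).size()` is 2 and `utf8_encode(0x2C66).size()` is 3, so lowercasing this character grows the string by one byte.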
The problem occurs in the `my_casedn_utf8mb4()` function in `strings/ctype-utf8.cc`:
1. When `casedn_multiply == 1`, the code assumes that case conversion does not change the byte length.
2. This triggers in-place conversion, where `src == dst` (same buffer address).
3. When converting `Ⱦ` (2 bytes) to `ⱦ` (3 bytes), the extra byte overwrites the first byte of the next character.
4. This causes memory corruption and a subsequent UTF-8 parsing failure, leading to truncation.
# Memory corruption example:
Original: [a][a][a][Ⱦ:C8 BE][b][b][b] (8 bytes)
After: [a][a][a][ⱦ:E2 B1 A6][b][b] (corrupted)
Byte at offset 5 (0-based) overwritten: 'b' (0x62) → 0xA6
The next read starts at offset 5, but 0xA6 is a continuation byte, not a valid UTF-8 lead byte, so parsing stops and the result is truncated.
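The overwrite can be reproduced outside the server with a small model of an in-place lowercase loop. This is a sketch under stated assumptions: `decode_one`, `encode_one`, `toy_tolower` and `casedn_in_place` are hypothetical names, the loop shape follows the description in this comment rather than the actual MySQL source, and the case mapping covers only ASCII and the one pair discussed here.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Decode one UTF-8 sequence of 1 to 3 bytes (enough for this example).
// Returns the number of bytes consumed, or 0 if the sequence is invalid.
int decode_one(const unsigned char *s, const unsigned char *end, uint32_t *cp) {
  if (s >= end) return 0;
  if (s[0] < 0x80) { *cp = s[0]; return 1; }
  if (s[0] >= 0xC2 && s[0] <= 0xDF) {
    if (end - s < 2 || (s[1] & 0xC0) != 0x80) return 0;
    *cp = (static_cast<uint32_t>(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
    return 2;
  }
  if ((s[0] & 0xF0) == 0xE0) {
    if (end - s < 3 || (s[1] & 0xC0) != 0x80 || (s[2] & 0xC0) != 0x80) return 0;
    *cp = (static_cast<uint32_t>(s[0] & 0x0F) << 12) |
          (static_cast<uint32_t>(s[1] & 0x3F) << 6) | (s[2] & 0x3F);
    return 3;
  }
  return 0;  // continuation byte (for example 0xA6) or unsupported lead byte
}

// Encode one code point below U+10000. Returns bytes written, or 0 if it
// does not fit before `end`.
int encode_one(uint32_t cp, unsigned char *d, const unsigned char *end) {
  if (cp < 0x80) {
    if (end - d < 1) return 0;
    d[0] = static_cast<unsigned char>(cp);
    return 1;
  }
  if (cp < 0x800) {
    if (end - d < 2) return 0;
    d[0] = static_cast<unsigned char>(0xC0 | (cp >> 6));
    d[1] = static_cast<unsigned char>(0x80 | (cp & 0x3F));
    return 2;
  }
  if (cp < 0x10000) {
    if (end - d < 3) return 0;
    d[0] = static_cast<unsigned char>(0xE0 | (cp >> 12));
    d[1] = static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F));
    d[2] = static_cast<unsigned char>(0x80 | (cp & 0x3F));
    return 3;
  }
  return 0;
}

// Toy mapping (assumption): ASCII letters plus U+023E -> U+2C66.
uint32_t toy_tolower(uint32_t cp) {
  if (cp >= 0x41 && cp <= 0x5A) return cp + 32;
  if (cp == 0x023E) return 0x2C66;
  return cp;
}

// Lowercase with src and dst walking the SAME buffer, as in the report.
std::string casedn_in_place(std::string s) {
  unsigned char *buf = reinterpret_cast<unsigned char *>(&s[0]);
  unsigned char *src = buf, *dst = buf;
  const unsigned char *end = buf + s.size();
  uint32_t cp = 0;
  int srcres = 0, dstres = 0;
  while (src < end && (srcres = decode_one(src, end, &cp)) > 0) {
    dstres = encode_one(toy_tolower(cp), dst, end);
    if (dstres <= 0) break;
    src += srcres;  // advances 2 for U+023E ...
    dst += dstres;  // ... while 3 bytes were written over unread input
  }
  return s.substr(0, static_cast<size_t>(dst - buf));
}
```

With the input bytes `61 61 61 C8 BE 62 62 62`, the model returns `61 61 61 E2 B1 A6`, which is `aaaⱦ`, the same truncated value the reporter sees. Under the same model a lone `Ⱦ` in a 2-byte buffer yields zero bytes because the 3-byte result does not fit; that would be consistent with the empty `BINARY lower('Ⱦ')` result in this comment, though it is an inference from the model and not from the server source.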
This bug also happens in other utf8mb4 0900 series collations.
# Similar bug:
mysql> select upper('aaaɀbbb');
+-------------------+
| upper('aaaɀbbb') |
+-------------------+
| AAAⱿ |
+-------------------+
1 row in set (0.00 sec)
mysql> SELECT BINARY lower('Ⱦ');
+----------------------------------------+
| BINARY lower('Ⱦ') |
+----------------------------------------+
| 0x |
+----------------------------------------+
1 row in set, 1 warning (0.00 sec)
[25 Dec 2025 7:34]
kai zhang
Suggested fix:
Attachment: bug#119463.patch (application/octet-stream, text), 3.15 KiB.

Description:
When utf8mb4_0900_ai_ci is used, a simple string may be truncated.

txsql> select lower('aaaȾbbb');
+-------------------+
| lower('aaaȾbbb')  |
+-------------------+
| aaaⱦ              |
+-------------------+
1 row in set (2 min 29.08 sec)

mysql> show variables like "%colla%";
+-------------------------------+--------------------+
| Variable_name                 | Value              |
+-------------------------------+--------------------+
| collation_connection          | utf8mb4_0900_ai_ci |
| collation_database            | utf8mb4_0900_ai_ci |
| collation_server              | utf8mb4_0900_ai_ci |
| default_collation_for_utf8mb4 | utf8mb4_0900_ai_ci |
+-------------------------------+--------------------+
4 rows in set (0.028 sec)

With utf8mb4_general_ci the string is not truncated, but the character is not converted to lowercase.

mysql> set names 'utf8mb4' collate 'utf8mb4_general_ci';
Query OK, 0 rows affected (0.002 sec)

mysql> show variables like "%colla%";
+-------------------------------+--------------------+
| Variable_name                 | Value              |
+-------------------------------+--------------------+
| collation_connection          | utf8mb4_general_ci |
| collation_database            | utf8mb4_0900_ai_ci |
| collation_server              | utf8mb4_0900_ai_ci |
| default_collation_for_utf8mb4 | utf8mb4_0900_ai_ci |
+-------------------------------+--------------------+
4 rows in set (0.005 sec)

mysql> select lower('aaaȾbbb');
+-------------------+
| lower('aaaȾbbb')  |
+-------------------+
| aaaȾbbb           |
+-------------------+
1 row in set (0.001 sec)

How to repeat:
See Description.

Suggested fix:
In the function my_casedn_utf8mb4, src and dst point to the same memory, but my_tolower_utf8mb4 may change the byte length of a character.

gdb:
```
(gdb) n
(gdb) p src
$9 = 0x7f66e2fa8ce3 "��bbb"
(gdb) p dst
$10 = 0x7f66e2fa8ce3 "��bbb"
(gdb) n
(gdb) p srcres
$11 = 2
(gdb) p dstres
$12 = 3
(gdb)
```
After my_tolower_utf8mb4, dstres is 3 but srcres is 2.

After using gdb to force srcres to 3, we get the expected result:
```
(gdb) p srcres
$14 = 3
(gdb) p dstres
$15 = 3
```
```
txsql> select lower('aaaȾbbb');
+-------------------+
| lower('aaaȾbbb')  |
+-------------------+
| aaaⱦbb            |
+-------------------+
```
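Since the diagnosis is that src and dst share memory while a character can grow, one possible direction is to write into a separate output buffer so that writes never touch unread input. The sketch below illustrates that idea only. It is not the attached patch and not MySQL code; `toy_tolower`, `append_utf8` and `casedn_copy` are hypothetical names, and the mapping covers only ASCII and the pair from this report.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Toy mapping (assumption): ASCII letters plus U+023E -> U+2C66.
uint32_t toy_tolower(uint32_t cp) {
  if (cp >= 0x41 && cp <= 0x5A) return cp + 32;
  if (cp == 0x023E) return 0x2C66;
  return cp;
}

// Append a code point below U+10000 to `out` as UTF-8.
void append_utf8(std::string &out, uint32_t cp) {
  if (cp < 0x80) {
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
}

// Lowercase `in` into a separate string. The input is only read, so a
// character that grows from 2 to 3 bytes cannot clobber the bytes after it.
std::string casedn_copy(const std::string &in) {
  std::string out;
  std::size_t i = 0;
  while (i < in.size()) {
    unsigned char b0 = static_cast<unsigned char>(in[i]);
    uint32_t cp = 0;
    std::size_t len = 0;
    if (b0 < 0x80) { cp = b0; len = 1; }
    else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; }
    else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; }
    else break;                        // invalid or unsupported lead byte
    if (i + len > in.size()) break;    // truncated sequence
    for (std::size_t k = 1; k < len; ++k) {
      unsigned char bk = static_cast<unsigned char>(in[i + k]);
      if ((bk & 0xC0) != 0x80) return out;  // malformed continuation byte
      cp = (cp << 6) | (bk & 0x3F);
    }
    append_utf8(out, toy_tolower(cp));
    i += len;
  }
  return out;
}
```

For the 8-byte input `aaaȾbbb` this returns the 9-byte string `aaaⱦbbb`, with all three trailing `b` characters intact. The cost is an output buffer that can be larger than the input, which is the situation a multiplier greater than 1 would have to account for.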