Bug #119463 Some characters may be truncated when using utf8mb4_general_ci collation_connection
Submitted: 26 Nov 2025 6:23
Reporter: zhang xiaojian
Status: Open
Category: MySQL Server: Charsets
Severity: S2 (Serious)
Version: 9.5.0
OS: Any
Assigned to:
CPU Architecture: Any
Tags: utf8mb4_0900_ai_ci

[26 Nov 2025 6:23] zhang xiaojian
Description:
When utf8mb4_0900_ai_ci is the connection collation, lower() can truncate a simple string.

txsql> select lower('aaaȾbbb');
+-------------------+
| lower('aaaȾbbb')  |
+-------------------+
| aaaⱦ              |
+-------------------+
1 row in set (2 min 29.08 sec)

mysql> show variables like "%colla%";
+-------------------------------+--------------------+
| Variable_name                 | Value              |
+-------------------------------+--------------------+
| collation_connection          | utf8mb4_0900_ai_ci |
| collation_database            | utf8mb4_0900_ai_ci |
| collation_server              | utf8mb4_0900_ai_ci |
| default_collation_for_utf8mb4 | utf8mb4_0900_ai_ci |
+-------------------------------+--------------------+
4 rows in set (0.028 sec)

With utf8mb4_general_ci the string is not truncated, but the character Ⱦ is not converted to lowercase.

mysql> set names 'utf8mb4' collate 'utf8mb4_general_ci';
Query OK, 0 rows affected (0.002 sec)

mysql> show variables like "%colla%";
+-------------------------------+--------------------+
| Variable_name                 | Value              |
+-------------------------------+--------------------+
| collation_connection          | utf8mb4_general_ci |
| collation_database            | utf8mb4_0900_ai_ci |
| collation_server              | utf8mb4_0900_ai_ci |
| default_collation_for_utf8mb4 | utf8mb4_0900_ai_ci |
+-------------------------------+--------------------+
4 rows in set (0.005 sec)

mysql> select lower('aaaȾbbb');
+-------------------+
| lower('aaaȾbbb')  |
+-------------------+
| aaaȾbbb           |
+-------------------+
1 row in set (0.001 sec)

How to repeat:
See Description.

Suggested fix:
In my_casedn_utf8mb4, src and dst point to the same memory, but my_tolower_utf8mb4 may map a character to one whose UTF-8 encoding has a different byte length.

gdb session:

```
(gdb) n
(gdb) p src
$9 = 0x7f66e2fa8ce3 "Ⱦbbb"
(gdb) p dst
$10 = 0x7f66e2fa8ce3 "Ⱦbbb"
(gdb) n
(gdb) p srcres
$11 = 2
(gdb) p dstres
$12 = 3
(gdb)   
```

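After my_tolower_utf8mb4, dstres is 3 but srcres is 2: three bytes are written where only two were read.

After using gdb to force srcres to 3, the conversion no longer stops at the converted character (one b is still lost, because the 3-byte write has already overwritten it):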
```
(gdb) p srcres
$14 = 3
(gdb) p dstres
$15 = 3
```

```
txsql> select lower('aaaȾbbb');
+-------------------+
| lower('aaaȾbbb')  |
+-------------------+
| aaaⱦbb            |
+-------------------+
```
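
The gdb experiment above still drops one b, which suggests that changing how far src advances does not help while src and dst overlap. The following is a minimal Python sketch of the alternative, writing into a separate output buffer. It is an illustration only: casedn_separate_buffer is a hypothetical name, Python's own Unicode lowercasing stands in for my_tolower_utf8mb4, and this is not the content of any patch attached to this report.

```python
def casedn_separate_buffer(src: bytes) -> bytes:
    """Lowercase UTF-8 bytes into a fresh buffer instead of over the input."""
    # The output buffer grows as needed, so a longer encoding can never
    # clobber input bytes that have not been read yet.
    dst = bytearray()
    for ch in src.decode("utf-8"):
        dst += ch.lower().encode("utf-8")
    return bytes(dst)


original = "aaaȾbbb".encode("utf-8")
converted = casedn_separate_buffer(original)
print(len(original), len(converted))  # 8 9
print(converted.decode("utf-8"))      # aaaⱦbbb
```

The result is one byte longer than the input, so any real fix also has to size the destination for the worst-case growth.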
[25 Dec 2025 7:33] kai zhang
Hello MySQL Team,
I have identified and analyzed this bug, and I would like to share my findings and proposed solution.

# Root Cause Analysis
The string truncation issue occurs when using `utf8mb4_0900_ai_ci` collation with characters whose case conversion changes the UTF-8 byte length. 

Specifically, for the forward (uppercase to lowercase) conversion:
- `Ⱦ` (U+023E, 2 bytes: 0xC8 0xBE) → `ⱦ` (U+2C66, 3 bytes: 0xE2 0xB1 0xA6)
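
The byte lengths quoted above can be checked independently of MySQL. The following is a small Python sketch that uses Python's built-in Unicode case mapping as a stand-in (an assumption: it is not `my_tolower_utf8mb4`, but for these characters it agrees with the outputs shown in this report). `utf8_len_change` is a hypothetical helper name. It also covers the `ɀ` character from the `upper()` example in this comment.

```python
def utf8_len_change(ch, convert):
    """Return (bytes before, bytes after) for a one-character case conversion."""
    return len(ch.encode("utf-8")), len(convert(ch).encode("utf-8"))


# U+023E lowercases to U+2C66; U+0240 uppercases to U+2C7F.
for ch, convert in (("Ⱦ", str.lower), ("ɀ", str.upper)):
    before, after = utf8_len_change(ch, convert)
    print(f"U+{ord(ch):04X} -> U+{ord(convert(ch)):04X}: {before} -> {after} bytes")
# U+023E -> U+2C66: 2 -> 3 bytes
# U+0240 -> U+2C7F: 2 -> 3 bytes
```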

The problem occurs in `my_casedn_utf8mb4()` in `strings/ctype-utf8.cc`:

1. When `casedn_multiply == 1`, the code assumes that case conversion cannot change the byte length.
2. That assumption allows in-place conversion, where `src == dst` (same buffer address).
3. When `Ⱦ` (2 bytes) is converted to `ⱦ` (3 bytes), the extra byte overwrites the first byte of the next character.
4. The overwritten byte then fails UTF-8 parsing on the next read, so the loop stops and the result is truncated.

# Memory corruption example:
Original: [a][a][a][Ⱦ:C8 BE][b][b][b] (8 bytes)
After:    [a][a][a][ⱦ:E2 B1 A6][b][b] (first 'b' overwritten)
Byte offset 5 (0-based) is overwritten: 'b' (0x62) → 0xA6.
The next read starts at offset 5, where 0xA6 is a continuation byte and not a valid UTF-8 start byte, so parsing fails and the result is truncated.

The same bug affects the other utf8mb4 0900-series collations.
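
The overwrite described above can be reproduced outside the server. The following Python sketch models the loop as this report describes it (read `srcres` bytes, convert, write `dstres` bytes into the same buffer). It is a simulation under assumptions, not the MySQL source: `decode_one` and `casedn_in_place` are hypothetical names, the decoder is simplified, and Python's `str.lower()` stands in for `my_tolower_utf8mb4`. The `advance_src_by_dstres` flag mimics the gdb experiment from the original report.

```python
def decode_one(buf, pos, end):
    """Decode one UTF-8 sequence at pos; return (code_point, length) or (None, 0).

    Simplified for illustration: lead and continuation bytes are checked,
    but overlong forms and surrogates are not rejected.
    """
    if pos >= end:
        return None, 0
    b0 = buf[pos]
    if b0 < 0x80:
        return b0, 1
    if 0xC2 <= b0 <= 0xDF:
        length, cp = 2, b0 & 0x1F
    elif 0xE0 <= b0 <= 0xEF:
        length, cp = 3, b0 & 0x0F
    elif 0xF0 <= b0 <= 0xF4:
        length, cp = 4, b0 & 0x07
    else:
        return None, 0  # continuation byte (0x80-0xBF) or invalid lead byte
    if pos + length > end:
        return None, 0
    for b in buf[pos + 1:pos + length]:
        if b & 0xC0 != 0x80:
            return None, 0
        cp = (cp << 6) | (b & 0x3F)
    return cp, length


def casedn_in_place(data, advance_src_by_dstres=False):
    """Lowercase UTF-8 bytes while reading and writing the same buffer."""
    buf = bytearray(data)
    end = len(buf)
    src = dst = 0  # both cursors walk the same buffer, as when src == dst
    while src < end:
        cp, srcres = decode_one(buf, src, end)
        if srcres == 0:
            break  # invalid sequence: stop, which truncates the result
        lowered = chr(cp).lower().encode("utf-8")
        dstres = len(lowered)
        if dst + dstres > end:
            break
        buf[dst:dst + dstres] = lowered  # may overwrite bytes not yet read
        src += dstres if advance_src_by_dstres else srcres
        dst += dstres
    return bytes(buf[:dst])


data = "aaaȾbbb".encode("utf-8")
print(casedn_in_place(data).decode("utf-8"))                              # aaaⱦ
print(casedn_in_place(data, advance_src_by_dstres=True).decode("utf-8"))  # aaaⱦbb
```

Both outputs match the results shown in this report: truncation after `ⱦ` by default, and `aaaⱦbb` when the source cursor is forced past the overwritten byte.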

# Similar bug:
mysql> select upper('aaaɀbbb');
+-------------------+
| upper('aaaɀbbb')  |
+-------------------+
| AAAⱿ              |
+-------------------+
1 row in set (0.00 sec)
mysql> SELECT BINARY lower('Ⱦ');
+----------------------------------------+
| BINARY lower('Ⱦ')                      |
+----------------------------------------+
| 0x                                     |
+----------------------------------------+
1 row in set, 1 warning (0.00 sec)
[25 Dec 2025 7:34] kai zhang
Suggested fix:

Attachment: bug#119463.patch (application/octet-stream, text), 3.15 KiB.

[25 Dec 2025 12:24] kai zhang
Suggested fix in tag mysql-9.5.0:

Attachment: 950.patch (application/octet-stream, text), 3.42 KiB.