Description:
When LOWER() is applied with utf8mb4_0900_ai_ci (and potentially other utf8mb4 collations) to a string containing a character whose lowercase form is longer in UTF-8 than the original (e.g., Ⱦ U+023E, 2 bytes -> ⱦ U+2C66, 3 bytes), the result is truncated.
This appears to be caused by an incorrect casedn_multiply value (set to 1) in the corresponding CHARSET_INFO structure. With a multiplier of 1, Item_func_lower either converts in place or allocates a buffer no larger than the input. Because the converted character is longer than the original, it overwrites the bytes that follow it. The source pointer (src) then advances by the original length (2 bytes) and lands on the 3rd byte of the new character, which is misinterpreted (likely as an invalid start byte), so string processing terminates early.
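The byte lengths quoted above can be checked independently of MySQL. The following is a minimal Python sketch; it relies on Python's own Unicode lowercase mapping, which is assumed here to agree with MySQL's case tables for this character.

```python
# Check the UTF-8 byte lengths of the character pair from the report.
# Python's str.lower() is used only as a stand-in for MySQL's case mapping.
upper = "\u023E"       # Ⱦ LATIN CAPITAL LETTER A WITH STROKE
lower = upper.lower()  # lowercase mapping according to Python's Unicode data

upper_bytes = upper.encode("utf-8")
lower_bytes = lower.encode("utf-8")


def hex_bytes(data):
    """Format a bytes object as space-separated uppercase hex pairs."""
    return " ".join(f"{b:02X}" for b in data)


print(f"U+{ord(upper):04X} -> {hex_bytes(upper_bytes)} ({len(upper_bytes)} bytes)")
print(f"U+{ord(lower):04X} -> {hex_bytes(lower_bytes)} ({len(lower_bytes)} bytes)")
```

The lowercase form is one byte longer than the uppercase form, so any conversion that assumes the output length equals the input length is unsafe for this character.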
How to repeat:
Execute the following SQL query:
SELECT LOWER(_utf8mb4'aaaȾbbb' COLLATE utf8mb4_0900_ai_ci);
Expected Result: aaaⱦbbb
Actual Result: aaaⱦ (The suffix bbb is lost/truncated)
Suggested fix:
Analysis / Root Cause:
1. Input Character: Ⱦ (U+023E) is encoded as 0xC8 0xBE (2 bytes) in UTF-8.
2. Output Character: ⱦ (U+2C66) is encoded as 0xE2 0xB1 0xA6 (3 bytes) in UTF-8.
3. The Issue: In the source code, the CHARSET_INFO structure for the collation (specifically my_charset_utf8mb4_0900_ai_ci or the underlying binary/handler it falls back to) defines casedn_multiply = 1.
Because the multiplier is 1, the allocator assumes the length will not increase and allows an in-place conversion strategy.
4. Execution Trace (deduced):
- my_casedn_utf8mb4 reads Ⱦ (2 bytes).
- It writes ⱦ (3 bytes) to the destination buffer. Since it is operating in-place (or with overlapping pointers), the 3rd byte (0xA6) overwrites the first character of the suffix (b).
- The loop increments the src pointer by the original length (src += 2).
- The src pointer now points to the address where 0xA6 (the 3rd byte of the new char) was just written.
- 0xA6 is a UTF-8 continuation byte (bit pattern 10xxxxxx), not a valid start byte. The loop likely terminates at this point or treats the byte as the end of the convertible data, which truncates the result.
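The deduced trace above can be reproduced with a toy model. The sketch below is not MySQL's my_casedn_utf8mb4; the functions utf8_len and casedn are hypothetical, Python's str.lower() stands in for the server's case tables, and stopping at an invalid start byte is an assumption taken from the trace. It only illustrates how sharing one buffer between source and destination yields the truncated result.

```python
def utf8_len(lead):
    """Return the sequence length implied by a UTF-8 lead byte, or 0 if invalid."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0  # continuation byte (0x80..0xBF) or otherwise invalid lead byte


def casedn(data, in_place):
    """Toy lowercase loop; in_place=True makes source and destination overlap."""
    src_buf = bytearray(data)
    # Assumption: the non-in-place case gets a buffer twice the input size.
    dst_buf = src_buf if in_place else bytearray(2 * len(data))
    src = dst = 0
    end = len(data)
    while src < end:
        n = utf8_len(src_buf[src])
        if n == 0 or src + n > end:
            break  # invalid start byte: stop, as the trace assumes
        try:
            ch = src_buf[src:src + n].decode("utf-8")
        except UnicodeDecodeError:
            break
        low = ch.lower().encode("utf-8")
        if dst + len(low) > len(dst_buf):
            break  # destination buffer exhausted
        dst_buf[dst:dst + len(low)] = low  # may clobber unread source bytes
        src += n         # advance by the ORIGINAL character length
        dst += len(low)  # advance by the CONVERTED character length
    return bytes(dst_buf[:dst])


data = "aaaȾbbb".encode("utf-8")
print(casedn(data, in_place=True).decode("utf-8"))
print(casedn(data, in_place=False).decode("utf-8"))
```

In this model the in-place call stops right after ⱦ because src lands on the freshly written 0xA6, while the call with a separate, larger destination buffer converts the whole string.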
Suggested Fix:
I have verified that changing the casedn_multiply value from 1 to 2 (or 3 for safety against larger expansions) in the CHARSET_INFO definition resolves the issue.
After recompiling MySQL with casedn_multiply = 2, the query returns the correct result: aaaⱦbbb
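The effect of the multiplier on buffer capacity can be shown with simple arithmetic. The sketch below assumes the destination capacity is computed as the input length in bytes times casedn_multiply; the helper worst_case_len is hypothetical and only mirrors that assumption.

```python
def worst_case_len(src_len, casedn_multiply):
    """Assumed destination capacity: input byte length times the multiplier."""
    return src_len * casedn_multiply


src = "aaaȾbbb".encode("utf-8")  # input bytes from the report
out = "aaaⱦbbb".encode("utf-8")  # expected output bytes

for multiply in (1, 2, 3):
    capacity = worst_case_len(len(src), multiply)
    verdict = "fits" if len(out) <= capacity else "too small"
    print(f"casedn_multiply={multiply}: capacity {capacity}, needed {len(out)}: {verdict}")
```

For this input the expected output is one byte longer than the source, so a multiplier of 1 cannot hold it, while 2 or 3 leaves room. Whether 2 is sufficient for every character covered by the collation depends on the largest lowercase expansion in its case tables, which this sketch does not examine.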