| Bug #113537 | False match of multibyte chars when checking existence of a control character | ||
|---|---|---|---|
| Submitted: | 2 Jan 2024 16:00 | Modified: | 4 Jan 2024 7:57 |
| Reporter: | Michael Olลกavskรฝ | Email Updates: | |
| Status: | Verified | Impact on me: | |
| Category: | MySQL Server: Charsets | Severity: | S3 (Non-critical) |
| Version: | 8.0.32 | OS: | Any |
| Assigned to: | CPU Architecture: | Any | |
| Tags: | charset, nul byte, UTF-8 | ||
[3 Jan 2024 13:31]
Bernt Marius Johnsen
1: CHAR('๐2' USING utf8) does not give the string '๐2' since CHAR() expects an integer argument. (In this case it is equivalent with CHAR(0 USING utf8). See the documentation of CHAR(): https://dev.mysql.com/doc/refman/8.0/en/string-functions.html#function_char
2: POSITION/LOCATE does not return TRUE/FALSE but a position. See https://dev.mysql.com/doc/refman/8.0/en/string-functions.html#function_locate
But: There is a bug here. A simpler example:
mysql> select LOCATE(x'00','abcdef');
+------------------------+
| LOCATE(x'00','abcdef') |
+------------------------+
| 0 |
+------------------------+
1 row in set (0.00 sec)
mysql> select LOCATE(x'00','abcdรฉf');
+-------------------------+
| LOCATE(x'00','abcdรฉf') |
+-------------------------+
| 5 |
+-------------------------+
1 row in set (0.00 sec)
The first select is correct, the second is not.
[4 Jan 2024 10:44]
Bernt Marius Johnsen
If one is only looking for values <= 7f, a workaround is to convert the string to binary. Bytes with value <= 7f is never part of a multibyte UTF-8 encoding. E.g:
mysql> select LOCATE(x'00',convert('abcdรฉf' using binary));
+-----------------------------------------------+
| LOCATE(x'00',convert('abcdรฉf' using binary)) |
+-----------------------------------------------+
| 0 |
+-----------------------------------------------+
1 row in set (0,01 sec)

Description: While searching for for NUL character (0x00) in our UTF8 strings in order to strip them (which is a requirement of our CDC pipeline), we discovered that matching a UTF-8 control character against multibyte characters (like emojis) returns a positive match even though the character is not present. We're using `utf8mb3` and `utf8mb4` columns. Connection setup: SELECT @@character_set_database, @@collation_database; 'utf8mb4', 'utf8mb4_0900_ai_ci' Eg. following code returns TRUE (for both columns): ``` SELECT POSITION(0x00 IN '๐2'), CHAR('๐2' USING utf8) LIKE CONCAT("%", CHAR(0x00 USING utf8), "%"); ``` but it should be FALSE - there is no 8 zero consecutive bits in the binary representation. I haven't found any mention of this behavior in the documentation. Can you please check if this is a bug or misunderstanding of how these functions work on valid UTF 8 Unicode strings? How to repeat: ``` SELECT LOCATE(UNHEX('00'), "๐"), LOCATE(UNHEX('01'), "๐"), LOCATE(UNHEX('02'), "๐"), LOCATE(UNHEX('03'), "๐"), LOCATE(UNHEX('04'), "๐"), LOCATE(UNHEX('05'), "๐"), LOCATE(UNHEX('06'), "๐"), LOCATE(UNHEX('07'), "๐"), LOCATE(UNHEX('08'), "๐"), LOCATE(UNHEX('09'), "๐"), LOCATE(UNHEX('0A'), "๐"), LOCATE(UNHEX('0B'), "๐"), LOCATE(UNHEX('0C'), "๐"), LOCATE(UNHEX('0D'), "๐"), LOCATE(UNHEX('0E'), "๐"), LOCATE(UNHEX('0F'), "๐"), LOCATE(UNHEX('10'), "๐") ; ``` The query above should also return FALSE for all the LOCATE calls but some of them return TRUE.