Bug #113537 False match of multibyte chars when checking existence of a control character
Submitted: 2 Jan 2024 16:00 Modified: 4 Jan 2024 7:57
Reporter: Michael Olลกavskรฝ Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: Charsets Severity:S3 (Non-critical)
Version:8.0.32 OS:Any
Assigned to: CPU Architecture:Any
Tags: charset, nul byte, UTF-8

[2 Jan 2024 16:00] Michael Olลกavskรฝ
Description:
While searching for for NUL character (0x00) in our UTF8 strings in order to strip them (which is a requirement of our CDC pipeline), we discovered that matching a UTF-8 control character against multibyte characters (like emojis) returns a positive match even though the character is not present.

We're using `utf8mb3` and `utf8mb4` columns.

Connection setup:
SELECT @@character_set_database, @@collation_database;
'utf8mb4', 'utf8mb4_0900_ai_ci'

Eg. following code returns TRUE (for both columns):
```
SELECT
    POSITION(0x00 IN '๐Ÿ˜‚2'),
    CHAR('๐Ÿ˜‚2' USING utf8) LIKE CONCAT("%", CHAR(0x00 USING utf8), "%");
```

but it should be FALSE - there is no 8 zero consecutive bits in the binary representation. 

I haven't found any mention of this behavior in the documentation. Can you please check if this is a bug or misunderstanding of how these functions work on valid UTF 8 Unicode strings?

How to repeat:
```
SELECT
    LOCATE(UNHEX('00'), "๐Ÿ˜‚"),
    LOCATE(UNHEX('01'), "๐Ÿ˜‚"),
    LOCATE(UNHEX('02'), "๐Ÿ˜‚"),
    LOCATE(UNHEX('03'), "๐Ÿ˜‚"),
    LOCATE(UNHEX('04'), "๐Ÿ˜‚"),
    LOCATE(UNHEX('05'), "๐Ÿ˜‚"),
    LOCATE(UNHEX('06'), "๐Ÿ˜‚"),
    LOCATE(UNHEX('07'), "๐Ÿ˜‚"),
    LOCATE(UNHEX('08'), "๐Ÿ˜‚"),
    LOCATE(UNHEX('09'), "๐Ÿ˜‚"),
    LOCATE(UNHEX('0A'), "๐Ÿ˜‚"),
    LOCATE(UNHEX('0B'), "๐Ÿ˜‚"),
    LOCATE(UNHEX('0C'), "๐Ÿ˜‚"),
    LOCATE(UNHEX('0D'), "๐Ÿ˜‚"),
    LOCATE(UNHEX('0E'), "๐Ÿ˜‚"),
    LOCATE(UNHEX('0F'), "๐Ÿ˜‚"),
    LOCATE(UNHEX('10'), "๐Ÿ˜‚")
;
```

The query above should also return FALSE for all the LOCATE calls but some of them return TRUE.
[3 Jan 2024 13:31] Bernt Marius Johnsen
1: CHAR('๐Ÿ˜‚2' USING utf8) does not give the string '๐Ÿ˜‚2' since CHAR() expects an integer argument. (In this case it is equivalent with CHAR(0 USING utf8). See the documentation of CHAR(): https://dev.mysql.com/doc/refman/8.0/en/string-functions.html#function_char

2: POSITION/LOCATE does not return TRUE/FALSE but a position. See https://dev.mysql.com/doc/refman/8.0/en/string-functions.html#function_locate

But: There is a bug here. A simpler  example:

mysql> select LOCATE(x'00','abcdef');
+------------------------+
| LOCATE(x'00','abcdef') |
+------------------------+
|                      0 |
+------------------------+
1 row in set (0.00 sec)

mysql> select LOCATE(x'00','abcdรฉf');
+-------------------------+
| LOCATE(x'00','abcdรฉf')  |
+-------------------------+
|                       5 |
+-------------------------+
1 row in set (0.00 sec)

The first select is correct, the second is not.
[4 Jan 2024 10:44] Bernt Marius Johnsen
If one is only looking for values <= 7f, a workaround is to convert the string to binary. Bytes with value <= 7f is never part of a multibyte UTF-8 encoding. E.g:

mysql> select LOCATE(x'00',convert('abcdรฉf' using binary));
+-----------------------------------------------+
| LOCATE(x'00',convert('abcdรฉf' using binary))  |
+-----------------------------------------------+
|                                             0 |
+-----------------------------------------------+
1 row in set (0,01 sec)