MySQL Bugs: #113537: False match of multibyte chars when checking existence of a control character

Bug #113537	False match of multibyte chars when checking existence of a control character
Submitted:	2 Jan 2024 16:00	Modified:	4 Jan 2024 7:57
Reporter:	Michael Olšavský	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: Charsets	Severity:	S3 (Non-critical)
Version:	8.0.32	OS:	Any
Assigned to:		CPU Architecture:	Any
Tags:	charset, nul byte, UTF-8

Description:
While searching for for NUL character (0x00) in our UTF8 strings in order to strip them (which is a requirement of our CDC pipeline), we discovered that matching a UTF-8 control character against multibyte characters (like emojis) returns a positive match even though the character is not present.

We're using `utf8mb3` and `utf8mb4` columns.

Connection setup:
SELECT @@character_set_database, @@collation_database;
'utf8mb4', 'utf8mb4_0900_ai_ci'

Eg. following code returns TRUE (for both columns):
```
SELECT
    POSITION(0x00 IN '😂2'),
    CHAR('😂2' USING utf8) LIKE CONCAT("%", CHAR(0x00 USING utf8), "%");
```

but it should be FALSE - there is no 8 zero consecutive bits in the binary representation. 

I haven't found any mention of this behavior in the documentation. Can you please check if this is a bug or misunderstanding of how these functions work on valid UTF 8 Unicode strings?

How to repeat:
```
SELECT
    LOCATE(UNHEX('00'), "😂"),
    LOCATE(UNHEX('01'), "😂"),
    LOCATE(UNHEX('02'), "😂"),
    LOCATE(UNHEX('03'), "😂"),
    LOCATE(UNHEX('04'), "😂"),
    LOCATE(UNHEX('05'), "😂"),
    LOCATE(UNHEX('06'), "😂"),
    LOCATE(UNHEX('07'), "😂"),
    LOCATE(UNHEX('08'), "😂"),
    LOCATE(UNHEX('09'), "😂"),
    LOCATE(UNHEX('0A'), "😂"),
    LOCATE(UNHEX('0B'), "😂"),
    LOCATE(UNHEX('0C'), "😂"),
    LOCATE(UNHEX('0D'), "😂"),
    LOCATE(UNHEX('0E'), "😂"),
    LOCATE(UNHEX('0F'), "😂"),
    LOCATE(UNHEX('10'), "😂")
;
```

The query above should also return FALSE for all the LOCATE calls but some of them return TRUE.

1: CHAR('😂2' USING utf8) does not give the string '😂2' since CHAR() expects an integer argument. (In this case it is equivalent with CHAR(0 USING utf8). See the documentation of CHAR(): https://dev.mysql.com/doc/refman/8.0/en/string-functions.html#function_char

2: POSITION/LOCATE does not return TRUE/FALSE but a position. See https://dev.mysql.com/doc/refman/8.0/en/string-functions.html#function_locate

But: There is a bug here. A simpler  example:

mysql> select LOCATE(x'00','abcdef');
+------------------------+
| LOCATE(x'00','abcdef') |
+------------------------+
|                      0 |
+------------------------+
1 row in set (0.00 sec)

mysql> select LOCATE(x'00','abcdéf');
+-------------------------+
| LOCATE(x'00','abcdéf')  |
+-------------------------+
|                       5 |
+-------------------------+
1 row in set (0.00 sec)

The first select is correct, the second is not.

If one is only looking for values <= 7f, a workaround is to convert the string to binary. Bytes with value <= 7f is never part of a multibyte UTF-8 encoding. E.g:

mysql> select LOCATE(x'00',convert('abcdéf' using binary));
+-----------------------------------------------+
| LOCATE(x'00',convert('abcdéf' using binary))  |
+-----------------------------------------------+
|                                             0 |
+-----------------------------------------------+
1 row in set (0,01 sec)