Bug #98950 REGEXP_REPLACE and REGEXP_SUBSTR still add null bytes
Submitted: 13 Mar 2020 21:25 Modified: 6 Jul 2020 13:20
Reporter: Jonathan Thiessen Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Charsets Severity:S3 (Non-critical)
Version:8.0.19 OS:Any
Assigned to: CPU Architecture:Any

[13 Mar 2020 21:25] Jonathan Thiessen
Description:
A specialization of https://bugs.mysql.com/bug.php?id=94203, seemingly limited to the binary charset.

When performing either REGEXP_REPLACE or REGEXP_SUBSTR on binary strings, every byte of the result appears to be right-padded with/appended by a null byte (consistent with the utf16le result bug in <8.0.17 for other character sets).

The functions are in fact returning binary strings, they just have the wrong contents (as evidenced by `HEX` and `CHARSET`).

One workaround is to wrap REGEXP_REPLACE and REGEXP_SUBSTR calls in HEX/UNHEX, and work on encoded strings representing the original binary contents, but that's a hack, and can lead to REGEXP related timeouts depending on the size of the expression.

How to repeat:
```
$ docker run --rm --name mysql -e MYSQL_ALLOW_EMPTY_PASSWORD=true -d mysql:8.0.19

...

$ docker exec -it mysql mysql -uroot

...

mysql> SELECT HEX(REGEXP_REPLACE(x'0123456789ABCDEF', '[\\x23\\xCD]', x'FF'));
+-----------------------------------------------------------------+
| HEX(REGEXP_REPLACE(x'0123456789ABCDEF', '[\\x23\\xCD]', x'FF')) |
+-----------------------------------------------------------------+
| 0100FF00450067008900AB00FF00EF00                                |
+-----------------------------------------------------------------+
1 row in set (0.01 sec)

mysql> SELECT HEX(REGEXP_SUBSTR(x'0123456789ABCDEF', '[\\x30-\\x50].+[\\xAB]'));
+-------------------------------------------------------------------+
| HEX(REGEXP_SUBSTR(x'0123456789ABCDEF', '[\\x30-\\x50].+[\\xAB]')) |
+-------------------------------------------------------------------+
| 450067008900AB00                                                  |
+-------------------------------------------------------------------+
1 row in set (0.00 sec)
```
[14 Mar 2020 1:53] MySQL Verification Team
Thank you for the bug report.
[6 Jul 2020 12:42] Tor Didriksen
Posted by developer:
 
Fixed by the same patch as:
Bug#31031886: EVALUATING REGEXP CHARACTER CLASSES IN BINARY STRING CONTEXTS
[6 Jul 2020 13:20] Paul DuBois
Posted by developer:
 
Fixed in 8.0.22.

Regular expression functions such as REGEXP_LIKE() yielded
inconsistent results with binary string arguments. These functions
now reject binary strings with an error.