Bug #94203 REGEXP_REPLACE adds zero-bytes
Submitted: 5 Feb 2019 4:31 Modified: 25 Apr 2019 5:40
Reporter: Rick James Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: DML Severity:S3 (Non-critical)
Version:8.0.15 OS:Any
Assigned to: CPU Architecture:Any
Tags: REGEXP_REPLACE

[5 Feb 2019 4:31] Rick James
Description:
Notice the spurius NULs:

mysql> SELECT REGEXP_REPLACE('abcd', 'c', '') = 'abd';
+-----------------------------------------+
| REGEXP_REPLACE('abcd', 'c', '') = 'abd' |
+-----------------------------------------+
|                                       1 |
+-----------------------------------------+

mysql> SELECT HEX(REGEXP_REPLACE('abcd', 'c', '')), HEX('abd');
+--------------------------------------+------------+
| HEX(REGEXP_REPLACE('abcd', 'c', '')) | HEX('abd') |
+--------------------------------------+------------+
| 610062006400                         | 616264     |
+--------------------------------------+------------+

mysql> SELECT LENGTH(REGEXP_REPLACE('abcd', 'c', '')), LENGTH('abd');
+-----------------------------------------+---------------+
| LENGTH(REGEXP_REPLACE('abcd', 'c', '')) | LENGTH('abd') |
+-----------------------------------------+---------------+
|                                       6 |             3 |
+-----------------------------------------+---------------+

mysql> SELECT @@version;
+-----------+
| @@version |
+-----------+
| 8.0.15    |
+-----------+

How to repeat:
SELECT REGEXP_REPLACE('abcd', 'c', '') = 'abd';  -- Expect 1, get 1

SELECT HEX(REGEXP_REPLACE('abcd', 'c', '')), HEX('abd');  -- Expect same hex, but no

SELECT LENGTH(REGEXP_REPLACE('abcd', 'c', '')), LENGTH('abd');  -- Expect 3 and 3; get 6 and 3
[5 Feb 2019 5:55] Rick James
Could it be that the intent is to convert to CHARACTER SET ucs2, but not convert back?
[5 Feb 2019 6:10] MySQL Verification Team
Hello Rick James,

Thank you for the report and feedback.

regards,
Umesh
[5 Feb 2019 21:22] Ujjwal Singh
For now the workaround is: an additional REPLACE of CHAR(0) with ''
[24 Apr 2019 22:01] Jon Stephens
Documented fix as follows in the MySQL 8.0.17 changelog:

    The results returned by the functions REGEXP_REPLACE() and
    REGEXP_SUBSTR() used UTF-16 rather than the character set of the
    expression searched for matches.

Also noted the change in the descriptions of these functions in the Manual.

Closed.
[25 Apr 2019 5:40] Rick James
Are these two functions the only string functions that deliberately use a CHARACTER SET other than what the connection is using?