Bug #98951 Evaluating regexp character classes in binary string contexts
Submitted: 13 Mar 2020 23:14 Modified: 18 Jun 2020 16:49
Reporter: Jonathan Thiessen Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Charsets Severity:S3 (Non-critical)
Version:8.0.19 OS:Any
Assigned to: CPU Architecture:Any

[13 Mar 2020 23:14] Jonathan Thiessen
Description:
REGEXP_* functions seem to generally use the subject character set when evaluating character classes. However, this breaks down when working with binary strings.

According to REGEXP_LIKE, binary 0x80 is not printable; but 0xFF is. This lines up with what I'd expect to see out of true latin1 (ISO-8859-1), but not MySQL's notion of latin1 (CP-1252); so the question is — what character set is binary pretending to be for the purposes of character class evaluation, and why doesn't it line up with any of the character_set_* variables?

How to repeat:
mysql> SHOW VARIABLES LIKE '%char%';
+--------------------------+--------------------------------+
| Variable_name            | Value                          |
+--------------------------+--------------------------------+
| character_set_client     | latin1                         |
| character_set_connection | latin1                         |
| character_set_database   | utf8mb4                        |
| character_set_filesystem | binary                         |
| character_set_results    | latin1                         |
| character_set_server     | utf8mb4                        |
| character_set_system     | utf8                           |
| character_sets_dir       | /usr/share/mysql-8.0/charsets/ |
+--------------------------+--------------------------------+
8 rows in set (0.01 sec)

mysql> SELECT REGEXP_LIKE(_binary x'80', '^[:print:]+$');
+--------------------------------------------+
| REGEXP_LIKE(_binary x'80', '^[:print:]+$') |
+--------------------------------------------+
|                                          0 |
+--------------------------------------------+
1 row in set (0.00 sec)

mysql> SELECT REGEXP_LIKE(_latin1 x'80', '^[:print:]+$');
+--------------------------------------------+
| REGEXP_LIKE(_latin1 x'80', '^[:print:]+$') |
+--------------------------------------------+
|                                          1 |
+--------------------------------------------+
1 row in set (0.03 sec)

mysql> SELECT REGEXP_LIKE(_utf8mb4 x'80', '^[:print:]+$');
ERROR 1300 (HY000): Invalid utf8mb4 character string: '80'

mysql> SELECT REGEXP_LIKE(_binary x'FF', '^[:print:]+$');
+--------------------------------------------+
| REGEXP_LIKE(_binary x'FF', '^[:print:]+$') |
+--------------------------------------------+
|                                          1 |
+--------------------------------------------+
1 row in set (0.00 sec)

mysql> SELECT REGEXP_LIKE(_latin1 x'FF', '^[:print:]+$');
+--------------------------------------------+
| REGEXP_LIKE(_latin1 x'FF', '^[:print:]+$') |
+--------------------------------------------+
|                                          1 |
+--------------------------------------------+
1 row in set (0.04 sec)

mysql> SELECT REGEXP_LIKE(_utf8mb4 x'FF', '^[:print:]+$');
ERROR 1300 (HY000): Invalid utf8mb4 character string: 'FF'

Suggested fix:
Option 1.

Evaluating character classes in a binary string context should yield an error (whether using REGEXP or the REGEXP_* functions).

===

Option 2.

a. Evaluating character classes in a binary string context should yield a warning (whether using REGEXP or the REGEXP_* functions); and

b. `REGEXP_LIKE(, '^[:print:]+$')` should evaluate to `0` for all binary strings, . This should of course be consistent across other regexp functions and expressions. Other character classes should (analogously) be trivial (either empty or all encompassing) in the binary string context.
[13 Mar 2020 23:20] Jonathan Thiessen
In the suggested fix, option 2b:

`REGEXP_LIKE(, '^[:print:]+$')` could be chosen to evaluate to `1` for all binary strings,  instead.

There's a sense in which binary strings are printable, and a sense in which they are not.

What's important here is that all binary strings are equally printable.
[14 Mar 2020 1:47] MySQL Verification Team
Thank you for the bug report.
[18 Jun 2020 16:49] Paul DuBois
Posted by developer:

Fixed in 8.0.22.

Regular expression functions such as REGEXP_LIKE() yielded
inconsistent results with binary string arguments. These functions
now reject binary strings with an error.