MySQL Bugs: #98951: Evaluating regexp character classes in binary string contexts

Bug #98951	Evaluating regexp character classes in binary string contexts
Submitted:	13 Mar 2020 23:14	Modified:	18 Jun 2020 16:49
Reporter:	Jonathan Thiessen	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Charsets	Severity:	S3 (Non-critical)
Version:	8.0.19	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
REGEXP_* functions seem to generally use the subject character set when evaluating character classes. However, this breaks down when working with binary strings.

According to REGEXP_LIKE, binary 0x80 is not printable; but 0xFF is. This lines up with what I'd expect to see out of true latin1 (ISO-8859-1), but not MySQL's notion of latin1 (CP-1252); so the question is — what character set is binary pretending to be for the purposes of character class evaluation, and why doesn't it line up with any of the character_set_* variables?

How to repeat:
mysql> SHOW VARIABLES LIKE '%char%';
+--------------------------+--------------------------------+
| Variable_name            | Value                          |
+--------------------------+--------------------------------+
| character_set_client     | latin1                         |
| character_set_connection | latin1                         |
| character_set_database   | utf8mb4                        |
| character_set_filesystem | binary                         |
| character_set_results    | latin1                         |
| character_set_server     | utf8mb4                        |
| character_set_system     | utf8                           |
| character_sets_dir       | /usr/share/mysql-8.0/charsets/ |
+--------------------------+--------------------------------+
8 rows in set (0.01 sec)

mysql> SELECT REGEXP_LIKE(_binary x'80', '^[:print:]+$');
+--------------------------------------------+
| REGEXP_LIKE(_binary x'80', '^[:print:]+$') |
+--------------------------------------------+
|                                          0 |
+--------------------------------------------+
1 row in set (0.00 sec)

mysql> SELECT REGEXP_LIKE(_latin1 x'80', '^[:print:]+$');
+--------------------------------------------+
| REGEXP_LIKE(_latin1 x'80', '^[:print:]+$') |
+--------------------------------------------+
|                                          1 |
+--------------------------------------------+
1 row in set (0.03 sec)

mysql> SELECT REGEXP_LIKE(_utf8mb4 x'80', '^[:print:]+$');
ERROR 1300 (HY000): Invalid utf8mb4 character string: '80'

mysql> SELECT REGEXP_LIKE(_binary x'FF', '^[:print:]+$');
+--------------------------------------------+
| REGEXP_LIKE(_binary x'FF', '^[:print:]+$') |
+--------------------------------------------+
|                                          1 |
+--------------------------------------------+
1 row in set (0.00 sec)

mysql> SELECT REGEXP_LIKE(_latin1 x'FF', '^[:print:]+$');
+--------------------------------------------+
| REGEXP_LIKE(_latin1 x'FF', '^[:print:]+$') |
+--------------------------------------------+
|                                          1 |
+--------------------------------------------+
1 row in set (0.04 sec)

mysql> SELECT REGEXP_LIKE(_utf8mb4 x'FF', '^[:print:]+$');
ERROR 1300 (HY000): Invalid utf8mb4 character string: 'FF'

Suggested fix:
Option 1.

Evaluating character classes in a binary string context should yield an error (whether using REGEXP or the REGEXP_* functions).

===

Option 2.

a. Evaluating character classes in a binary string context should yield a warning (whether using REGEXP or the REGEXP_* functions); and

b. `REGEXP_LIKE(, '^[:print:]+$')` should evaluate to `0` for all binary strings, . This should of course be consistent across other regexp functions and expressions. Other character classes should (analogously) be trivial (either empty or all encompassing) in the binary string context.

In the suggested fix, option 2b:

`REGEXP_LIKE(, '^[:print:]+$')` could be chosen to evaluate to `1` for all binary strings,  instead.

There's a sense in which binary strings are printable, and a sense in which they are not.

What's important here is that all binary strings are equally printable.

Thank you for the bug report.

Posted by developer:

Fixed in 8.0.22.

Regular expression functions such as REGEXP_LIKE() yielded
inconsistent results with binary string arguments. These functions
now reject binary strings with an error.