Description:
REGEXP_* functions seem to generally use the subject character set when evaluating character classes. However, this breaks down when working with binary strings.
According to REGEXP_LIKE, binary 0x80 is not printable; but 0xFF is. This lines up with what I'd expect to see out of true latin1 (ISO-8859-1), but not MySQL's notion of latin1 (CP-1252); so the question is — what character set is binary pretending to be for the purposes of character class evaluation, and why doesn't it line up with any of the character_set_* variables?
How to repeat:
mysql> SHOW VARIABLES LIKE '%char%';
+--------------------------+--------------------------------+
| Variable_name | Value |
+--------------------------+--------------------------------+
| character_set_client | latin1 |
| character_set_connection | latin1 |
| character_set_database | utf8mb4 |
| character_set_filesystem | binary |
| character_set_results | latin1 |
| character_set_server | utf8mb4 |
| character_set_system | utf8 |
| character_sets_dir | /usr/share/mysql-8.0/charsets/ |
+--------------------------+--------------------------------+
8 rows in set (0.01 sec)
mysql> SELECT REGEXP_LIKE(_binary x'80', '^[:print:]+$');
+--------------------------------------------+
| REGEXP_LIKE(_binary x'80', '^[:print:]+$') |
+--------------------------------------------+
| 0 |
+--------------------------------------------+
1 row in set (0.00 sec)
mysql> SELECT REGEXP_LIKE(_latin1 x'80', '^[:print:]+$');
+--------------------------------------------+
| REGEXP_LIKE(_latin1 x'80', '^[:print:]+$') |
+--------------------------------------------+
| 1 |
+--------------------------------------------+
1 row in set (0.03 sec)
mysql> SELECT REGEXP_LIKE(_utf8mb4 x'80', '^[:print:]+$');
ERROR 1300 (HY000): Invalid utf8mb4 character string: '80'
mysql> SELECT REGEXP_LIKE(_binary x'FF', '^[:print:]+$');
+--------------------------------------------+
| REGEXP_LIKE(_binary x'FF', '^[:print:]+$') |
+--------------------------------------------+
| 1 |
+--------------------------------------------+
1 row in set (0.00 sec)
mysql> SELECT REGEXP_LIKE(_latin1 x'FF', '^[:print:]+$');
+--------------------------------------------+
| REGEXP_LIKE(_latin1 x'FF', '^[:print:]+$') |
+--------------------------------------------+
| 1 |
+--------------------------------------------+
1 row in set (0.04 sec)
mysql> SELECT REGEXP_LIKE(_utf8mb4 x'FF', '^[:print:]+$');
ERROR 1300 (HY000): Invalid utf8mb4 character string: 'FF'
Suggested fix:
Option 1.
Evaluating character classes in a binary string context should yield an error (whether using REGEXP or the REGEXP_* functions).
===
Option 2.
a. Evaluating character classes in a binary string context should yield a warning (whether using REGEXP or the REGEXP_* functions); and
b. `REGEXP_LIKE(, '^[:print:]+$')` should evaluate to `0` for all binary strings, . This should of course be consistent across other regexp functions and expressions. Other character classes should (analogously) be trivial (either empty or all encompassing) in the binary string context.