Bug #75438 REGEXP is case sensitive for most russian letters
Submitted: 7 Jan 2015 18:21 Modified: 25 Nov 2019 22:15
Reporter: Павел Сикорский Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: DML Severity:S3 (Non-critical)
Version:5.1.68, 5.6.20, 5.6.24 OS:Any
Assigned to: CPU Architecture:Any
Tags: REGEXP, russian, utf8_unicode_ci

[7 Jan 2015 18:21] Павел Сикорский
Description:
Strange REGEXP behavior. It is case-sensitive for some russian letters, and half-case-sensitive for others.

How to repeat:
During investigation I've generated file with such command for every russian and english letters: 
SELECT 
  @v:='a' COLLATE utf8_unicode_ci letter, 
  UCASE(@v) REGEXP LCASE(@v) RUvsL, 
  LCASE(@v) REGEXP UCASE(@v) RLvsU, 
  UCASE(@v) LIKE LCASE(@v) LUvsL, 
  LCASE(@v) LIKE UCASE(@v) LLvsU, 
  UCASE(@v), LCASE(@v);

For all english letters result looks like this:
a       1       1       1       1       A       a

For russian letters from 'а' to 'п':
а       0       1       1       1       А       а
п       0       1       1       1       П       п

For the others:
р       0       0       1       1       Р       р
я       0       0       1       1       Я       я
[7 Jan 2015 18:32] Павел Сикорский
tags changed
[7 Jan 2015 19:42] Peter Laursen
Trying with Danish character 'æ':

SELECT 
  @v:='æ' COLLATE utf8_unicode_ci letter, 
  UCASE(@v) REGEXP LCASE(@v) RUvsL, 
  LCASE(@v) REGEXP UCASE(@v) RLvsU, 
  UCASE(@v) LIKE LCASE(@v) LUvsL, 
  LCASE(@v) LIKE UCASE(@v) LLvsU, 
  UCASE(@v), LCASE(@v);
/* retruns

letter   RUvsL   RLvsU   LUvsL   LLvsU  UCASE(@v)  LCASE(@v)  
------  ------  ------  ------  ------  ---------  -----------
æ            0       1       1       1  Æ          æ         
*/

I think that regular expressions have serious issues with non-ASCII characters. I remember I bumped into something similar before.

-- Peter
-- not an Oracle/MySQL person
[7 Jan 2015 19:47] Peter Laursen
see also 

http://bugs.mysql.com/bug.php?id=63439 (umlauts as in Swedish, German, Hungarian etc.) 
http://bugs.mysql.com/bug.php?id=30241 (hebrew)
[8 Jan 2015 7:40] MySQL Verification Team
Hello Павел Сикорский,

Thank you for the report and test case.

Thanks,
Umesh
[8 Jan 2015 7:42] MySQL Verification Team
// 5.6.24

mysql> SELECT
    ->   @v:='a' COLLATE utf8_unicode_ci letter,
    ->   UCASE(@v) REGEXP LCASE(@v) RUvsL,
    ->   LCASE(@v) REGEXP UCASE(@v) RLvsU,
    ->   UCASE(@v) LIKE LCASE(@v) LUvsL,
    ->   LCASE(@v) LIKE UCASE(@v) LLvsU,
    ->   UCASE(@v), LCASE(@v);
+--------+-------+-------+-------+-------+-----------+-----------+
| letter | RUvsL | RLvsU | LUvsL | LLvsU | UCASE(@v) | LCASE(@v) |
+--------+-------+-------+-------+-------+-----------+-----------+
| a      |     1 |     1 |     1 |     1 | A         | a         |
+--------+-------+-------+-------+-------+-----------+-----------+
1 row in set (0.00 sec)

mysql> SELECT                                                                                                                                                                                         ->   @v:='п' COLLATE utf8_unicode_ci letter,
    ->   UCASE(@v) REGEXP LCASE(@v) RUvsL,
    ->   LCASE(@v) REGEXP UCASE(@v) RLvsU,
    ->   UCASE(@v) LIKE LCASE(@v) LUvsL,
    ->   LCASE(@v) LIKE UCASE(@v) LLvsU,
    ->   UCASE(@v), LCASE(@v);
+--------+-------+-------+-------+-------+-----------+-----------+
| letter | RUvsL | RLvsU | LUvsL | LLvsU | UCASE(@v) | LCASE(@v) |
+--------+-------+-------+-------+-------+-----------+-----------+
| п      |     0 |     1 |     1 |     1 | П         | п         |
+--------+-------+-------+-------+-------+-----------+-----------+
1 row in set (0.00 sec)

mysql> show variables like '%version%';
+-------------------------+---------------------------------------------------------+
| Variable_name           | Value                                                   |
+-------------------------+---------------------------------------------------------+
| innodb_version          | 5.6.24                                                  |
| protocol_version        | 10                                                      |
| slave_type_conversions  |                                                         |
| version                 | 5.6.24-enterprise-commercial-advanced                   |
| version_comment         | MySQL Enterprise Server - Advanced Edition (Commercial) |
| version_compile_machine | x86_64                                                  |
| version_compile_os      | linux-glibc2.5                                          |
+-------------------------+---------------------------------------------------------+
7 rows in set (0.00 sec)

mysql> \s
--------------
bin/mysql  Ver 14.14 Distrib 5.6.24, for linux-glibc2.5 (x86_64) using  EditLine wrapper

Connection id:          1
Current database:       test
Current user:           root@localhost
SSL:                    Not in use
Current pager:          more
Using outfile:          ''
Using delimiter:        ;
Server version:         5.6.24-enterprise-commercial-advanced MySQL Enterprise Server - Advanced Edition (Commercial)
Protocol version:       10
Connection:             Localhost via UNIX socket
Server characterset:    latin1
Db     characterset:    latin1
Client characterset:    utf8
Conn.  characterset:    utf8
UNIX socket:            /tmp/75438.sock
Uptime:                 4 min 39 sec

Threads: 1  Questions: 12  Slow queries: 0  Opens: 67  Flush tables: 1  Open tables: 60  Queries per second avg: 0.043
--------------
[25 Nov 2019 22:15] Roy Lyseng
Posted by developer:
 
Fixed in 8.0 with new REGEXP implementation