Bug #52080 Make REGEXP to work properly with a multibyte character sets
Submitted: 16 Mar 2010 6:57 Modified: 27 Aug 2010 9:59
Reporter: Pavel Sirovatsky
Status: Duplicate
Category:MySQL Server: General Severity:S4 (Feature request)
Version:5.0.27 OS:Any
Assigned to: CPU Architecture:Any
Tags: REGEXP

[16 Mar 2010 6:57] Pavel Sirovatsky
Description:
REGEXP gives the wrong result when matching non-English (Cyrillic) characters, even though the utf8_unicode_ci collation is applied to the comparison.

How to repeat:
SET NAMES utf8;

SELECT t.name
FROM (SELECT 'Я' as name) as t
WHERE t.name COLLATE utf8_unicode_ci LIKE 'я';

/* This query works as expected and returns the row */

SELECT t.name
FROM (SELECT 'Я' as name) as t
WHERE t.name COLLATE utf8_unicode_ci REGEXP 'я';

/* This query returns no rows */
[16 Mar 2010 8:32] Valeriy Kravchuk
While this is easily repeatable, I believe our manual (http://dev.mysql.com/doc/refman/5.0/en/regexp.html#operator_regexp) already explains this behavior:

"Warning

The REGEXP and RLIKE operators work in byte-wise fashion, so they are not multi-byte safe and may produce unexpected results with multi-byte character sets. In addition, these operators compare characters by their byte values and accented characters may not compare as equal even if a given collation treats them as equal."
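
The byte-wise behavior described in the warning can be illustrated outside MySQL. The following is a minimal Python sketch, not the server's implementation: it assumes that a case-insensitive match over raw UTF-8 bytes is a fair model of what the quoted warning describes. All variable names are made up for this illustration.

```python
import re

upper = "Я"  # U+042F, CYRILLIC CAPITAL LETTER YA
lower = "я"  # U+044F, CYRILLIC SMALL LETTER YA

# In UTF-8 the two letters are different two-byte sequences.
upper_bytes = upper.encode("utf-8")  # b'\xd0\xaf'
lower_bytes = lower.encode("utf-8")  # b'\xd1\x8f'

# Byte-wise matching: with a bytes pattern, IGNORECASE only folds ASCII
# letters, so the non-ASCII bytes must be identical to match.
bytewise_match = re.search(re.escape(lower_bytes), upper_bytes, re.IGNORECASE) is not None

# Character-aware matching: with a str pattern, IGNORECASE uses Unicode
# case folding, so the two letters compare as equal.
charwise_match = re.search(re.escape(lower), upper, re.IGNORECASE) is not None

# Listing both byte sequences explicitly sidesteps case folding altogether.
either_case = b"(" + re.escape(upper_bytes) + b"|" + re.escape(lower_bytes) + b")"
alternation_match = re.search(either_case, upper_bytes) is not None

print(upper_bytes.hex(), lower_bytes.hex())
print(bytewise_match, charwise_match, alternation_match)
```

The alternation at the end hints at a possible workaround for the query above: a pattern that spells out both cases, such as 'Я|я'. Whether that is adequate for a given server version and data set is an assumption to verify, not something this thread confirms.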
[17 Mar 2010 7:38] Pavel Sirovatsky
Changed status to feature request.
[17 Mar 2010 7:45] Valeriy Kravchuk
Making REGEXP work properly with multibyte character sets sounds like a reasonable and useful feature request.
[27 Aug 2010 9:59] Valeriy Kravchuk
Duplicate of Bug#30241.