MySQL Bugs: #52080: Make REGEXP to work properly with a multibyte character sets

Bug #52080	Make REGEXP to work properly with a multibyte character sets
Submitted:	16 Mar 2010 6:57	Modified:	27 Aug 2010 9:59
Reporter:	Pavel Sirovatsky
Status:	Duplicate
Category:	MySQL Server: General	Severity:	S4 (Feature request)
Version:	5.0.27	OS:	Any
Assigned to:		CPU Architecture:	Any
Tags:	REGEXP

Description:
REGEXP gives the wrong result when comparing non-English (Cyrillic) characters. The utf8_unicode_ci collation is used.

How to repeat:
SET NAMES utf8;

SELECT t.name
FROM (SELECT 'Я' as name) as t
WHERE t.name COLLATE utf8_unicode_ci LIKE 'я';

/* This query works as expected */

SELECT t.name
FROM (SELECT 'Я' as name) as t
WHERE t.name COLLATE utf8_unicode_ci REGEXP 'я';

/* This query does not return a result */

While this is easily repeatable, IMHO our manual (http://dev.mysql.com/doc/refman/5.0/en/regexp.html#operator_regexp) already explains this behavior:

"Warning

The REGEXP and RLIKE operators work in byte-wise fashion, so they are not multi-byte safe and may produce unexpected results with multi-byte character sets. In addition, these operators compare characters by their byte values and accented characters may not compare as equal even if a given collation treats them as equal."
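
The byte-wise comparison described in the warning can be illustrated outside MySQL. The following is a minimal Python sketch offered as an analogy only; it is not MySQL's implementation, and the helper names `bytewise_match` and `charwise_match` are hypothetical. It shows that 'Я' and 'я' share no bytes in UTF-8, so a matcher that folds case only for ASCII bytes cannot treat them as equal, while a character-aware matcher can.

```python
import re

# Cyrillic capital and small YA encoded as UTF-8: two bytes each, none shared.
UPPER_YA = "Я".encode("utf-8")  # b'\xd0\xaf'
LOWER_YA = "я".encode("utf-8")  # b'\xd1\x8f'


def bytewise_match(pattern: str, text: str) -> bool:
    """Search on raw UTF-8 bytes. With a bytes pattern, re.IGNORECASE
    folds case for ASCII letters only, so multibyte letters must match
    byte for byte."""
    found = re.search(pattern.encode("utf-8"), text.encode("utf-8"), re.IGNORECASE)
    return found is not None


def charwise_match(pattern: str, text: str) -> bool:
    """Search on Unicode characters. With a str pattern, re.IGNORECASE
    applies Unicode-aware case folding."""
    return re.search(pattern, text, re.IGNORECASE) is not None


if __name__ == "__main__":
    print(bytewise_match("a", "A"))   # ASCII letters: case-insensitive match works
    print(bytewise_match("я", "Я"))   # multibyte letters: byte values differ
    print(charwise_match("я", "Я"))   # character-aware comparison
```

The LIKE query in the report behaves like `charwise_match` because it honors the collation, whereas the REGEXP query behaves like `bytewise_match`.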

Changed status to feature request.

Making REGEXP work properly with multibyte character sets sounds like a reasonable and useful feature request.

Duplicate of Bug#30241.