Bug #30241 Regular expression problems
Submitted: 4 Aug 2007 20:39
Modified: 29 Sep 2008 7:09
Reporter: Joshua Brickel
Status: Verified
Category: MySQL Server: Charsets
Severity: S2 (Serious)
Version: 5.1, 6.0
OS: Linux
Assigned to: Assigned Account
CPU Architecture: Any
Tags: multi-byte, REGEXP, regular expression

[4 Aug 2007 20:39] Joshua Brickel
Description:
With a plain ASCII regular expression, the following works correctly:

select "ab cd" regexp binary "^[^c]"  ==> 1

However, when I try a similar expression in Hebrew:

select "אב רק" regexp binary "^[^ב]"

it returns 0, but it should return 1 (true). Hebrew is a right-to-left language, so this problem may occur with other right-to-left languages too.

How to repeat:
select "אב רק" regexp binary "^[^ב]"

Suggested fix:
1) Fix REGEXP so that right-to-left languages work properly.
or
2) Allow non-ASCII characters to be included by specifying them via their hexadecimal UTF-8 or UCS-2 values, and document this workaround for right-to-left languages.
[6 Aug 2007 8:51] Sveta Smirnova
Thank you for the report.

Verified as described.

Workaround: select "אב רק" regexp binary "^[ב^]";
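
A rough sketch of why this workaround returns 1, assuming REGEXP compares byte by byte. Python's bytes-mode `re` is used here only as a stand-in for a byte-wise matcher; it is not MySQL's regex library, and the helper name `starts_with_class_bytewise` is hypothetical. Under that assumption, the class `[ב^]` contains the bytes 0xD7, 0x91 and `^`, so it accepts any string whose first byte is 0xD7, which is the lead byte of every Hebrew letter in UTF-8. It therefore also accepts strings that begin with ב itself, and rejects ASCII strings, so it does not behave like "does not start with ב".

```python
import re

# Sketch only: bytes-mode `re` stands in for a byte-wise regex engine.
def starts_with_class_bytewise(class_body: str, text: str) -> bool:
    """Test ^[class_body] against text, with both encoded as UTF-8 bytes."""
    pattern = b"^[" + class_body.encode("utf-8") + b"]"
    return re.search(pattern, text.encode("utf-8")) is not None

BET = "\u05d1"                        # the letter bet, UTF-8 bytes D7 91
SUBJECT = "\u05d0\u05d1 \u05e8\u05e7"  # the string from the report
workaround_class = BET + "^"          # class body of the workaround pattern

# First byte of SUBJECT is 0xD7, which is in the class.
print(starts_with_class_bytewise(workaround_class, SUBJECT))            # True
# A string that starts with bet also begins with 0xD7, so it matches too.
print(starts_with_class_bytewise(workaround_class, BET + "\u05d0"))     # True
# An ASCII string does not start with bet, yet the class rejects it.
print(starts_with_class_bytewise(workaround_class, "ab cd"))            # False
```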
[7 Aug 2007 10:09] Sergei Golubchik
The manual needs to clarify in http://dev.mysql.com/doc/refman/5.0/en/regexp.html that our regexp/rlike works bytewise, and may produce unexpected results with multi-byte character sets.
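
To illustrate the byte-wise behaviour described above, here is a minimal sketch in Python. It assumes UTF-8 data and uses Python's `re` module purely as a model: bytes mode approximates a byte-wise engine, str mode approximates a multi-byte-aware one. The function names are hypothetical and nothing here is MySQL code. In UTF-8 the subject string starts with the bytes D7 90, and the excluded letter ב is D7 91, so a byte-wise negated class excludes the byte D7 and fails on the very first byte.

```python
import re

SUBJECT = "\u05d0\u05d1 \u05e8\u05e7"  # the Hebrew string from the report
BET = "\u05d1"                        # the letter excluded by the pattern

def not_starting_with_bytewise(text: str, letter: str) -> bool:
    """Model of ^[^letter] applied to raw UTF-8 bytes."""
    pattern = b"^[^" + letter.encode("utf-8") + b"]"
    return re.search(pattern, text.encode("utf-8")) is not None

def not_starting_with_charwise(text: str, letter: str) -> bool:
    """Model of ^[^letter] applied to whole characters."""
    return re.search("^[^" + letter + "]", text) is not None

print(SUBJECT.encode("utf-8").hex(" "))          # d7 90 d7 91 20 d7 a8 d7 a7
print(not_starting_with_bytewise("ab cd", "c"))  # True  (ASCII case works)
print(not_starting_with_bytewise(SUBJECT, BET))  # False (the reported result)
print(not_starting_with_charwise(SUBJECT, BET))  # True  (the expected result)
```

The sketch suggests the failure comes from the shared lead byte of multi-byte characters rather than from text direction, so other multi-byte scripts could be affected in the same way.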
[7 Aug 2007 16:31] Paul DuBois
Thank you for your bug report. This issue has been addressed in the documentation. The updated documentation will appear on our website shortly, and will be included in the next release of the relevant products.

Added note to indicate the byte-wise operation of these operators and that they might not work as expected for multi-byte character sets.
[7 Aug 2007 20:52] Joshua Brickel
Just my two cents.  I think since MySQL wants to be a database for the world, it would be better if their regular expressions supported multi-byte characters and did not simply get rid of this bug via a statement in documentation stating that the regexp does not support multi-byte characters.  

It seems that Trolltech's Qt 3 supports multi-byte regular expression searches, so it does seem to be possible. I would at least recommend this as a request for enhancement.
[7 Aug 2007 21:31] Sergei Golubchik
Yes, absolutely. We want to change the regex library; it's on the to-do list.
It is certainly possible, there are multi-byte aware regex libraries out there.
[24 Nov 2008 13:31] Alexander Barkov
See also bug#34473
[21 Feb 2009 22:00] Andreas Götz
So... this issue is 18 months old and there is no sign of a multi-byte regexp library anywhere, neither in MySQL 5.0, 5.1 nor 6.0. Is it ever gonna happen?
[23 Aug 2010 19:19] Sveta Smirnova
Bug #54576 was marked as duplicate of this one.
[27 Aug 2010 10:01] Valeriy Kravchuk
Bug #52080 was marked as a duplicate of this one.
[15 May 2012 16:24] Valeriy Kravchuk
Bug #64370 was marked as a duplicate of this one.
[23 Feb 2016 14:15] Ghanshyam Patel
Use the lib_mysqludf_preg library from the MySQL UDF repository to run PCRE regular expressions directly in MySQL.

http://www.mysqludf.org/
https://github.com/mysqludf/lib_mysqludf_preg#readme
[3 Jan 9:13] Daniël van Eeden
Regular expression support was improved in MySQL 8.0:
https://dev.mysql.com/doc/refman/8.0/en/mysql-nutshell.html

So this issue might not be present in new versions.