Bug #34473 SELECT...RLIKE returns wrong data on UTF8 connection
Submitted: 11 Feb 2008 21:19 Modified: 24 Nov 2008 13:31
Reporter: Andreas Götz Email Updates:
Status: Duplicate Impact on me:
None 
Category:MySQL Server: General Severity:S4 (Feature request)
Version:5.0, 5.1, 6.0 OS:Any
Assigned to: CPU Architecture:Any
Tags: character set, collation, regular expression

[11 Feb 2008 21:19] Andreas Götz
Description:
I'm running this query against my utf-8 database:

SELECT title
FROM videodata
WHERE title RLIKE '^[Ö]'

which returns this result:

Ørneborgen

While Ö is a German Umlaut, the Ø is a danish character- and both are totally different. Using LIKE, the behaviour does not occur:

SELECT title
FROM test
WHERE title LIKE 'Ö%'

How to repeat:
test case:

create table test (
  title VARCHAR(255)
) CHARACTER SET UTF8, ENGINE=MyISAM;

insert into test (title) values ('Ørneborgen');

SELECT title
FROM test
WHERE title RLIKE '^[Ö]'

Suggested fix:
Ensure character encoding for the regexp parser
[11 Feb 2008 21:40] Andreas Götz
Mhhhm. Now I've just seen this in the MySQL docs:

Warning
The REGEXP and RLIKE operators work in byte-wise fashion, so they are not multi-byte safe and may produce unexpected results with multi-byte character sets. In addition, these operators compare characters by their byte values and accented characters may not compare as equal even if a given collation treats them as equal. 

Does that mean I'm on my own? I do have a nice UTF8 database, all my code is unicode-aware and I'm struggeling at the finish line?

I think you can persuade Perl regexps with //u to work with unicode characters, too- but not MySQL?
[12 Feb 2008 0:41] Sergei Golubchik
One possibility would be to install a UDF that implements unicode-aware regexp. Either google for it or create it yourself. Actually, I remember seeing pcre-compatible UDF somewhere, could be good enough
[12 Feb 2008 8:06] Andreas Götz
Thanks for the hint. Doesn't work for the typical hosting environment, though.

Could we treat this as an enhancement request? MySQL is now fully unicode-aware, from connection to database. Only the regexp parser insists on byte mode??
[12 Feb 2008 9:33] Sveta Smirnova
There is WorkLog for this feature already: http://forge1.mysql.com/worklog/task.php?id=353

According to this I am closing this report as "Not a Bug". But if you wish it is possible to convert this report to feature request.
[12 Feb 2008 12:58] Andreas Götz
Changed to feature request. As it is, regexp is unusable when working in unicode environment.
[12 Feb 2008 19:31] Susanne Ebrecht
Andreas,

many thanks for writing a bug report.

I am pretty sure, you are more family with Danish, then me.

At the moment, we just support one UTF-8 collation (utf8_unicode_ci) for several languages.

The German "Dudensortierung" is implemented here.

For more informations about German collations you can look to:
http://de.wikipedia.org/wiki/Alphabetische_Sortierung

Or just compare the books: "Duden" and "Telefonbuch". You will recognise, that there are differences.

At utf8_unicode_ci, there is the rule: ö=o

As far as I know, the Danish Ø is similar to the German ö.

Because there is an own utf8_danish_ci, the utf8_unicode_ci has the German rules here. And therefor ö and Ø = o.

As Sveta told you, it's already on our "todo" to make an utf8 collation that's more familiar for Germans.

But I think, this won't help you.

If I understand you right, you want a collation mix from German and Danish.

Therefor the best way is, to create your own collation.

You can edit: DATADIRECTORY/share/mysql/charsets/latin1.xml

This link could help you too:
http://forge.mysql.com/wiki/How_to_Add_a_Collation
[12 Feb 2008 21:52] Andreas Götz
Susanne,

I really appreciate your extensive comment, but I believe it is besides the point. I'm fluent in German, not so much Danish, but the problem here is not collations- I believe they work just fine. The problem is REGEXP.

If you look at the sample- it doesn't contain sorting anywhere. Problem is that the RLIKE operator compares the danish Ö or what ever multi-byte character it is in a byte-wise fashion instead of per character. To Ö is treated as two byte values. If my DB now contains a title starting with O (only one byte in UTF8) and another character, the the first character Ö of the pattern (2 bytes) is compares with the first 2 characters (also 2 bytes) in the database- which is wrong as we don't care about the byte order of any encoding but of the logical order (i.e. collation) of characters.

I'd be entirely happy with the current collation functionality- if REGEXP would only honor it...

Thanks,
Andreas
[14 Feb 2008 18:10] Sergei Golubchik
Bug#34552 was marked as a duplicate of this bug
[29 Sep 2008 13:01] Georgi Kodinov
This is a related to Bug #1103 and Bug #30635. 
Bug #344552 is a duplicate of this bug.
Since the documentation states that this is a known limitation of MySQL implementation of RLIKE I'm moving it back to "feature request".
[29 Sep 2008 16:37] Andreas Götz
Thanks for the update.

To be clear- this bug is NOT about collations as Susanne summarized. This bug is about multi-byte character set support for the regexp library.

It is a feature request- and at the same time a real shame that a DB claiming utf8 support is unable to run regular expressions on it. 

If feature request then it should be one of priority...
[24 Nov 2008 13:31] Alexander Barkov
This bug is a duplicate for Bug#30241
[21 Feb 2009 18:54] Will Robertson
Would there be any possibility of a work-around by specifying a multi-byte character as individual bytes in the regular expression string?

For example a two-byte UTF8 character to be searched for could be specified in the RegEx string as two single bytes side by side.

Please let me know if anyone can build on this idea or if the approach I'm suggesting is overly-simplistic.

This problem is causing us major problems and I've heard from several others that it's causing them big headaches so any suggestions would be very valuable.

Will