MySQL Bugs: #34473: SELECT...RLIKE returns wrong data on UTF8 connection

Bug #34473	SELECT...RLIKE returns wrong data on UTF8 connection
Submitted:	11 Feb 2008 21:19	Modified:	24 Nov 2008 13:31
Reporter:	Andreas Götz	Email Updates:
Status:	Duplicate	Impact on me:	None
Category:	MySQL Server: General	Severity:	S4 (Feature request)
Version:	5.0, 5.1, 6.0	OS:	Any
Assigned to:		CPU Architecture:	Any
Tags:	character set, collation, regular expression

Description:
I'm running this query against my utf-8 database:

SELECT title
FROM videodata
WHERE title RLIKE '^[Ö]'

which returns this result:

Ørneborgen

While Ö is a German Umlaut, the Ø is a danish character- and both are totally different. Using LIKE, the behaviour does not occur:

SELECT title
FROM test
WHERE title LIKE 'Ö%'

How to repeat:
test case:

create table test (
  title VARCHAR(255)
) CHARACTER SET UTF8, ENGINE=MyISAM;

insert into test (title) values ('Ørneborgen');

SELECT title
FROM test
WHERE title RLIKE '^[Ö]'

Suggested fix:
Ensure character encoding for the regexp parser

Mhhhm. Now I've just seen this in the MySQL docs:

Warning
The REGEXP and RLIKE operators work in byte-wise fashion, so they are not multi-byte safe and may produce unexpected results with multi-byte character sets. In addition, these operators compare characters by their byte values and accented characters may not compare as equal even if a given collation treats them as equal. 

Does that mean I'm on my own? I do have a nice UTF8 database, all my code is unicode-aware and I'm struggeling at the finish line?

I think you can persuade Perl regexps with //u to work with unicode characters, too- but not MySQL?

One possibility would be to install a UDF that implements unicode-aware regexp. Either google for it or create it yourself. Actually, I remember seeing pcre-compatible UDF somewhere, could be good enough

Thanks for the hint. Doesn't work for the typical hosting environment, though.

Could we treat this as an enhancement request? MySQL is now fully unicode-aware, from connection to database. Only the regexp parser insists on byte mode??

There is WorkLog for this feature already: http://forge1.mysql.com/worklog/task.php?id=353

According to this I am closing this report as "Not a Bug". But if you wish it is possible to convert this report to feature request.

Changed to feature request. As it is, regexp is unusable when working in unicode environment.

Andreas,

many thanks for writing a bug report.

I am pretty sure, you are more family with Danish, then me.

At the moment, we just support one UTF-8 collation (utf8_unicode_ci) for several languages.

The German "Dudensortierung" is implemented here.

For more informations about German collations you can look to:
http://de.wikipedia.org/wiki/Alphabetische_Sortierung

Or just compare the books: "Duden" and "Telefonbuch". You will recognise, that there are differences.

At utf8_unicode_ci, there is the rule: ö=o

As far as I know, the Danish Ø is similar to the German ö.

Because there is an own utf8_danish_ci, the utf8_unicode_ci has the German rules here. And therefor ö and Ø = o.

As Sveta told you, it's already on our "todo" to make an utf8 collation that's more familiar for Germans.

But I think, this won't help you.

If I understand you right, you want a collation mix from German and Danish.

Therefor the best way is, to create your own collation.

You can edit: DATADIRECTORY/share/mysql/charsets/latin1.xml

This link could help you too:
http://forge.mysql.com/wiki/How_to_Add_a_Collation

Susanne,

I really appreciate your extensive comment, but I believe it is besides the point. I'm fluent in German, not so much Danish, but the problem here is not collations- I believe they work just fine. The problem is REGEXP.

If you look at the sample- it doesn't contain sorting anywhere. Problem is that the RLIKE operator compares the danish Ö or what ever multi-byte character it is in a byte-wise fashion instead of per character. To Ö is treated as two byte values. If my DB now contains a title starting with O (only one byte in UTF8) and another character, the the first character Ö of the pattern (2 bytes) is compares with the first 2 characters (also 2 bytes) in the database- which is wrong as we don't care about the byte order of any encoding but of the logical order (i.e. collation) of characters.

I'd be entirely happy with the current collation functionality- if REGEXP would only honor it...

Thanks,
Andreas

Bug#34552 was marked as a duplicate of this bug

This is a related to Bug #1103 and Bug #30635. 
Bug #344552 is a duplicate of this bug.
Since the documentation states that this is a known limitation of MySQL implementation of RLIKE I'm moving it back to "feature request".

Thanks for the update.

To be clear- this bug is NOT about collations as Susanne summarized. This bug is about multi-byte character set support for the regexp library.

It is a feature request- and at the same time a real shame that a DB claiming utf8 support is unable to run regular expressions on it. 

If feature request then it should be one of priority...

This bug is a duplicate for Bug#30241

Would there be any possibility of a work-around by specifying a multi-byte character as individual bytes in the regular expression string?

For example a two-byte UTF8 character to be searched for could be specified in the RegEx string as two single bytes side by side.

Please let me know if anyone can build on this idea or if the approach I'm suggesting is overly-simplistic.

This problem is causing us major problems and I've heard from several others that it's causing them big headaches so any suggestions would be very valuable.

Will