Bug #34552 REGEXP fails with multi-byte characters
Submitted: 14 Feb 2008 15:57 Modified: 14 Feb 2008 18:10
Reporter: Andreas Götz
Status: Duplicate
Category: MySQL Server: Charsets Severity: S2 (Serious)
Version: 5.0.54 OS: Any
Assigned to: CPU Architecture: Any

[14 Feb 2008 15:57] Andreas Götz
Description:
I had already reported in bug 34473 that RLIKE fails on multi-byte data. Investigation showed that this is due to the REGEXP engine working on byte values (doh).

The new issue is that the REGEXP engine, independently of byte values, seems to treat literal characters and character classes differently.

How to repeat:
SELECT 'Ø' REGEXP 'Ö';
-> result 0

SELECT 'Ø' REGEXP '[Ö]';
-> result 1

Expected result:
If Ø matches Ö at all for a given collation (even if handled byte-wise), the result should at least be consistent.

Suggested fix:
Upgrade the REGEXP library; the current implementation is not worthy of MySQL's other Unicode capabilities.
[14 Feb 2008 16:35] Andreas Götz
I've just realized that my argument is not sound. Of course, byte-wise, 'ab' (a followed by b) is a different pattern than '[ab]' (either a or b).
Suggested fix is still the same though ;)
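To illustrate the byte-wise reasoning, here is a minimal sketch in Python rather than MySQL. It assumes both strings arrive as UTF-8, and it uses Python's bytes-mode `re` only as a stand-in for a byte-oriented engine; it is not the library MySQL uses. Under those assumptions it reproduces the 0 and 1 results from the report.

```python
import re

# Assumption: the client sends UTF-8. Python's bytes regex stands in for a
# byte-oriented engine; it is not MySQL's actual REGEXP implementation.
subject = "\u00d8".encode("utf-8")  # 'Ø' -> b'\xc3\x98'
pattern = "\u00d6".encode("utf-8")  # 'Ö' -> b'\xc3\x96'

# Literal pattern: both bytes must match in sequence, and 0x96 != 0x98.
literal_match = re.search(pattern, subject) is not None

# Bracket pattern: the two bytes become the set {0xC3, 0x96}. The subject's
# lead byte 0xC3 is in that set, so one byte is enough for a match.
class_match = re.search(b"[" + pattern + b"]", subject) is not None

print(int(literal_match), int(class_match))  # prints: 0 1
```

The shared lead byte 0xC3 is what makes the bracket form match, so under the same assumptions any character whose UTF-8 encoding starts with 0xC3 would match '[Ö]' as well.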
[14 Feb 2008 18:10] Sergei Golubchik
As far as I understand, you agree that this is not a separate bug but a consequence of REGEXP working byte-wise. I'll mark it as a duplicate of bug#34473 then.