Bug #63439 | REGEXP does not work properly with umlauts | ||
---|---|---|---|
Submitted: | 26 Nov 2011 17:19 | Modified: | 28 Nov 2011 8:35 |
Reporter: | mik gigs | Email Updates: | |
Status: | Verified | Impact on me: | |
Category: | MySQL Server: DML | Severity: | S3 (Non-critical) |
Version: | 5.5.8, 5.1.60 | OS: | Any (Windows 7, Mac OS X) |
Assigned to: | CPU Architecture: | Any |
[26 Nov 2011 17:19]
mik gigs
[26 Nov 2011 19:58]
Peter Laursen
Interesting! I think this is test case enough (same result on 5.0.90, 5.1.60 and 5.5.18): SET NAMES utf8; -- ascii SELECT 'a' REGEXP '^[a]'; -- 1 SELECT 'a' REGEXP '^[a]$'; -- 1 -- umlauts SELECT 'ä' REGEXP '^[ä]'; -- 1 SELECT 'ä' REGEXP '^[ä]$'; -- 0 -- Danish special letters SELECT 'å' REGEXP '^[å]'; -- 1 SELECT 'å' REGEXP '^[å]$'; -- 0 -- Spanish ñ and Czech/Slovak ž not affected SELECT 'ñ' REGEXP '^[ñ]$'; -- 1 SELECT 'ž' REGEXP '^[ž]$';-- 1 SET CHARACTER_SET_CONNECTION = latin1; SELECT 'ä' REGEXP '^[ä]$'; -- 1 SELECT 'å' REGEXP '^[å]$'; -- 1 -- oops?? Weird IMHO! Peter (not a MySQL person)
[26 Nov 2011 20:07]
Valeriy Kravchuk
Verified with 5.1.60 on Mac OS X: macbook-pro:5.1 openxs$ bin/mysql -uroot test Reading table information for completion of table and column names You can turn off this feature to get a quicker startup with -A Welcome to the MySQL monitor. Commands end with ; or \g. Your MySQL connection id is 5 Server version: 5.1.60 Source distribution Copyright (c) 2000, 2011, Oracle and/or its affiliates. All rights reserved. Oracle is a registered trademark of Oracle Corporation and/or its affiliates. Other names may be trademarks of their respective owners. Type 'help;' or '\h' for help. Type '\c' to clear the current input statement. mysql> set names utf8; Query OK, 0 rows affected (0.00 sec) mysql> SELECT 'ä' REGEXP '^[ä]'; +---------------------+ | 'ä' REGEXP '^[ä]' | +---------------------+ | 1 | +---------------------+ 1 row in set (0.00 sec) mysql> SELECT 'ä' REGEXP '^[ä]$'; +----------------------+ | 'ä' REGEXP '^[ä]$' | +----------------------+ | 0 | +----------------------+ 1 row in set (0.00 sec) mysql> SET CHARACTER_SET_CONNECTION = latin1; Query OK, 0 rows affected (0.00 sec) mysql> SELECT 'ä' REGEXP '^[ä]$'; +----------------------+ | 'ä' REGEXP '^[ä]$' | +----------------------+ | 1 | +----------------------+ 1 row in set (0.00 sec) ...
[26 Nov 2011 20:16]
mik gigs
Valeriy this is not a solution. Unicode characters means unicode. I think you could understand me, that the world uses not only latin characters but also cyrillic and etc. My dbs are all in unicode, changes in start and ending symbols should not change the behavior of regexp
[26 Nov 2011 20:55]
Peter Laursen
@mik .. I think you don't understand. Valeriy *verified* the bug. That is what his job is as a MySQL supporter. Next MySQL developers will have to *fix* the bug (and/or the MySQL documentation team will have to *document* the behaviour at least). That is the workflow between different MySQL teams in Oracle organization - and also was like that in the time of MySQL AB and Sun. So this is in progress now - simply! Valeriy played the ball to the developers and they are sweating now (I hope)!
[28 Nov 2011 8:35]
mik gigs
Sure, guys. I was little bit confused by that, because infact in my case i utilized changes in codings to process my data(in same manner as Valeriy tested it). Have a good day.
[9 Nov 2016 14:00]
Micha Heigl
This goes even further: select "aä" regexp "a." => returns 1 select "äa" regexp ".a" => returns 1 select "aäa" regexp "a.a" => returns 0 So as soon as an umlaut is used somewhere in the middle the regex misses it. Tried this with 5.7.16 I can't believe this bug is still unresolved since 5 years.