Bug #63439 REGEXP does not work properly with umlauts
Submitted: 26 Nov 2011 17:19 Modified: 28 Nov 2011 8:35
Reporter: mik gigs Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: DML Severity:S3 (Non-critical)
Version:5.5.8, 5.1.60 OS:Any (Windows 7, Mac OS X)
Assigned to: CPU Architecture:Any

[26 Nov 2011 17:19] mik gigs
Description:
This
SELECT 'ääliö' REGEXP '[aä]{1}[aä]{1}li[oö]{1}';
gives
result 1

This
SELECT 'ääliö' REGEXP '^[aä]{1}[aä]{1}li[oö]{1}$';
gives 
result 0 ???? strange

BUT 

this

SELECT 'aalio' REGEXP '^[aä]{1}[aä]{1}li[oö]{1}$';
gives result 1

and this
SELECT 'aalio' REGEXP '[aä]{1}[aä]{1}li[oö]{1}';
also gives result 1

is this normal behavior when adding $ and ^ to strings containing umlauts ?
all field, databases, etc are in unicode UTF8 format

How to repeat:
SELECT 'ääliö' REGEXP '^[aä]{1}[aä]{1}li[oö]{1}$';

Suggested fix:
should be checked in code of REGEXP possible problems with umlauts
[26 Nov 2011 19:58] Peter Laursen
Interesting!  I think this is test case enough (same result on 5.0.90, 5.1.60 and 5.5.18):

SET NAMES utf8;

-- ascii
SELECT 'a' REGEXP '^[a]'; -- 1
SELECT 'a' REGEXP '^[a]$'; -- 1

-- umlauts
SELECT 'ä' REGEXP '^[ä]'; -- 1
SELECT 'ä' REGEXP '^[ä]$'; -- 0

-- Danish special letters
SELECT 'å' REGEXP '^[å]'; -- 1
SELECT 'å' REGEXP '^[å]$'; -- 0

-- Spanish ñ and Czech/Slovak ž not affected
SELECT 'ñ' REGEXP '^[ñ]$'; -- 1
SELECT 'ž' REGEXP '^[ž]$';-- 1  

SET CHARACTER_SET_CONNECTION = latin1;
SELECT 'ä' REGEXP '^[ä]$'; -- 1
SELECT 'å' REGEXP '^[å]$'; -- 1
-- oops?? Weird IMHO!

Peter
(not a MySQL person)
[26 Nov 2011 20:07] Valeriy Kravchuk
Verified with 5.1.60 on Mac OS X:

macbook-pro:5.1 openxs$ bin/mysql -uroot test
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 5
Server version: 5.1.60 Source distribution

Copyright (c) 2000, 2011, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> set names utf8;
Query OK, 0 rows affected (0.00 sec)

mysql> SELECT 'ä' REGEXP '^[ä]'; 
+---------------------+
| 'ä' REGEXP '^[ä]' |
+---------------------+
|                   1 |
+---------------------+
1 row in set (0.00 sec)

mysql> SELECT 'ä' REGEXP '^[ä]$';
+----------------------+
| 'ä' REGEXP '^[ä]$' |
+----------------------+
|                    0 |
+----------------------+
1 row in set (0.00 sec)

mysql> SET CHARACTER_SET_CONNECTION = latin1;
Query OK, 0 rows affected (0.00 sec)

mysql> SELECT 'ä' REGEXP '^[ä]$';
+----------------------+
| 'ä' REGEXP '^[ä]$' |
+----------------------+
|                    1 |
+----------------------+
1 row in set (0.00 sec)

...
[26 Nov 2011 20:16] mik gigs
Valeriy this is not a solution.
Unicode characters means unicode. 
I think you could understand me, that the world uses not only latin characters
but also cyrillic and etc.
My dbs are all in unicode, changes in start and ending symbols should not change the behavior of regexp
[26 Nov 2011 20:55] Peter Laursen
@mik .. I think you don't understand.

Valeriy *verified* the bug. That is what his job is as a MySQL supporter. Next MySQL developers will have to *fix* the bug (and/or the MySQL documentation team will have to *document* the behaviour at least). 

That is the workflow between different MySQL teams in Oracle organization - and also was like that in the time of MySQL AB and Sun. 

So this is in progress now - simply!  Valeriy played the ball to the developers and they are sweating now (I hope)!
[28 Nov 2011 8:35] mik gigs
Sure, guys. I was little bit confused by that, because infact in my case i utilized changes in codings to process my data(in same manner as Valeriy tested it).
Have a good day.
[9 Nov 2016 14:00] Micha Heigl
This goes even further:

select "aä" regexp "a."   => returns 1
select "äa" regexp ".a"   => returns 1
select "aäa" regexp "a.a" => returns 0

So as soon as an umlaut is used somewhere in the middle the regex misses it. 

Tried this with 5.7.16 I can't believe this bug is still unresolved since 5 years.