Bug #44635 LIKE operator does not distinguish accented and non-accented characters
Submitted: 4 May 2009 8:41 Modified: 4 May 2009 9:17
Reporter: Peter Laursen (Basic Quality Contributor) Email Updates:
Status: Not a Bug Impact on me:
None 
Category:MySQL Server: Charsets Severity:S2 (Serious)
Version:5.1.34 OS:Any
Assigned to: CPU Architecture:Any

[4 May 2009 8:41] Peter Laursen
Description:
LIKE operator does not distinguish accented and non-accented characters.

How to repeat:
DROP TABLE IF EXISTS umlaut; 

CREATE TABLE `umlaut` (                                
          `id` INT(11) NOT NULL AUTO_INCREMENT,                
          `txt` VARCHAR(50) DEFAULT NULL,                      
          PRIMARY KEY (`id`)                                   
        ) ENGINE=INNODB DEFAULT CHARSET=utf8;
        
INSERT INTO umlaut (txt) VALUES ('åbcdef'),('tuvxyz'),('äüöñîè'),('Donauwörth');

SELECT id, txt FROM umlaut WHERE txt LIKE '%z%';
/* returns what is expected

    id  txt   
------  ------
     2  tuvxyz
*/

SELECT id, txt FROM umlaut WHERE txt LIKE '%a%';
/* returns strings with both accented and non-accented 'a'

    id  txt         
------  ------------
     1  åbcdef     
     3  äüöñîè
     4  Donauwörth 
*/

SELECT id, txt FROM umlaut WHERE txt LIKE '%n%';
/* returns strings with both accented and non-accented 'n'.

    id  txt         
------  ------------
     3  äüöñîè
     4  Donauwörth 
*/

Suggested fix:
'a', 'ä' and 'å' are distinct and different characters as are 'n' and 'ñ'. 

They should not be considered aLIKE!
[4 May 2009 8:50] Peter Laursen
Of course if I 

SELECT id, txt FROM umlaut WHERE txt LIKE '%a%' COLLATE utf8_bin;

I get only one row in result.

My settings:

SHOW VARIABLES LIKE '%collation%';

/*
Variable_name         Value          
--------------------  ---------------
collation_connection  utf8_general_ci
collation_database    utf8_general_ci
collation_server      utf8_general_ci
*/

It is in my opinion *definitely* a collation bug with 'a' and 'å' comparison (and not the only time I reported misunderstandings with the Danish 'å' character). 

Maybe the 'a'='ä' and 'n'='ñ' is as expected.
[4 May 2009 9:17] Peter Laursen
OK .. expected behaviour with the general collation!

SELECT 'a' = 'ä' COLLATE utf8_general_ci; -- 1
SELECT 'a' = 'ä' COLLATE utf8_swedish_ci; -- 0

SELECT 'a' = 'å' COLLATE utf8_general_ci; -- 1
SELECT 'a' = 'å' COLLATE utf8_danish_ci; -- 0

SELECT 'n' = 'ñ' COLLATE utf8_general_ci; -- 1
SELECT 'n' = 'ñ' COLLATE utf8_spanish_ci; -- 0