Bug #37413 MySQL does not recognise diacritics (accents) when performing SELECT
Submitted: 15 Jun 2008 14:18 Modified: 16 Jun 2008 11:27
Reporter: Ken DV Email Updates:
Status: Not a Bug Impact on me:
None 
Category:MySQL Server: Charsets Severity:S3 (Non-critical)
Version:5.0.45 OS:Any
Assigned to: CPU Architecture:Any
Tags: accents, diacritics

[15 Jun 2008 14:18] Ken DV
Description:
SELECT 'éáú' = 'eau' 

Result is: 1

Is there any way to specify that I do not want diacritics to be considered equals of regular letters?

How to repeat:
Perform: SELECT 'éáú' = 'eau'
[15 Jun 2008 19:44] Peter Laursen
I think you should understand COLLATIONS:

See for instance how two different german collations handle UMLAUT characters:
http://dev.mysql.com/doc/refman/5.0/en/charset-we-sets.html (and related paragraphs.

set collation_connection = utf8_general_ci;
SELECT 'éáú' = 'eau'; -- returns '1'
set collation_connection = utf8_unicode_ci;
SELECT 'éáú' = 'eau'; -- returns '0'

set collation_connection = utf8_danish_ci;
SELECT 'éáú' = 'eau'; -- returns '1'

set collation_connection  = utf8_bin;
SELECT 'éáú' = 'eau'; -- returns '0'

set collation_connection = utf8_general_ci;
-- note now that for a specific statement you can collate as you want
-- and ignore collation_connection setting
SELECT 'éáú' = 'eau' collate utf8_bin; -- returns '0' 

I will only comment on the example with utf8_danish_ci.  This is correct behaviour as e is alphabetized like é in Danish.  What is the case with other COLLATIONS designed to match other natural languages I do not know - there might be some where they are alphabetized differently  (but I have pointed out before that COLLLATIONS provided with MySQL are very rudimentary - for instance _cs (case ensitive) collations are largely missing!).

Anyway you can
1) use the " .. collate utf8_bin" clause in string comparisons.
2) create tables (or even colums) with a "COLLATE utf8_bin" clause for those data where no 'natural language rules' but 'byte value rules' should apply.
.. if you do not want to apply any 'natural language' alphabetisation rules. 

Peter
(not a MySQL person)
[15 Jun 2008 19:46] Peter Laursen
oops .. 3rd-4th line in the example should read

set collation_connection = utf8_unicode_ci;
SELECT 'éáú' = 'eau'; -- returns '1'
[15 Jun 2008 20:30] Peter Laursen
yes ... 

SELECT 'éáú' = 'eau' collate utf8_bin; 
-- or use 'collate latin1_bin' if you have latin1 data

This will use a BINARY collation where no accented and nonaccented characters are considered identical (no matter what is the current charset/collation setting for the connection).  

With a BIANRY collation every character as identified by its byte/octet pattern only. 

Non-binary colllations implement 'natural language rules' for string comparison, filtering, ordering, grouping etc (to the extend that proper collations are available)

Anayway ..  you should use a couple of hours understanding this paragraph
http://dev.mysql.com/doc/refman/5.0/en/charset-charsets.html
.. in the MySQL manual, if you are new to MySQL 4.1 or higher!
[15 Jun 2008 20:31] Peter Laursen
or 

set collation_connection = utf8_bin;
SELECT 'éáú' = 'eau'; -- returns '0'
[16 Jun 2008 11:27] MySQL Verification Team
Thank you for the bug report. Please see the several comments form Peter.