MySQL Bugs: #37413: MySQL does not recognise diacritics (accents) when performing SELECT

Bug #37413	MySQL does not recognise diacritics (accents) when performing SELECT
Submitted:	15 Jun 2008 14:18	Modified:	16 Jun 2008 11:27
Reporter:	Ken DV	Email Updates:
Status:	Not a Bug	Impact on me:	None
Category:	MySQL Server: Charsets	Severity:	S3 (Non-critical)
Version:	5.0.45	OS:	Any
Assigned to:		CPU Architecture:	Any
Tags:	accents, diacritics

Description:
SELECT 'éáú' = 'eau' 

Result is: 1

Is there any way to specify that I do not want diacritics to be considered equals of regular letters?

How to repeat:
Perform: SELECT 'éáú' = 'eau'

I think you should understand COLLATIONS:

See for instance how two different german collations handle UMLAUT characters:
http://dev.mysql.com/doc/refman/5.0/en/charset-we-sets.html (and related paragraphs.

set collation_connection = utf8_general_ci;
SELECT 'éáú' = 'eau'; -- returns '1'
set collation_connection = utf8_unicode_ci;
SELECT 'éáú' = 'eau'; -- returns '0'

set collation_connection = utf8_danish_ci;
SELECT 'éáú' = 'eau'; -- returns '1'

set collation_connection  = utf8_bin;
SELECT 'éáú' = 'eau'; -- returns '0'

set collation_connection = utf8_general_ci;
-- note now that for a specific statement you can collate as you want
-- and ignore collation_connection setting
SELECT 'éáú' = 'eau' collate utf8_bin; -- returns '0' 

I will only comment on the example with utf8_danish_ci.  This is correct behaviour as e is alphabetized like é in Danish.  What is the case with other COLLATIONS designed to match other natural languages I do not know - there might be some where they are alphabetized differently  (but I have pointed out before that COLLLATIONS provided with MySQL are very rudimentary - for instance _cs (case ensitive) collations are largely missing!).

Anyway you can
1) use the " .. collate utf8_bin" clause in string comparisons.
2) create tables (or even colums) with a "COLLATE utf8_bin" clause for those data where no 'natural language rules' but 'byte value rules' should apply.
.. if you do not want to apply any 'natural language' alphabetisation rules. 

Peter
(not a MySQL person)

oops .. 3rd-4th line in the example should read

set collation_connection = utf8_unicode_ci;
SELECT 'éáú' = 'eau'; -- returns '1'

yes ... 

SELECT 'éáú' = 'eau' collate utf8_bin; 
-- or use 'collate latin1_bin' if you have latin1 data

This will use a BINARY collation where no accented and nonaccented characters are considered identical (no matter what is the current charset/collation setting for the connection).  

With a BIANRY collation every character as identified by its byte/octet pattern only. 

Non-binary colllations implement 'natural language rules' for string comparison, filtering, ordering, grouping etc (to the extend that proper collations are available)

Anayway ..  you should use a couple of hours understanding this paragraph
http://dev.mysql.com/doc/refman/5.0/en/charset-charsets.html
.. in the MySQL manual, if you are new to MySQL 4.1 or higher!

or 

set collation_connection = utf8_bin;
SELECT 'éáú' = 'eau'; -- returns '0'

Thank you for the bug report. Please see the several comments form Peter.