Bug #64578 UTF8 is not UTF8 ?
Submitted: 7 Mar 2012 7:59
Modified: 6 Jul 2012 15:34
Reporter: Miran Cvenkel
Status: Not a Bug
Category: MySQL Server: Charsets
Severity: S3 (Non-critical)
Version: 5.1.59-community-log, 5.1.63, 5.5.23, 5.6.6
OS: Any
Assigned to:
CPU Architecture: Any
Tags: utf8

[7 Mar 2012 7:59] Miran Cvenkel
Description:
With the utf8_slovenian_ci collation, 'Méloé' compares as equal to 'Meloe' (Méloé = Meloe).

See how to repeat.

How to repeat:
DROP TABLE IF EXISTS `_tmp`;
CREATE TABLE IF NOT EXISTS `_tmp` (
  `term` varchar(400) COLLATE utf8_slovenian_ci DEFAULT NULL,
  KEY `Index 1` (`term`(333))
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_slovenian_ci;

-- Dumping data for table test._tmp
DELETE FROM `_tmp`;

INSERT INTO `_tmp` (`term`) VALUES
	('Meloe'),
	('Méloé');

SELECT * FROM _tmp WHERE term = 'Meloe';

-- You get 2 records: both 'Meloe' and 'Méloé'!
[7 Mar 2012 12:42] Peter Laursen
Try instead: 
SELECT * FROM _tmp WHERE term = 'Meloe' COLLATE utf8_bin;

.. and read about collations: 
http://dev.mysql.com/doc/refman/5.1/en/charset.html

Peter
(not a MySQL person)
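
The difference between the two comparisons can be seen side by side. The following is a minimal sketch against the `_tmp` table from "How to repeat"; it assumes the connection character set is utf8 (otherwise the COLLATE clauses on the string literals are rejected). The column aliases are only illustrative.

```sql
-- Default comparison: uses the column collation (utf8_slovenian_ci),
-- which is accent insensitive, so both rows match.
SELECT term FROM _tmp WHERE term = 'Meloe';

-- Binary comparison: only the row that is byte-for-byte 'Meloe' matches.
SELECT term FROM _tmp WHERE term = 'Meloe' COLLATE utf8_bin;

-- The same comparison on literals, without touching any table.
-- COLLATE binds more tightly than =, so it applies to the right-hand literal.
SELECT 'Méloé' = 'Meloe' COLLATE utf8_slovenian_ci AS ci_equal,
       'Méloé' = 'Meloe' COLLATE utf8_bin          AS bin_equal;
```

Note that utf8_bin is also case sensitive, so 'meloe' would no longer match 'Meloe' either.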
[7 Mar 2012 12:53] Peter Laursen
.. but I am not able to decide if the demonstrated behaviour of the Slovenian collation is correct or not. What are common alphabetization rules in Slovenian (in dictionaries, phone books etc)?  You probably know better than I!
[7 Mar 2012 16:18] Sveta Smirnova
Thank you for the report.

Please send us the output of SHOW VARIABLES LIKE 'col%'. The utf8_slovenian_ci collation does not have the letter é. See http://www.collation-charts.org/mysql60/mysql604.utf8_slovenian_ci.html
[10 Mar 2012 1:49] Miran Cvenkel
Here it is:

collation_connection,utf8_general_ci
collation_database,utf8_slovenian_ci
collation_server,utf8_slovenian_ci
[10 Mar 2012 1:56] Miran Cvenkel
I must say, in my case I only found 1 unexpected extra record, but someone could just as easily delete unexpected records this way. Some warning would be appropriate, if possible.
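
One way to guard against deleting more than intended is to preview the matching rows and then force an exact comparison in the DELETE itself. This is only a sketch against the `_tmp` table above, assuming a utf8 connection character set; it is not something the server does for you.

```sql
-- 1. Preview what the default (accent-insensitive) comparison would match.
SELECT term FROM _tmp WHERE term = 'Meloe';

-- 2. Delete only the exact match. The first condition keeps the comparison
--    in the column's own collation so the index can still be considered;
--    the second condition filters the candidates down to the exact bytes.
DELETE FROM _tmp
WHERE term = 'Meloe'
  AND term = 'Meloe' COLLATE utf8_bin;
```

Using only the `COLLATE utf8_bin` condition gives the same result, but the comparison then uses a collation different from the one the index was built with, so the index may not be used.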
[10 Mar 2012 8:59] Sveta Smirnova
Thank you for the report.

Verified as described.
[10 Mar 2012 8:59] Sveta Smirnova
test case for MTR

Attachment: bug64578.test (application/octet-stream, text), 410 bytes.

[6 Jul 2012 15:33] Alexander Barkov
This is not a bug.
MySQL utf8_language_ci collations are accent insensitive.
They treat accented letters as equal to their non-accented
counterparts, unless the language rules say otherwise.
"LATIN SMALL LETTER E WITH ACUTE" does not have a special
rule in Slovenian (it is not even part of the Slovenian alphabet),
therefore it follows the default rules and compares
as equal to non-accented "LATIN SMALL LETTER E".
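
If exact matching is what is wanted for every query on this column, one option is to give the column itself a binary collation and request Slovenian ordering only where it matters. The following is a sketch for the `_tmp` table from the report, not a recommendation from the MySQL team; whether it suits a real schema depends on how the column is searched and sorted.

```sql
-- Make comparisons on `term` accent and case sensitive by default.
ALTER TABLE _tmp
  MODIFY term varchar(400) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL;

-- Now only the exact row is returned.
SELECT term FROM _tmp WHERE term = 'Meloe';

-- Slovenian ordering can still be requested per query.
SELECT term FROM _tmp ORDER BY term COLLATE utf8_slovenian_ci;
```

The trade-off is the reverse of the one in the report: searches become case sensitive as well, and sorting with an explicit COLLATE clause cannot rely on the index order.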