MySQL Bugs: #64578: UTF8 is not UTF8 ?

Bug #64578	UTF8 is not UTF8 ?
Submitted:	7 Mar 2012 7:59	Modified:	6 Jul 2012 15:34
Reporter:	Miran Cvenkel
Status:	Not a Bug
Category:	MySQL Server: Charsets	Severity:	S3 (Non-critical)
Version:	5.1.59-community-log, 5.1.63, 5.5.23, 5.6.6	OS:	Any
Assigned to:		CPU Architecture:	Any
Tags:	utf8

Description:
Méloé = Meloe: the two strings compare as equal in a utf8 column.

See how to repeat.

How to repeat:
DROP TABLE IF EXISTS `_tmp`;
CREATE TABLE IF NOT EXISTS `_tmp` (
  `term` varchar(400) COLLATE utf8_slovenian_ci DEFAULT NULL,
  KEY `Index 1` (`term`(333))
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_slovenian_ci;

-- Dumping data for table test._tmp: 730 rows
DELETE FROM `_tmp`;

INSERT INTO `_tmp` (`term`) VALUES
	('Meloe'),
	('Méloé');

SELECT * FROM `_tmp` WHERE term = 'Meloe';

-- You get 2 records, although only 'Meloe' was asked for!

Try instead: 
SELECT * FROM _tmp WHERE term = 'Meloe' COLLATE utf8_bin;

.. and read about collations: 
http://dev.mysql.com/doc/refman/5.1/en/charset.html

Peter
(not a MySQL person)

.. but I cannot judge whether the demonstrated behaviour of the Slovenian collation is correct. What are the common alphabetization rules in Slovenian (in dictionaries, phone books, etc.)? You probably know better than I do!
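
The effect of the COLLATE clause suggested above can be checked without the table. The following is a minimal sketch, assuming a utf8 connection (hence the SET NAMES line, which is not part of the original report); the column aliases are made up for illustration. Going by the behaviour described in this report, the first comparison should be true and the second false.

```sql
-- Assumption: the client sends string literals as utf8.
SET NAMES utf8;

SELECT
  -- Accent-insensitive comparison, same collation as the column in the report.
  'Méloé' = 'Meloe' COLLATE utf8_slovenian_ci AS equal_slovenian_ci,
  -- Byte-wise comparison, as in the suggested workaround.
  'Méloé' = 'Meloe' COLLATE utf8_bin          AS equal_bin;
```

Note that utf8_bin is also case sensitive, so 'meloe' would no longer match 'Meloe' either.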

Thank you for the report.

Please send us the output of SHOW VARIABLES LIKE 'col%'. The collation utf8_slovenian_ci does not have the letter é; see http://www.collation-charts.org/mysql60/mysql604.utf8_slovenian_ci.html

Here it is:

collation_connection,utf8_general_ci
collation_database,utf8_slovenian_ci
collation_server,utf8_slovenian_ci

I must say, I only found 1 unexpected extra record, but someone could accidentally delete records this way. A warning would be appropriate, if possible.
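
One way to guard against the accidental deletion mentioned above is to preview the matches first and then delete with a byte-exact comparison. This is only a sketch against the `_tmp` table from the report, not an official recommendation; the explicit COLLATE may prevent the index on `term` from being used for the lookup.

```sql
-- Preview what the column's own collation (utf8_slovenian_ci) would match.
SELECT term FROM `_tmp` WHERE term = 'Meloe';

-- Delete only the rows whose bytes match the literal exactly.
DELETE FROM `_tmp` WHERE term = 'Meloe' COLLATE utf8_bin;

-- 'Méloé' should still be present after the delete.
SELECT term FROM `_tmp`;
```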

Thank you for the report.

Verified as described.

test case for MTR

Attachment: bug64578.test (application/octet-stream, text), 410 bytes.

This is not a bug.
MySQL utf8_language_ci collations are accent insensitive.
They treat accented letters as equal to their non-accented
counterparts, unless the language rules say otherwise.
"LATIN LETTER E WITH ACUTE" does not have a special
rule in Slovenian (it is not even part of the Slovenian alphabet),
therefore it follows the default rules and compares
as equal to non-accented "LATIN LETTER E".
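
The rule above can be probed with single letters. The sketch below assumes a utf8 connection and uses č as an example of a letter that does belong to the Slovenian alphabet; that choice and the column aliases are illustrative and do not come from the report. If the collation behaves as explained, é should compare as equal to e, while č would be expected to compare as different from c.

```sql
-- Assumption: the client sends string literals as utf8.
SET NAMES utf8;

SELECT
  -- No Slovenian-specific rule for é: falls back to the default, accent-insensitive rule.
  'é' = 'e' COLLATE utf8_slovenian_ci AS e_acute_equals_e,
  -- č is a separate letter of the Slovenian alphabet, so a language rule should apply.
  'č' = 'c' COLLATE utf8_slovenian_ci AS c_caron_equals_c;
```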