MySQL Bugs: #21581: unique key not properly working with utf8_*

Bug #21581	unique key not properly working with utf8_*_ci
Submitted:	11 Aug 2006 12:22	Modified:	22 Sep 2006 8:10
Reporter:	Justinas Liubinskas	Email Updates:
Status:	Not a Bug	Impact on me:	None
Category:	MySQL Server	Severity:	S4 (Feature request)
Version:	5.0.24	OS:	FreeBSD (freebsd5.4)
Assigned to:		CPU Architecture:	Any
Tags:	collation, lithuanian, unique key, utf8

Description:
I have some tables with unique keys, where I use my native language, lithuanian symbols. field is varchar with utf8_lithuanian_ci collation and unique key. I have one row with 'u' value for the field. When I try to put field with value 'ų', it returns error "duplicate key". However, it works fine with workaround, when I change collation to utf8_bin. It doesn't work with either utf8_general_ci or utf8_unicode_ci. Lithuanians should well understand the logical meaning of this bug. In practice, currently I have two systems that use such mysql tables with lithuanian symbols and unique index, and I have some discomfort about that. I think utf8_lithuanian_ci should also work as fine, as utf8 unique index.

How to repeat:
CREATE TABLE `test` (
  `test` varchar(23) collate utf8_lithuanian_ci NOT NULL,
  UNIQUE KEY `test` (`test`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_lithuanian_ci;

-- 
-- Sukurta duomenų kopija lentelei `test`
-- 

INSERT INTO `test` VALUES ('u');
INSERT INTO `test` VALUES ('ų');

Suggested fix:
I know, that utf8_lithuanian_ci collation tells mysql, that 'u' and 'ų' has to be sorted equally. But in fact, these symbols are different letters, and their meaning to words or sentences are separate. Simply, if the sorting coeficient is equal, it shouldn't be used in unique indexing.

Thank you for the bug report. I consulted the developer responsible
for the character stuff, below his comments:

According to "The State Commition of the Lithianian Language" site:
http://www.vlkk.lt/lithuanian-language/alphabet.htm
only "C WITH CARON", "S WITH CARON" and "Z WITH CARON" are separate
letters in Lithuanian. Other accented letters are a equal
to their non-accented counterparts. Including the letter "U WITH OGONEK"
which is equal to "U".

utf8_lithuanian_ci is an accent insensitive collation.
This is how it was designed. There are no bugs here.

It looks like a feature request.
The reporter seems to want a new collation:
accent sensitive Lithuanian collation for utf8.

Also this bug report might related to bug:
http://bugs.mysql.com/bug.php?id=20247

It isn't so indeed. 
Looking at the same link supplied by You, I can see that Lihuanian letters are sorted not equally. There isn't mentioned about equality of characters You mentuoned.

Miguel Solorzano is partly correct, saying that 'i' and 'į' or 'y' are pronounced similar, that is it would be true to say, that collation is accent-insensitive. However, I then would bring in a question, if collation and unique indexing should cover it, because: example 'siusti' and 'siųsti' are existing lithuanian words with different meaning, 'klausimas' and 'klausymas' also has different meaning, and main difference is that, 'ų' and 'y' are pronounced longer than their equivalents 'u' and 'i'.

This can be implemented only as utf8_lithuanian2_ci, it shouldn't modify utf8_lithuanian_ci behavior, which defines proper dictionary sorting, as defined by VLKK (national language authority)

Not a bug: According to LST 1285:1993 ("Pecularities of the lithuanian language") accented vowels in sorting are to be treated same as their non-accented counterparts. 

binary collations or data types, combined with canonical forms in database may allow to have intended effect - unique constraints based on non-standard ordering schemes.