MySQL Bugs: #41106: utf8_lithuanian

Bug #41106	utf8_lithuanian_ci has errors
Submitted:	28 Nov 2008 16:15	Modified:	29 Nov 2008 10:41
Reporter:	Steponas K.	Email Updates:
Status:	Not a Bug	Impact on me:	None
Category:	MySQL Server: Charsets	Severity:	S3 (Non-critical)
Version:	5.0 +	OS:	Any (Different symbols are treated as equal)
Assigned to:		CPU Architecture:	Any
Tags:	collation error

Description:
Hello.

There are problems with the MySQL collation "utf8_lithuanian_ci". It treats different (logically and in sound) symbols as equal.

As i have seen in http://www.collation-charts.org/mysql60/mysql604.utf8_lithuanian_ci.html , it treats "i" as the same symbol as "y" - which is totaly different in our language! That's why I was getting constraint violation errors on unique text indexes. And it misses some lithuanian characters too.

How to repeat:
CREATE TABLE IF NOT EXISTS `test` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `chr` varchar(50) CHARACTER SET utf8 COLLATE utf8_lithuanian_ci NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `chr` (`chr`)
);

INSERT INTO `test` (`chr`) VALUES ('aay'), ('aai');

Expected: is inserted successfuly.
Results : Duplicate entry 'aai' for key 'chr' .

Suggested fix:
The full and correct alphabet (and the collation layout) can be found at http://en.wikipedia.org/wiki/Lithuanian_language#Writing_system . Yes, "y" represents a prolonged "i" BUT that does and must not affect the sorting and, by no means, be treated as equal.

If this is not the correct bug system to post this please direct me to the right one. And if this is, I'm asking you to:
    - first of all, make "I" and "Y" and "Į", "i" and "y" and "į", "U" and "Ų" and "Ū", "u" and "ų" and "ū" different symbols. They may sound equal, but can not be treated as equal in PCs,
    - add the missing letters "Ę" and "ę", "Ė" and "ė"
    - remove the letters "Q" and "q", "W" and "w", "X" and "x", "Q" and "q" - they do not exist in our alphabet
    - remove all characters after "C" and "c" (like "Ch", "ch") - they are not characters, but char. combinations that produce a sound. This can sometimes be useful but can not be treated as a norm.

This means, the aplhabet must be exaclty as it is listed in my given Wikipedia link. Any other version is inapropriate.

According to official language bodies in Lithuania (Standards institute & National language comission), i has same sorting weight as y.

I should probably be more verbose.

According to Lithuanian standards, these letters should be treated equally when sorting:

a and ą
e, ę and ė
i, į and y
u, ų and ū

these ones are treated as separate characters:
c and č, s and š, z and ž

Other letters (such as q, x and w) do not interfer with lithuanian language ordering and may follow standard latin order.

I uderstand the problem with the sorting order. If it's a lithuanian standart then the enquiry must go to the lithuaninan institutions.

But what to do with unique indexes? There are only such solutions as writing *workaraunds* that inplement unique index functionalinty without actually adding an unique index on that column. The more logic way would be to use a non-lithuanian collation. Because the comparison operators (=, LIKE) work not as I and most lithuanian programmers would expect them to work.

Well but if it is "according to national rules" then I can't do anything except to accept it as it is. Even if it does not match my "natural order" algorythm.

Thank you, good luck.