Bug #57731 Latvian collation is wrong
Submitted: 26 Oct 2010 8:08 Modified: 18 Jan 2018 13:16
Reporter: Kaspars Foigts Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Charsets Severity:S4 (Feature request)
Version:5.1.52 OS:Any
Assigned to: Assigned Account CPU Architecture:Any
Tags: character set, collation

[26 Oct 2010 8:08] Kaspars Foigts
Description:
This sorting-related problem has bugged me (and other fellow Latvian developers) for quite some time and there is no explanation on why this should not work. 

In Latvian language there is specific order of characters. For example, 'a', 'ā', 'b', 'c', 'č', etc. MySQL does not get this sorting order right. For example, 'ā' should follow 'a', when sorted ASCending, but MySQL displays them in arbitrary order, as if they were identical (similar) characters. 

This inconsistency applies only to vowels - a, ā, e, ē, i, ī, o, ō, u, ū.

How to repeat:

SET NAMES 'utf8' COLLATE 'utf8_latvian_ci';

DROP TABLE IF EXISTS `ctest`;
CREATE TABLE `ctest` (
  `keyword` varchar(150) CHARACTER SET utf8 COLLATE utf8_latvian_ci DEFAULT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_latvian_ci;

INSERT INTO `ctest` VALUES ('a'), ('ā'), ('a'), ('ā'),
-- ('b'), ('c'), ('č'), ('c'), ('č'), ('d'),
('e'), ('ē'), ('e'), ('ē'),
-- ('f'), ('g'), ('ģ'),('g'), ('ģ'), ('h'),
('i'), ('ī'), ('i'), ('ī'),
-- ('j'), ('k'), ('ķ'), ('k'), ('ķ'), ('l'), ('ļ'), ('l'), ('ļ'), ('m'), ('n'), ('ņ'), ('n'), ('ņ'),
('o'), ('ō'), ('o'), ('ō'), 
-- ('p'), ('r'), ('ŗ'), ('r'), ('ŗ'),  ('s'), ('š'), ('s'), ('š'),  ('t'),
('u'), ('ū'), ('u'), ('ū'), 
-- ('v'), ('z'), ('ž'), ('z'), ('ž')
;

SELECT * FROM `ctest` ORDER BY `keyword`;

Actual output:

a,ā,a,ā,e,ē,e,ē,i,ī,i,ī,o,ō,o,ō,u,ū,u,ū

Expected output:

a,a,ā,ā,e,e,ē,ē,i,i,ī,ī,o,o,ō,ō,u,u,ū,ū

Suggested fix:
Fix collation tables for Latvian?
[27 Oct 2010 12:01] Valeriy Kravchuk
This is what I've got:

macbook-pro:5.1 openxs$ bin/mysql -uroot test
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 3
Server version: 5.1.52-debug Source distribution

Copyright (c) 2000, 2010, Oracle and/or its affiliates. All rights reserved.
This software comes with ABSOLUTELY NO WARRANTY. This is free software,
and you are welcome to modify and redistribute it under the GPL v2 license

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> SET NAMES 'utf8' COLLATE 'utf8_latvian_ci';
Query OK, 0 rows affected (0.00 sec)

mysql> 
mysql> DROP TABLE IF EXISTS `ctest`;
Query OK, 0 rows affected, 1 warning (0.00 sec)

mysql> CREATE TABLE `ctest` (
    ->   `keyword` varchar(150) CHARACTER SET utf8 COLLATE utf8_latvian_ci DEFAULT NULL
    -> ) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_latvian_ci;
Query OK, 0 rows affected (0.07 sec)

mysql> INSERT INTO `ctest` VALUES ('a'), ('ā'), ('a'), ('ā'),  ('e'), ('ē'), ('e'), ('ē'),  ('i'), ('ī'), ('i'), ('ī'),  ('ņ'), ('n'), ('ņ'), ('o'), ('ō'), ('o'), ('ō'),  ('u'), ('ū'), ('u'), ('ū');
Query OK, 23 rows affected (0.00 sec)
Records: 23  Duplicates: 0  Warnings: 0

mysql> SELECT * FROM `ctest` ORDER BY `keyword`;
+---------+
| keyword |
+---------+
| a       |
| ā      |
| a       |
| ā      |
| e       |
| ē      |
| e       |
| ē      |
| i       |
| ī      |
| i       |
| ī      |
| n       |
| ņ      |
| ņ      |
| o       |
| ō      |
| o       |
| ō      |
| u       |
| ū      |
| u       |
| ū      |
+---------+
23 rows in set (0.00 sec)
[27 Oct 2010 18:18] Peter Gulutzan
This appears to be about a secondary (for ORDER BY)
weighting, and MySQL does only primary weighting (for WHERE etc.).
See also http://forge.mysql.com/worklog/task.php?id=896.

We know that others have had difficulties with
Latvian collation, for example Microsoft
http://blogs.msdn.com/b/michkap/archive/2010/07/13/10037608.aspx

Knowledge of LVS 24:1993? would perhaps be of interest,
but we prefer the Unicode Collation Algorithm and the
Common Locale Data Repository. The relevant lv.xml file
http://www.unicode.org/repos/cldr/tags/release-1-7-1-2/common/collation/lv.xml
doesn't show a primary difference.

I'll mark this as a feature request.
[28 Oct 2010 11:21] Alexander Barkov
Multi-level collations are on our TODO:
http://forge.mysql.com/worklog/task.php?id=896

In the meanwhile please use this workaround:

mysql> SELECT * FROM `ctest` ORDER BY `keyword`, BINARY keyword;
+---------+
| keyword |
+---------+
| a       | 
| a       | 
| ā       | 
| ā       | 
| e       | 
| e       | 
| ē       | 
| ē       | 
| i       | 
| i       | 
| ī       | 
| ī       | 
| n       | 
| ņ       | 
| ņ       | 
| o       | 
| o       | 
| ō       | 
| ō       | 
| u       | 
| u       | 
| ū       | 
| ū       | 
+---------+
23 rows in set (0.00 sec)
[28 Oct 2010 11:56] Kaspars Foigts
You should note, that Alexander's solution:

*) does not work on variable byte encodings (as utf8);
*) does not use indexes (because of "BINARY `eyword`")
[28 Oct 2010 12:04] Kaspars Foigts
This one works well:

select keyword kw
from ctest 
order by CONVERT(keyword USING ucs2) COLLATE ucs2_bin ASC;
[28 Oct 2010 13:46] Alexander Barkov
> You should note, that Alexander's solution:
> *) does not work on variable byte encodings (as utf8);

It does work with utf8. I tested with utf8.

> *) does not use indexes (because of "BINARY `eyword`")
That's true, index is not used for ORDER BY optimization in this case.

> This one works well:
> select keyword kw
> from ctest 
> order by CONVERT(keyword USING ucs2) COLLATE ucs2_bin ASC;

There is no need for CONVERT().
It gives exactly the same order with:

select keyword kw
from ctest 
order by keyword COLLATE utf8_bin ASC;

or

select keyword kw
from ctest 
order by BINARY keyword ASC;

But this will sort small and capital letters separately.

The correct query is:

select keyword kw
from ctest 
order by keyword, BINARY keyword;
[3 Feb 2016 14:10] Arnis Juraga
Sorting binary, does not help for variable length strings:

~~~
DROP TABLE IF EXISTS `ctest`;

CREATE TABLE `ctest` (`keyword` varchar(150) CHARACTER SET utf8 COLLATE utf8_latvian_ci DEFAULT NULL) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_latvian_ci;

INSERT INTO `ctest` (`keyword`) VALUES ('Media'), ('Mobile'), ('Mā'), ('MVAAA'), ('Mēbeļu'), ('Māa'), ('Mās'), ('Marl'), ('Myre');

SELECT * FROM `ctest` ORDER BY `keyword` asc, binary keyword;
~~~

Expected result: 

~~~
Marl
Mā
Māa
Mās
Media
Mēbeļu
Mobile
MVAAA
Myre
~~~

Current result:

~~~
Mā
Māa
Marl
Mās
Mēbeļu
Media
Myre
Mobile
MVAAA
~~~
[18 Jan 2018 13:16] Erlend Dahl
[8 Jan 2018 19:41] Xing Z Zhang 

Fixed in 8.0.1 with the introduction of utf8mb4_lv_0900_as_cs.
[24 Jun 2019 19:00] Arnis Juraga
Does it require some more specific configuration for MySQL 8? 
It does not seem to work.

I have tested on MySQL 8.0.12 (Test Online Fiddle 
https://www.db-fiddle.com/f/9phZ7EpRUS4FNRBZRdSYZa/2 ).

Using the same test case except added new LV collationi and utf8mb4 sharacters encoding:

~~~
drop table if exists `ctest`;
CREATE TABLE `ctest`
(
    `keyword` varchar(150)
        CHARACTER
            SET utf8mb4
            COLLATE utf8mb4_lv_0900_as_cs
)
    DEFAULT
        CHARSET = utf8mb4
    COLLATE = utf8mb4_lv_0900_as_cs;

INSERT INTO `ctest` (`keyword`)
VALUES ('Media'),
       ('Mobile'),
       ('Mā'),
       ('MVAAA'),
       ('Mēbeļu'),
       ('Māa'),
       ('Mās'),
       ('Marl'),
       ('Myre');
~~~

Expected result (As Generated by Microsoft Excel 2019 sorting ;) ): 

~~~
Marl
Mā
Māa
Mās
Media
Mēbeļu
Mobile
MVAAA
Myre
~~~

Actual result:

~~~
Mā
Māa
Marl
Mās
Mēbeļu
Media
Myre
Mobile
MVAAA
~~~
[26 Sep 7:52] Mārtiņš Ručevskis
Are there any workarounds for this? Provided solutions still aren't sorting correct.