MySQL Bugs: #57731: Latvian collation is wrong

Bug #57731	Latvian collation is wrong
Submitted:	26 Oct 2010 8:08	Modified:	18 Jan 2018 13:16
Reporter:	Kaspars Foigts	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Charsets	Severity:	S4 (Feature request)
Version:	5.1.52	OS:	Any
Assigned to:	Assigned Account	CPU Architecture:	Any
Tags:	character set, collation

Description:
This sorting-related problem has bugged me (and other fellow Latvian developers) for quite some time and there is no explanation on why this should not work. 

In Latvian language there is specific order of characters. For example, 'a', 'ā', 'b', 'c', 'č', etc. MySQL does not get this sorting order right. For example, 'ā' should follow 'a', when sorted ASCending, but MySQL displays them in arbitrary order, as if they were identical (similar) characters. 

This inconsistency applies only to vowels - a, ā, e, ē, i, ī, o, ō, u, ū.

How to repeat:

SET NAMES 'utf8' COLLATE 'utf8_latvian_ci';

DROP TABLE IF EXISTS `ctest`;
CREATE TABLE `ctest` (
  `keyword` varchar(150) CHARACTER SET utf8 COLLATE utf8_latvian_ci DEFAULT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_latvian_ci;

INSERT INTO `ctest` VALUES ('a'), ('ā'), ('a'), ('ā'),
-- ('b'), ('c'), ('č'), ('c'), ('č'), ('d'),
('e'), ('ē'), ('e'), ('ē'),
-- ('f'), ('g'), ('ģ'),('g'), ('ģ'), ('h'),
('i'), ('ī'), ('i'), ('ī'),
-- ('j'), ('k'), ('ķ'), ('k'), ('ķ'), ('l'), ('ļ'), ('l'), ('ļ'), ('m'), ('n'), ('ņ'), ('n'), ('ņ'),
('o'), ('ō'), ('o'), ('ō'), 
-- ('p'), ('r'), ('ŗ'), ('r'), ('ŗ'),  ('s'), ('š'), ('s'), ('š'),  ('t'),
('u'), ('ū'), ('u'), ('ū'), 
-- ('v'), ('z'), ('ž'), ('z'), ('ž')
;

SELECT * FROM `ctest` ORDER BY `keyword`;

Actual output:

a,ā,a,ā,e,ē,e,ē,i,ī,i,ī,o,ō,o,ō,u,ū,u,ū

Expected output:

a,a,ā,ā,e,e,ē,ē,i,i,ī,ī,o,o,ō,ō,u,u,ū,ū

Suggested fix:
Fix collation tables for Latvian?

This is what I've got:

macbook-pro:5.1 openxs$ bin/mysql -uroot test
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 3
Server version: 5.1.52-debug Source distribution

Copyright (c) 2000, 2010, Oracle and/or its affiliates. All rights reserved.
This software comes with ABSOLUTELY NO WARRANTY. This is free software,
and you are welcome to modify and redistribute it under the GPL v2 license

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> SET NAMES 'utf8' COLLATE 'utf8_latvian_ci';
Query OK, 0 rows affected (0.00 sec)

mysql> 
mysql> DROP TABLE IF EXISTS `ctest`;
Query OK, 0 rows affected, 1 warning (0.00 sec)

mysql> CREATE TABLE `ctest` (
    ->   `keyword` varchar(150) CHARACTER SET utf8 COLLATE utf8_latvian_ci DEFAULT NULL
    -> ) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_latvian_ci;
Query OK, 0 rows affected (0.07 sec)

mysql> INSERT INTO `ctest` VALUES ('a'), ('ā'), ('a'), ('ā'),  ('e'), ('ē'), ('e'), ('ē'),  ('i'), ('ī'), ('i'), ('ī'),  ('ņ'), ('n'), ('ņ'), ('o'), ('ō'), ('o'), ('ō'),  ('u'), ('ū'), ('u'), ('ū');
Query OK, 23 rows affected (0.00 sec)
Records: 23  Duplicates: 0  Warnings: 0

mysql> SELECT * FROM `ctest` ORDER BY `keyword`;
+---------+
| keyword |
+---------+
| a       |
| ā      |
| a       |
| ā      |
| e       |
| ē      |
| e       |
| ē      |
| i       |
| ī      |
| i       |
| ī      |
| n       |
| ņ      |
| ņ      |
| o       |
| ō      |
| o       |
| ō      |
| u       |
| ū      |
| u       |
| ū      |
+---------+
23 rows in set (0.00 sec)

This appears to be about a secondary (for ORDER BY)
weighting, and MySQL does only primary weighting (for WHERE etc.).
See also http://forge.mysql.com/worklog/task.php?id=896.

We know that others have had difficulties with
Latvian collation, for example Microsoft
http://blogs.msdn.com/b/michkap/archive/2010/07/13/10037608.aspx

Knowledge of LVS 24:1993? would perhaps be of interest,
but we prefer the Unicode Collation Algorithm and the
Common Locale Data Repository. The relevant lv.xml file
http://www.unicode.org/repos/cldr/tags/release-1-7-1-2/common/collation/lv.xml
doesn't show a primary difference.

I'll mark this as a feature request.

Multi-level collations are on our TODO:
http://forge.mysql.com/worklog/task.php?id=896

In the meanwhile please use this workaround:

mysql> SELECT * FROM `ctest` ORDER BY `keyword`, BINARY keyword;
+---------+
| keyword |
+---------+
| a       | 
| a       | 
| ā       | 
| ā       | 
| e       | 
| e       | 
| ē       | 
| ē       | 
| i       | 
| i       | 
| ī       | 
| ī       | 
| n       | 
| ņ       | 
| ņ       | 
| o       | 
| o       | 
| ō       | 
| ō       | 
| u       | 
| u       | 
| ū       | 
| ū       | 
+---------+
23 rows in set (0.00 sec)

You should note, that Alexander's solution:

*) does not work on variable byte encodings (as utf8);
*) does not use indexes (because of "BINARY `eyword`")

This one works well:

select keyword kw
from ctest 
order by CONVERT(keyword USING ucs2) COLLATE ucs2_bin ASC;

> You should note, that Alexander's solution:
> *) does not work on variable byte encodings (as utf8);

It does work with utf8. I tested with utf8.

> *) does not use indexes (because of "BINARY `eyword`")
That's true, index is not used for ORDER BY optimization in this case.

> This one works well:
> select keyword kw
> from ctest 
> order by CONVERT(keyword USING ucs2) COLLATE ucs2_bin ASC;

There is no need for CONVERT().
It gives exactly the same order with:

select keyword kw
from ctest 
order by keyword COLLATE utf8_bin ASC;

or

select keyword kw
from ctest 
order by BINARY keyword ASC;

But this will sort small and capital letters separately.

The correct query is:

select keyword kw
from ctest 
order by keyword, BINARY keyword;

Sorting binary, does not help for variable length strings:

~~~
DROP TABLE IF EXISTS `ctest`;

CREATE TABLE `ctest` (`keyword` varchar(150) CHARACTER SET utf8 COLLATE utf8_latvian_ci DEFAULT NULL) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_latvian_ci;

INSERT INTO `ctest` (`keyword`) VALUES ('Media'), ('Mobile'), ('Mā'), ('MVAAA'), ('Mēbeļu'), ('Māa'), ('Mās'), ('Marl'), ('Myre');

SELECT * FROM `ctest` ORDER BY `keyword` asc, binary keyword;
~~~

Expected result: 

~~~
Marl
Mā
Māa
Mās
Media
Mēbeļu
Mobile
MVAAA
Myre
~~~

Current result:

~~~
Mā
Māa
Marl
Mās
Mēbeļu
Media
Myre
Mobile
MVAAA
~~~

[8 Jan 2018 19:41] Xing Z Zhang 

Fixed in 8.0.1 with the introduction of utf8mb4_lv_0900_as_cs.

Does it require some more specific configuration for MySQL 8? 
It does not seem to work.

I have tested on MySQL 8.0.12 (Test Online Fiddle 
https://www.db-fiddle.com/f/9phZ7EpRUS4FNRBZRdSYZa/2 ).

Using the same test case except added new LV collationi and utf8mb4 sharacters encoding:

~~~
drop table if exists `ctest`;
CREATE TABLE `ctest`
(
    `keyword` varchar(150)
        CHARACTER
            SET utf8mb4
            COLLATE utf8mb4_lv_0900_as_cs
)
    DEFAULT
        CHARSET = utf8mb4
    COLLATE = utf8mb4_lv_0900_as_cs;

INSERT INTO `ctest` (`keyword`)
VALUES ('Media'),
       ('Mobile'),
       ('Mā'),
       ('MVAAA'),
       ('Mēbeļu'),
       ('Māa'),
       ('Mās'),
       ('Marl'),
       ('Myre');
~~~

Expected result (As Generated by Microsoft Excel 2019 sorting ;) ): 

~~~
Marl
Mā
Māa
Mās
Media
Mēbeļu
Mobile
MVAAA
Myre
~~~

Actual result:

~~~
Mā
Māa
Marl
Mās
Mēbeļu
Media
Myre
Mobile
MVAAA
~~~

Are there any workarounds for this? Provided solutions still aren't sorting correct.