Bug #115368 | wrong sorting - collation utf8mb4_slovak_ci | ||
---|---|---|---|
Submitted: | 18 Jun 2024 12:59 | Modified: | 19 Jun 2024 9:44 |
Reporter: | Miroslav Komorný | Email Updates: | |
Status: | Not a Bug | Impact on me: | |
Category: | MySQL Server | Severity: | S3 (Non-critical) |
Version: | 8.0.37 | OS: | Ubuntu (22.04) |
Assigned to: | | CPU Architecture: | Any |
Tags: | collation, utf8mb4_slovak_ci | | |
[18 Jun 2024 12:59]
Miroslav Komorný
[18 Jun 2024 13:30]
MySQL Verification Team
Hi Mr. Komorný, Thank you very much for your bug report. However, let us inform you that this is a forum for bugs with repeatable test cases. Each test case should contain a set of SQL statements that always lead to the bug that is reported. We have not received such a report from you. This is not a forum for asking questions. Furthermore, character sets and collations are not defined by MySQL, but by Unicode and SQL standards. Hence, we cannot change them. This is not a bug.
[18 Jun 2024 14:01]
Miroslav Komorný
Ok, I didn't know how detailed the report should be. Here's how to reproduce the wrong sorting:

CREATE TABLE `list_of_names` (
  `name` varchar(255) COLLATE utf8mb4_slovak_ci,
  KEY `nazov` (`name`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_slovak_ci;

insert into list_of_names (name) values ("abčd - 1");
insert into list_of_names (name) values ("abce - 2");
insert into list_of_names (name) values ("abšd - 3");
insert into list_of_names (name) values ("abse - 4");
insert into list_of_names (name) values ("abžd - 5");
insert into list_of_names (name) values ("abze - 6");

select name from list_of_names order by name;

I don't know if you or some other standardization institution is responsible for this error. So where do I report a missort in mysql?
[18 Jun 2024 14:13]
MySQL Verification Team
Hi,

The sorting is 100 % compatible with Unicode standard:

+-----------+
| name      |
+-----------+
| abce - 2  |
| abčd - 1  |
| abse - 4  |
| abšd - 3  |
| abze - 6  |
| abžd - 5  |
+-----------+

You should report this to the Slovak branch of the Unicode committee. However, it seems to us that sorting is exactly as defined in Unicode ...

If you have any better and more respected sources for the Unicode sorting of the Slovak alphabet, please provide the valid URL for it.

Otherwise: Not a bug.
[18 Jun 2024 14:27]
Miroslav Komorný
The sorting is in logical contradiction, e.g. with proper sorting:

insert into list_of_names (name) values ("adíd - 7");
insert into list_of_names (name) values ("adie - 8");

I will therefore look for a way to achieve the correct sorting on the Slovak unicode branch (if there is one).
[18 Jun 2024 19:58]
Bernt Marius Johnsen
According to CLDR (Common Locale Data Repository), UCA (Unicode Collation Algorithm) for Unicode 9.0.0, which is the base of utf8mb4_sk_0900_ai_ci and which behaves identically to utf8mb4_slovak_ci, the order is defined as

<collation type="standard" references="Aliberto Caforio: Slovensko-anglický slovník ISBN 80-967744-8-4">
  <cr><![CDATA[
    &A<ä<<<Ä
    &C<č<<<Č
    &H<ch<<<cH<<<Ch<<<CH
    &O<ô<<<Ô
    &R<ř<<<Ř
    &S<š<<<Š
    &Z<ž<<<Ž
  ]]></cr>
</collation>

How to interpret this is defined in https://unicode.org/reports/tr35/tr35-collation.html#Orderings, but it means that e.g. 'C' (and 'c') comes before (and is not equal to) 'č' in UCA PRIMARY collation for Slovak, which is the same as MySQL utf8mb4_sk_0900_ai_ci. 'i', on the other hand, is equal to 'í' in the same collations. So it IS logical and correct as far as I am able to interpret the specifications.
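The reading above can be checked directly by comparing single letters under the Slovak collation. This is a minimal sketch, assuming a MySQL 8.0 server and a client that sends UTF-8; the `_utf8mb4` introducers and the column aliases are additions for illustration and do not appear in this report. No output for these statements was recorded in the thread.

```sql
-- 'č' is tailored as its own letter after 'c' in the Slovak rules,
-- so under the interpretation above this comparison should be false.
SELECT _utf8mb4'c' = _utf8mb4'č' COLLATE utf8mb4_sk_0900_ai_ci AS c_equals_c_caron;

-- 'í' has no entry in the Slovak tailoring, so an accent-insensitive
-- collation should treat it as equal to 'i'.
SELECT _utf8mb4'i' = _utf8mb4'í' COLLATE utf8mb4_sk_0900_ai_ci AS i_equals_i_acute;
```

If the two results differ, that explains the reported order: "abce" sorts before "abčd" because the third letter already decides the comparison, while "adíd" and "adie" are decided only by the fourth letter.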
[19 Jun 2024 8:54]
Bernt Marius Johnsen
To elaborate a bit further:

1. utf8mb4_slovak_ci is not recommended. It does not collate correctly for characters outside the BMP (code points > 0xFFFF). You should use utf8mb4_sk_0900_ai_ci instead.

2. The order you are looking for seems to be utf8mb4_0900_ai_ci:

mysql> select * from list_of_names order by name collate utf8mb4_0900_ai_ci;
+-----------+
| name      |
+-----------+
| abčd - 1  |
| abce - 2  |
| abšd - 3  |
| abse - 4  |
| abžd - 5  |
| abze - 6  |
| adíd - 7  |
| adie - 8  |
+-----------+
8 rows in set (0.00 sec)
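The two suggestions above could be applied to the test table roughly as follows. This is a sketch only, assuming the `list_of_names` table from the reproduction steps; which of the two ALTER statements fits depends on whether the Slovak tailoring is wanted at all, so only one of them would normally be run.

```sql
-- See which Slovak-specific utf8mb4 collations the server offers.
SHOW COLLATION LIKE 'utf8mb4_sk_%';
SHOW COLLATION LIKE 'utf8mb4_slovak%';

-- Option A: keep the Slovak tailoring, but on the UCA 9.0.0 based
-- collation recommended in point 1.
ALTER TABLE list_of_names
  MODIFY `name` varchar(255) COLLATE utf8mb4_sk_0900_ai_ci;

-- Option B: drop the Slovak tailoring and use the default order from
-- point 2, where accented and unaccented letters compare as equal.
ALTER TABLE list_of_names
  MODIFY `name` varchar(255) COLLATE utf8mb4_0900_ai_ci;
```

Changing the column collation also rebuilds the `nazov` index, whereas the `ORDER BY ... COLLATE` form shown in point 2 leaves the table unchanged and applies the collation to a single query.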
[19 Jun 2024 9:44]
Miroslav Komorný
Yes! Collation utf8mb4_sk_0900_ai_ci appears to be consistent. I do not yet know how it is according to the Slovak Institute of Linguistics (I am currently checking), but the collation utf8mb4_0900_ai_ci is what I expected from utf8mb4_slovak_ci. Thanks