Bug #44776 utf8_slovak_ci flawed - incorrect word ordering is returned
Submitted: 11 May 2009 8:06 Modified: 12 May 2009 9:28
Reporter: Marek Hyčko
Status: Verified
Category:MySQL Server: Charsets Severity:S4 (Feature request)
Version:5.x, 6.x OS:Any
Assigned to: Assigned Account CPU Architecture:Any
Tags: utf8_slovak_ci incorrect ordering ORDER BY
Triage: Needs Triage: D5 (Feature request)

[11 May 2009 8:06] Marek Hyčko
Description:
The Slovak alphabet consists of these characters, in this order:

a, á, ä, b, c, č, d, ď, (dz), (dž), e, é, f, g, h, (ch), i, í, j, k, l, ĺ, ľ, m, n, ň, o, ó, ô, p, q, r, ŕ, s, š, t, ť, u, ú, v, w, x, y, ý, z, ž 

For reference, please see 
http://sk.wikipedia.org/wiki/Slovensk%C3%A1_abeceda. 
(It is in Slovak, but letter order should be clear.)
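
To make the requested ordering concrete, here is a minimal Python sketch of a sort key built directly from the alphabet listed above. It is only an illustration, not how MySQL or ICU implement collations. The names `SLOVAK_ALPHABET`, `RANK`, `DIGRAPHS` and `slovak_key` are made up for this example, and it assumes precomposed characters (e.g. ď as a single code point) and that every occurrence of dz, dž and ch is one letter.

```python
# Alphabet exactly as listed in the report; dz, dž and ch count as single letters.
SLOVAK_ALPHABET = [
    "a", "á", "ä", "b", "c", "č", "d", "ď", "dz", "dž", "e", "é", "f", "g",
    "h", "ch", "i", "í", "j", "k", "l", "ĺ", "ľ", "m", "n", "ň", "o", "ó",
    "ô", "p", "q", "r", "ŕ", "s", "š", "t", "ť", "u", "ú", "v", "w", "x",
    "y", "ý", "z", "ž",
]
RANK = {letter: i for i, letter in enumerate(SLOVAK_ALPHABET)}
DIGRAPHS = ("dz", "dž", "ch")


def slovak_key(word):
    """Return a list of letter ranks, treating dz, dž and ch as one letter."""
    text = word.lower()  # case-insensitive, as "ci" suggests
    key = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in DIGRAPHS:
            key.append(RANK[pair])
            i += 2
        else:
            # Assumption: characters outside the alphabet sort after all letters.
            key.append(RANK.get(text[i], len(SLOVAK_ALPHABET)))
            i += 1
    return key


words = ["ďa", "dá", "chata", "hora", "cena", "čaj"]
print(sorted(words, key=slovak_key))
```

With this key, dá sorts before ďa (because d < ď), and chata sorts after hora (because ch follows h).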

For some reason, utf8_slovak_ci (according to http://www.collation-charts.org/mysql60/mysql604.utf8_slovak_ci.html) treats some letters as equal when ordering, e.g. d = ď, which causes wrong results from the ORDER BY clause in SQL. I do not understand why some letters are distinguished, e.g. c != č, and others are not (the previously mentioned d and ď). 

It would be good to have it implemented as a full ordering according to the specific language, not just a partial order that unifies certain letters (something like the Slovenian collation: http://www.collation-charts.org/mysql60/mysql604.utf8_slovenian_ci.html). It would be more logical for ci to mean only case-insensitive, without unifying other letters.

How to repeat:
Into a text column with the utf8_slovak_ci collation, insert two records:

ďa, dá

Then select them with ORDER BY on this column. The result is ďa, dá, but the correct order is dá, ďa (because d < ď).

Another problem: SELECT ... `col` LIKE 'da' returns both rows, but it should not return either of them.

Tested in version 5.0.51a-24+lenny1 ..., but according to the collation definitions in 6.0.4, the problem persists.
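
The two symptoms above can be reproduced with a toy model in Python. This is a sketch under stated assumptions, not MySQL's actual comparison code. The hypothetical table `FOLD` contains only the equivalences the report implies (ď treated as d, and á treated as a, since LIKE 'da' matches both rows), and `folded` is an illustrative helper.

```python
# Hypothetical folding table: only the equivalences implied by the report.
FOLD = {"ď": "d", "á": "a"}


def folded(word):
    """Map each character to its class representative, ignoring case."""
    return "".join(FOLD.get(ch, ch) for ch in word.lower())


rows = ["ďa", "dá"]  # insertion order from the report

# Both rows collapse to the same key, so the collation cannot tell them apart.
print([folded(w) for w in rows])

# Python's sort is stable, so rows with equal keys keep their insertion order.
print(sorted(rows, key=folded))

# An equality-style match against 'da' therefore returns both rows.
print([w for w in rows if folded(w) == folded("da")])
```

Because both rows compare as equal, no particular order between them is guaranteed, which is consistent with ďa being returned before dá.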

Suggested fix:
(Repeating the last paragraph of the description.) It would be good to have it implemented as a full ordering according to the specific language, not just a partial order that unifies certain letters (something like the Slovenian collation: http://www.collation-charts.org/mysql60/mysql604.utf8_slovenian_ci.html).
[12 May 2009 9:28] Sveta Smirnova
Thank you for the reasonable feature request.
[29 Jul 2010 13:29] Alexander Barkov
ICU collation customization for Slovak

Attachment: sk.xml (text/xml), 1.43 KiB.

[29 Jul 2010 13:36] Alexander Barkov
utf8_slovak_ci is implemented according to the collation
defined in ICU's sk.xml (attached), in these tags:

<collation type="standard" draft="true" alt="proposed">
...
</collation>

sk.xml also contains another collation definition, here:
<collation type="standard">
...
</collation>
These rules seem to match what Marek describes.