Bug #55970 incorrect implementation of sorting in utf8_slovak_ci
Submitted: 13 Aug 2010 13:19 Modified: 18 Aug 2010 8:11
Reporter: Stanislav LOFAJ
Status: Verified
Category:MySQL Server Severity:S4 (Feature request)
Version:6.0.11-alpha, 5.1.49 etc. OS:Any
Assigned to: Assigned Account CPU Architecture:Any
Tags: collation, Contribution, server, slovak

[13 Aug 2010 13:19] Stanislav LOFAJ
Description:
Comparison under collation utf8_slovak_ci is incorrect for the letters ď,ť,ň,ĺ,ľ,é,ŕ,ú and others.
I found the following implementation in the MySQL 5.1.49 source code, but I think it is incomplete.
strings/ctype-uca.c
static const char slovak[]=
    "& A < \\u00E4 <<< \\u00C4"
    "& C < \\u010D <<< \\u010C"
    "& H < ch <<< Ch <<< CH"
    "& O < \\u00F4 <<< \\u00D4"
    "& S < \\u0161 <<< \\u0160"
    "& Z < \\u017E <<< \\u017D";

For example, the rules do not distinguish d from ď, although ď is a completely different letter from d, just as š is from s.

How to repeat:
CREATE TABLE sk_test
(ID INT NOT NULL AUTO_INCREMENT ,
hodnota VARCHAR( 255 ) CHARACTER SET utf8  COLLATE utf8_slovak_ci  not null, 
PRIMARY KEY (ID)
) ENGINE=InnoDB CHARACTER SET utf8 COLLATE utf8_slovak_ci  ;

insert into sk_test (hodnota) values ('s');
insert into sk_test (hodnota) values ('š');
insert into sk_test (hodnota) values ('d');
insert into sk_test (hodnota) values ('ď');

select * from sk_test where hodnota ="d"; -- INCORRECT: returns 2 rows, d and ď
select * from sk_test where hodnota ="ď"; -- INCORRECT: returns 2 rows, d and ď

select * from sk_test where hodnota ="š"; -- CORRECT: returns 1 row, š
select * from sk_test where hodnota ="s"; -- CORRECT: returns 1 row, s

Suggested fix:

My suggestion is to implement collation utf8_slovak_ci as follows:

  <collation name="utf8_slovak_ci"		id="XXXXX">
    <rules>
      <reset>A</reset><p>á</p><t>Á</t><p>ä</p><t>Ä</t>
      <reset>C</reset><p>č</p><t>Č</t>
      <reset>D</reset><p>ď</p><t>Ď</t><p>dz</p><t>Dz</t><t>DZ</t><p>dž</p><t>Dž</t><t>DŽ</t>
      <reset>E</reset><p>é</p><t>É</t>
      <reset>H</reset><p>ch</p><t>Ch</t><t>CH</t>
      <reset>I</reset><p>í</p><t>Í</t>
      <reset>L</reset><p>ĺ</p><t>Ĺ</t><p>ľ</p><t>Ľ</t>
      <reset>N</reset><p>ň</p><t>Ň</t>
      <reset>O</reset><p>ó</p><t>Ó</t><p>ô</p><t>Ô</t>
      <reset>R</reset><p>ŕ</p><t>Ŕ</t>
      <reset>S</reset><p>š</p><t>Š</t>
      <reset>T</reset><p>ť</p><t>Ť</t>
      <reset>U</reset><p>ú</p><t>Ú</t>
      <reset>Y</reset><p>ý</p><t>Ý</t>
      <reset>Z</reset><p>ž</p><t>Ž</t>
    </rules>
  </collation>
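
The ordering these rules imply can be sketched in a few lines of Python. This is only a rough model under stated assumptions, not MySQL code: the names TAILORING, build_weights and sort_key are hypothetical, every `<p>` element is taken as a new primary position right after the preceding one, every `<t>` (case) variant is folded away by lowercasing, and characters outside the modelled alphabet all share one trailing weight.

```python
import unicodedata

BASE = "abcdefghijklmnopqrstuvwxyz"

# Base letter -> elements that follow it at the primary level, in rule order
# (lowercase forms only; taken from the <p> elements of the suggested rules).
TAILORING = {
    "a": ["á", "ä"], "c": ["č"], "d": ["ď", "dz", "dž"], "e": ["é"],
    "h": ["ch"], "i": ["í"], "l": ["ĺ", "ľ"], "n": ["ň"],
    "o": ["ó", "ô"], "r": ["ŕ"], "s": ["š"], "t": ["ť"],
    "u": ["ú"], "y": ["ý"], "z": ["ž"],
}

def build_weights():
    """Assign an increasing rank to each letter or contraction."""
    order = []
    for letter in BASE:
        order.append(letter)
        order.extend(TAILORING.get(letter, []))
    return {element: rank for rank, element in enumerate(order)}

WEIGHTS = build_weights()

def sort_key(word):
    """Case-insensitive key; two-letter contractions are matched first."""
    word = unicodedata.normalize("NFC", word).lower()
    key, i = [], 0
    while i < len(word):
        pair = word[i:i + 2]
        if len(pair) == 2 and pair in WEIGHTS:
            key.append(WEIGHTS[pair])   # contraction such as "ch" or "dž"
            i += 2
        else:
            # Unknown characters all get the same trailing weight.
            key.append(WEIGHTS.get(word[i], len(WEIGHTS)))
            i += 1
    return tuple(key)
```

With this model, `sort_key("d") != sort_key("ď")`, so a comparison of d with ď would no longer match, while `sort_key("Ď") == sort_key("ď")` keeps the comparison case-insensitive.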
[13 Aug 2010 18:22] Sveta Smirnova
Thank you for the report.

According to this table: http://www.collation-charts.org/mysql60/mysql604.utf8_slovak_ci.html this is not a bug, but I'll ask our collation experts to look into this report.
[16 Aug 2010 8:36] Alexander Barkov
CLDR collation description for Slovak

Attachment: sk.xml (text/xml), 1.43 KiB.

[16 Aug 2010 9:01] Alexander Barkov
utf8_slovak_ci is implemented according to Unicode's Common Locale Data Repository. See the definition file sk.xml in the "Files" section of
this bug report.

sk.xml has two versions of the collation.
The first version is marked as:
<collation type="standard">
...
</collation>

The second version is marked as:
<collation type="standard" draft="true" alt="proposed">
...
</collation>

MySQL implements the second version, which says that 
only letters ä,č,ô,š,ž are separate letters, and the
other accented letters have their default Unicode sorting.
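
As a rough illustration of what that means for comparisons, here is a small Python model. It is a sketch under assumptions, not MySQL's implementation: the names SEPARATE_LETTERS, primary_key and equal_ci are made up for this example, the "ch" contraction is left out, and "default Unicode sorting" is approximated by dropping accents through canonical decomposition and comparing base letters only.

```python
import unicodedata

# Letters with their own primary weight under the implemented rules
# (lowercase forms; the "ch" contraction is not modelled here).
SEPARATE_LETTERS = set("äčôšž")

def primary_key(text):
    """Approximate key: case folded, accents dropped unless tailored."""
    key = []
    for ch in unicodedata.normalize("NFC", text).lower():
        if ch in SEPARATE_LETTERS:
            key.append(ch)
        else:
            # Keep only the base character of the canonical decomposition.
            key.append(unicodedata.normalize("NFD", ch)[0])
    return "".join(key)

def equal_ci(a, b):
    """True if the two strings compare equal in this simplified model."""
    return primary_key(a) == primary_key(b)
```

In this model `equal_ci("d", "ď")` is true and `equal_ci("s", "š")` is false, which matches the query results in the report.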

Oracle agrees:
http://www.collation-charts.org/oracle10g/ora10g.EE8MSWIN1250.XSLOVAK.html

Microsoft agrees:
http://www.collation-charts.org/vista/vista.041B.CP1250.Slovak_Slovakia.html

The first version from sk.xml additionally treats the letters đ,ł,ř,ż
as separate letters from their non-accented counterparts d,l,r,z.

However, neither of the two collations says that ď,ť,ň,ĺ,ľ,é,ŕ,ú must be separate letters.

So from what I can see, you need an accent-sensitive version of the Slovak collation.
Accent-sensitive collations with good sorting are currently on our TODO list
and require this task to be done first:
http://forge.mysql.com/worklog/task.php?id=896

In the meantime, you can define your own version in the Index.xml file,
which can also redefine the order of the letters ď,ť,ň,ĺ,ľ,é,ŕ,ú.
[16 Aug 2010 9:12] Alexander Barkov
I just noticed that the latest copy of sk.xml defines
only a single collation version:

http://unicode.org/cldr/trac/browser/trunk/common/collation/sk.xml

which is exactly what MySQL implements.
[16 Aug 2010 15:45] Sveta Smirnova
Thank you for the report.

This is a feature request. Verifying it as such.