Bug #16373 problem with sorting croatian letters
Submitted: 11 Jan 2006 13:56 Modified: 19 May 2006 11:28
Reporter: Tomislav Rajaković
Status: Not a Bug
Category: MySQL Server: Charsets Severity: S3 (Non-critical)
Version: 5.0.19-BK, 5.0.16 OS: Linux (Linux, Windows)
Assigned to: Alexander Barkov CPU Architecture: Any

[11 Jan 2006 13:56] Tomislav Rajaković
Description:
MySQL does not correctly handle Croatian letters with diacritics ("č", "ć", "đ", "š", "ž") when sorting. Almost all of these letters end up in the wrong places.

How to repeat:
-- No trailing comma after the last definition; table name is used in one
-- consistent case because table names are case-sensitive on Linux.
CREATE TABLE words
(
  Word   VARCHAR(40) NOT NULL,
  UNIQUE INDEX(Word(40))
) ENGINE=MYISAM CHECKSUM=1 CHARACTER SET latin2 COLLATE latin2_croatian_ci;
INSERT INTO words (Word) VALUES ('abc');
INSERT INTO words (Word) VALUES ('bbc');
INSERT INTO words (Word) VALUES ('čbc');
INSERT INTO words (Word) VALUES ('ćbc');
INSERT INTO words (Word) VALUES ('zzz');
INSERT INTO words (Word) VALUES ('žzz');
-- Rows should come back in Croatian alphabetical order.
SELECT Word FROM words ORDER BY Word;

Suggested fix:
The Croatian alphabet has 30 letters, plus the non-Croatian letters ("X", "Y", "W") that we use in foreign words and phrases, so they are effectively part of the alphabet and still need their places.

Order should be: A,B,C,Č,Ć,D,DŽ,Đ,E,F,G,H,I,J,K,L,LJ,M,N,NJ,O,P,Q,R,S,Š,T,U,V,W,X,Y,Z,Ž

Note that "DŽ" is one letter (one sound) that consists of two characters, "D" and "Ž"; the same holds for "LJ" and "NJ". Whenever "DŽ", "LJ" or "NJ" occurs, we treat it as one letter, without exception.
[11 Jan 2006 17:11] Valeriy Kravchuk
Thank you for the bug report. Verified just as described on latest 5.0.19-BK on Linux:

mysql> CREATE TABLE words (Word   VARCHAR(40) NOT NULL,   UNIQUE INDEX(Word(40))) ENGINE=MYISAM CHECKSUM=1 CHARACTER SET latin2 COLLATE latin2_croatian_ci;
Query OK, 0 rows affected (0.03 sec)
 
mysql> INSERT INTO words (Word) VALUES ('abc');
Query OK, 1 row affected (0.00 sec)
 
mysql> INSERT INTO words (Word) VALUES ('bbc');
Query OK, 1 row affected (0.00 sec)
 
mysql> INSERT INTO words (Word) VALUES ('čbc');
Query OK, 1 row affected (0.00 sec)
 
mysql> INSERT INTO words (Word) VALUES ('ćbc');
Query OK, 1 row affected, 1 warning (0.00 sec)
 
mysql> INSERT INTO words (Word) VALUES ('zzz');
Query OK, 1 row affected (0.00 sec)
 
mysql> INSERT INTO words (Word) VALUES ('žzz');
Query OK, 1 row affected, 1 warning (0.01 sec)
 
mysql> show warnings;
+---------+------+-------------------------------------------+
| Level   | Code | Message                                   |
+---------+------+-------------------------------------------+
| Warning | 1265 | Data truncated for column 'Word' at row 1 |
+---------+------+-------------------------------------------+
1 row in set (0.00 sec)
 
mysql> select word from words order by word;
+------+
| word |
+------+
| ??zz |
| �?bc |
| abc  |
| čbc |
| bbc  |
| zzz  |
+------+
6 rows in set (0.00 sec)
 
mysql> select version();
+-----------+
| version() |
+-----------+
| 5.0.19    |
+-----------+
1 row in set (0.00 sec)
 
So, there are obvious problems with this collation.
[21 Jan 2006 17:15] Tomislav Rajaković
Yeah, I forgot something...

The non-Croatian letters are ("Q", "X", "Y", "W"); I forgot "Q".
[15 Mar 2006 16:27] Vlatko Šurlan
I have found some info on this, perhaps even a workaround, but I haven't tested it:
http://www.ambra.rs.ba/
[19 May 2006 11:28] Alexander Barkov
Dear Tomislav,

That's true, latin2_croatian_ci does not support
double letters (known as "contractions"). This is a simplified
version, which was intentionally written this way and which
provides faster sorting than a version with contractions
would.

However, I do agree that it would be nice to have the "real" Croatian
collations, so that one would be able to choose between the correct sort order
(which is a bit slower) and the faster version (which does not support contractions).
So I added a "Create real Croatian collations" task to our TODO.
Thanks for requesting this feature!

About the other letters, I don't agree that it sorts most of the letters in the wrong order.
It does sort all accented letters in their proper places, exactly as
you describe:

A,B,C,Č,Ć,D,Đ,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S,Š,T,U,V,W,X,Y,Z,Ž

Please see collation chart here:

http://myoffice.izhnet.ru/bar/~bar/charts/latin2_croatian_ci.html
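
One way to check the relative order of individual letters without depending on the terminal encoding is to compare hex literals directly. This is only a sketch: it assumes the standard ISO 8859-2 byte values (E8 for "č", E6 for "ć", BE for "ž"), the column aliases are made up for readability, and the statement was not run as part of this report.

```sql
-- Each column is 1 if the left operand sorts before the right one.
-- The _latin2 introducer marks a literal as latin2 regardless of the
-- connection character set; the COLLATE clause on one operand makes the
-- whole comparison use latin2_croatian_ci.
SELECT
  _latin2 'c'   < _latin2 X'E8' COLLATE latin2_croatian_ci AS c_lt_ccaron,
  _latin2 X'E8' < _latin2 X'E6' COLLATE latin2_croatian_ci AS ccaron_lt_cacute,
  _latin2 X'E6' < _latin2 'd'   COLLATE latin2_croatian_ci AS cacute_lt_d,
  _latin2 'z'   < _latin2 X'BE' COLLATE latin2_croatian_ci AS z_lt_zcaron;
```

If the collation orders the letters as listed above, all four columns should be 1.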

If you get letters in a different order, most likely your character set
settings are misconfigured. Please start by checking with "show variables like 'character_set%'".
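
A minimal sketch of that check, assuming the table from the report (words) and a terminal that really sends and displays ISO 8859-2 bytes. If the terminal uses UTF-8 instead, SET NAMES utf8 would be the matching choice. The byte values in the comments are the standard ISO 8859-2 codes and are not taken from this report.

```sql
-- See what the server assumes about the client, the connection and the results.
SHOW VARIABLES LIKE 'character_set%';
SHOW VARIABLES LIKE 'collation%';

-- Assumption: the client works in latin2. SET NAMES sets
-- character_set_client, character_set_connection and
-- character_set_results in one statement.
SET NAMES latin2;

-- Look at the bytes that were actually stored. In latin2, 'č' is E8,
-- 'ć' is E6 and 'ž' is BE; other values, or '?' (3F), suggest that the rows
-- were inserted while the client character set did not match the terminal.
SELECT Word, HEX(Word) FROM words ORDER BY Word;
```

Rows inserted under a mismatched setting stay damaged, so they would need to be deleted and inserted again after the session settings are corrected.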

I'm closing this report as not a bug.