Bug #13421 problem with sorting turkish
Submitted: 23 Sep 2005 10:14 Modified: 9 Dec 2005 0:14
Reporter: Peter Lemmens Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server Severity:S3 (Non-critical)
Version:mysql 4.1.14 OS:Microsoft Windows (Windows 2000 Server)
Assigned to: Alexander Barkov CPU Architecture:Any

[23 Sep 2005 10:14] Peter Lemmens
Description:
MySQL does not correctly handles the "â" character when sorting turkish.
The character gets sorted after the "z" but it must come between the a and the b.

The same goes for î which should come between the i and the j and for for û which should come between the u and the ü.

How to repeat:
CREATE TABLE words
(
  Word   VARCHAR(40) NOT NULL,
  UNIQUE INDEX(Word(40)),
) ENGINE=MYISAM CHECKSUM=1 CHARACTER SET latin5 COLLATE latin5_turkish_ci;
INSERT INTO Words (Word) VALUES ('adet');
INSERT INTO Words (Word) VALUES ('âdet');
INSERT INTO Words (Word) VALUES ('bîat');
INSERT INTO Words (Word) VALUES ('bîaman');
INSERT INTO Words (Word) VALUES ('zaman');
SELECT Word FROM Words ORDER BY Word;
[23 Sep 2005 10:25] Vasily Kishkin
Thanks for the bug report. I was able to reproduce the bug. I attached screen copy of result.
[23 Sep 2005 10:25] Vasily Kishkin
screen copy

Attachment: screencopy.JPG (image/jpeg, text), 56.23 KiB.

[28 Sep 2005 7:18] Alexander Barkov
Peter, you're right, there is something wrong with Turkish collation for Latin5.
We need your suggestion for what these queries should return:

SELECT "a"= "â", "a" < "â";
SELECT "i"="î", "i" < "î";
SELECT "u" = "û", "u" < "û";
SELECT "û" = "ü", "û" < "ü";

Also, how should other accented letters (which are part of latin5 but not used by Tuskish
alphabet) be sorted and compated?  Should the be considered equal to base
un-accented letter? For example:
SELECT "a" = "ä", "a" < "ä";
SELECT "u" = "ù", "u" < "ù";
etc.

Thanks!

P.S.

I believe utf8_turkish_ci works fine.
We can fix latin5_turkish_ci to produce about the same sorting order.
What do you think?
[30 Sep 2005 21:17] Peter Lemmens
When looking at turkish dictionaries you will see the words sorted like this : 

hala, hâlâ, ham

The a is considered kinda equal to â but the a still comes before the â when sorting the same word.

So based on this the query
  SELECT "a" = "ä", "a" < "â"
should return
  1 1

SELECT "a"= "â", "a" < "â";    > 1 1
SELECT "i"="î", "i" < "î";    > 1 1
SELECT "u" = "û", "u" < "û";    > 1 1
SELECT "û" = "ü", "û" < "ü";    > 0 1

I tried the utf8_turkish_ci but i was unable to insert both "hala" and "hâlâ". They were considered duplicate.

I, personally, would treat the other accented letters the same way as â etc.
[25 Oct 2005 7:32] Alexander Barkov
The change has been commited:

http://lists.mysql.com/internals/31429
[7 Dec 2005 14:55] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/11
[7 Dec 2005 15:46] Alexander Barkov
Fixed in 4.1.17 and 5.0.17.

A change log commend suggestion:

Changed order of the letters "A WITH CIRCUMFLEX", "I WITH CIRCUMLEX"
and "U WITH CIRCUMFLEX" in latin5_turkish_ci collation.  Those who have used
these letters must rebuild indexes.
[9 Dec 2005 0:14] Paul DuBois
Noted in 4.1.17, 5.0.17 changelogs.