MySQL Bugs: #13421: problem with sorting turkish

Bug #13421	problem with sorting turkish
Submitted:	23 Sep 2005 10:14	Modified:	9 Dec 2005 0:14
Reporter:	Peter Lemmens	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server	Severity:	S3 (Non-critical)
Version:	mysql 4.1.14	OS:	Windows (Windows 2000 Server)
Assigned to:	Alexander Barkov	CPU Architecture:	Any

Description:
MySQL does not correctly handles the "â" character when sorting turkish.
The character gets sorted after the "z" but it must come between the a and the b.

The same goes for î which should come between the i and the j and for for û which should come between the u and the ü.

How to repeat:
CREATE TABLE words
(
  Word   VARCHAR(40) NOT NULL,
  UNIQUE INDEX(Word(40)),
) ENGINE=MYISAM CHECKSUM=1 CHARACTER SET latin5 COLLATE latin5_turkish_ci;
INSERT INTO Words (Word) VALUES ('adet');
INSERT INTO Words (Word) VALUES ('âdet');
INSERT INTO Words (Word) VALUES ('bîat');
INSERT INTO Words (Word) VALUES ('bîaman');
INSERT INTO Words (Word) VALUES ('zaman');
SELECT Word FROM Words ORDER BY Word;

Thanks for the bug report. I was able to reproduce the bug. I attached screen copy of result.

screen copy

Attachment: screencopy.JPG (image/jpeg, text), 56.23 KiB.

Peter, you're right, there is something wrong with Turkish collation for Latin5.
We need your suggestion for what these queries should return:

SELECT "a"= "â", "a" < "â";
SELECT "i"="î", "i" < "î";
SELECT "u" = "û", "u" < "û";
SELECT "û" = "ü", "û" < "ü";

Also, how should other accented letters (which are part of latin5 but not used by Tuskish
alphabet) be sorted and compated?  Should the be considered equal to base
un-accented letter? For example:
SELECT "a" = "ä", "a" < "ä";
SELECT "u" = "ù", "u" < "ù";
etc.

Thanks!

P.S.

I believe utf8_turkish_ci works fine.
We can fix latin5_turkish_ci to produce about the same sorting order.
What do you think?

When looking at turkish dictionaries you will see the words sorted like this : 

hala, hâlâ, ham

The a is considered kinda equal to â but the a still comes before the â when sorting the same word.

So based on this the query
  SELECT "a" = "ä", "a" < "â"
should return
  1 1

SELECT "a"= "â", "a" < "â";    > 1 1
SELECT "i"="î", "i" < "î";    > 1 1
SELECT "u" = "û", "u" < "û";    > 1 1
SELECT "û" = "ü", "û" < "ü";    > 0 1

I tried the utf8_turkish_ci but i was unable to insert both "hala" and "hâlâ". They were considered duplicate.

I, personally, would treat the other accented letters the same way as â etc.

The change has been commited:

http://lists.mysql.com/internals/31429

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/11

Fixed in 4.1.17 and 5.0.17.

A change log commend suggestion:

Changed order of the letters "A WITH CIRCUMFLEX", "I WITH CIRCUMLEX"
and "U WITH CIRCUMFLEX" in latin5_turkish_ci collation.  Those who have used
these letters must rebuild indexes.

Noted in 4.1.17, 5.0.17 changelogs.