Bug #30675 Incorrect order of letters in result
Submitted: 28 Aug 2007 17:53 Modified: 29 Nov 2007 14:25
Reporter: huji huji Email Updates:
Status: Not a Bug Impact on me:
None 
Category:MySQL Server: Charsets Severity:S3 (Non-critical)
Version:5.0.41 OS:Any
Assigned to: CPU Architecture:Any
Tags: collation, Persian, Unicode

[28 Aug 2007 17:53] huji huji
Description:
When you try to sort the results of a query based on the text in one field in ascending order, and that field has Persian text in it, entered in unicode, the order of the results doesn't match the official order of Persian letters.

The order is wrong with both utf8_persian_ci and utf8_unicode_ci collations. I don't know where the collations are defined so I cannot provide you with a corrected version of them, nevertheless, I would be grateful to help with it.

How to repeat:
If you create a text field with one of the Persian letters in each of its rows, and then sort them in ascending order, the order of the returned letters is incorrect, compared with that in Persian language.
[28 Aug 2007 17:56] huji huji
I should also add that the problem persists even if you use collate in the select query. The only temporary remedy (which doesn't fix the problem totally, but partially) is to use ...order by BINARY(colName) which is logically not a "solution".
[28 Aug 2007 18:05] Sveta Smirnova
Thank you for the report.

Please indicate version of MySQL server you use.

Also please provide repeatable test case: we need CREATE TABLE statement, your character set/collation settings, test data in the table.
[29 Aug 2007 13:35] huji huji
Okay. The version I'm using locally is 5.0.41-community-nt but I have also reproduced the same problem on my hosted web site, which uses 5.0.27-standard.

I have created a database with CHARACTER SET utf8 COLLATION utf8_persian_ci. Then I have create a table like this:

CREATE TABLE test (txt nvarchar(10) primary key);

Then, using PHP, I have added rows to this column, each row containing one Persian letter. Then, I used these commands to select from that table:

SELECT txt from test order by txt ASC;

SELECT txt from test order by BINARY(txt) ASC;

SELECT txt from test order by txt collate utf8_persian_ci;

SELECT txt from test order by txt collate utf8_unicode_ci;

And so on. None of them returns the rows in correct order, per official Persian alphabet order.

Feel free to ask for other information. I can even upload sample code to somewhere, if you ask.
[30 Aug 2007 10:56] Sveta Smirnova
Thank you for the feedback.

Please check if your problem is same as bug #29977 which is feature request really. If not, please indicate which version of Persian alphabet you use.
[30 Aug 2007 19:17] huji huji
No. This bug is not the same as bug #29977. That bug only pertains to a minor portion (minor, but important) of the problems we have with utf8_persian_ci collation. The bug I reported here, is a major one.

You asked "which version of Persian alphabet" I'm using. I'm using the only official version of Persian alphabet. The details can be found here: http://en.wikipedia.org/wiki/Persian_alphabet
[9 Sep 2007 11:22] Roozbeh Pournader
Huji, Would you please post a minimal test case in SQL? Use something like this:

set names utf8;
drop table if exists ...;
create table ...;
insert into ... values ...;
select .... order by .... collate utf8_persian_ci;

Otherwise, we can't really know what has stopped working exactly.
[9 Sep 2007 11:43] huji huji
You already did that, Roozbeh!

I created a database with: .... CHARACTER SET utf8 COLLATE utf8_persian_ci;

Then I created a table (named "test"), with a column named "txt".

I inserted all persian letters into it:

inert into test set txt='ب'; etc

then:

select txt from test ... with all the variations I told you above.
[28 Sep 2007 8:17] huji huji
any progress?
[29 Sep 2007 18:15] Valeriy Kravchuk
Looks like a duplicate of Bug #29977. Please, check.
[30 Sep 2007 9:11] huji huji
Sorry, but no. It was previously told on the same thread too, but this bug is not a duplicated of that bug. It is a more generla issue.
[29 Nov 2007 10:07] Susanne Ebrecht
To summarize the problem:

Generally, for Persian utf8_persian_ci is a good choice for collation. Besides bug #29977 of course.

The problem here is, that you have the opinion, that the data aren't sort in the best way by using utf8_persian_ci. But when you use 'binary()' all will be fine.
Binary() just means, that utf8_bin will be used.

Do I understand this problem in the right way?

Did you consider, that all collations, which ends with _ci are case insensitive?
The "ci" stands for case insensitive and "cs" for case sensitive.

Case insensitive means for example:
'A' is the same as 'a'. When you have the following data:
'abc', 'Ade', 'bcd', 'Bef', 'cde', 'Cfg'

when you use a case insensitive collation, they will sort to:
'abc', 'Ade', 'bcd', 'Bef', 'cde', 'Cfg'

when you use binary collation or case sensitive collation, you will get:
'Ade', 'Bef', 'Cfg', 'abc', 'bcd', 'cde'

Is this your problem? You want a case sensitive collation instead a case insensitive collation?

In this case, just use utf8_bin as collation. 
You also can make your own collation: http://forge.mysql.com/w/images/b/b7/HowToAddACollation.pdf

If it's just the case insensitive mode that confused you, then I want to close this bug. Is this ok?
[29 Nov 2007 12:53] huji huji
Thank you Sussane, for your summarizing the case. There are some corrections needed:

>> But when you use 'binary()' all will be fine.
Nope. However, binary() is the "best available" option. It still puts پ and چ and ژ and گ after all other letters, which is not correct. However, it works much better than utf8_persian_ci in my tests.

>> Did you consider, that all collations, which ends with _ci are case insensitive?
Yes, and as Persian letters do not have upper- and lower-case forms (unlike most Latin based alphabets) this has nothing to do here.

So you see, my problem is not about the case of the letters; it is the order of the letters which should match that of Persian alphabet, and does not.

I will try my best to paste the full procedures of creating a DB, then a TABLE, then filling it, and then running selects, to show you what the problem is (as soon as I find some free time).

Regards,

huji
[29 Nov 2007 14:25] huji huji
I'm closing this bug, because this is not a bug in MySQL. It is a problem with PHP not supporting unicode natively which caused this issue.