Bug #30277 Collation for Persian letters
Submitted: 7 Aug 2007 16:44 Modified: 24 Aug 2007 11:38
Reporter: huji huji
Status: Not a Bug
Category: MySQL Server: Charsets Severity: S3 (Non-critical)
Version: 5.x OS: Any
Assigned to: CPU Architecture: Any
Tags: collation, Persian, Unicode, UTF-8

[7 Aug 2007 16:44] huji huji
Description:
With the popularity of Unicode (in particular the utf-8 character set), many developers use it as the default character set for their MySQL databases. This includes those dealing with content in the Persian language.

The portion of utf-8 that covers the Arabic and Persian languages doesn't fully support the correct order of letters in Persian. Persian has four more letters than Arabic (sounding like P, Ch, Zh, and G) and, in addition, the standard characters used for two letters (those sounding like K and Y) differ between Persian and Arabic. Unfortunately, the order of the characters in the utf-8 charset is "Arabic letters first, then the Persian-specific ones".

How to repeat:
Create a table with one column; in each row of it, add one of the Persian letters, as follows:

ا ب پ ت ث ج چ ح خ د ذ ر ز ژ س ش ص ض ط ظ ع غ ف ق ک گ ل م ن و ه ی

Then run SELECT column_name FROM table_name ORDER BY column_name ASC. You will see that the result comes back in a different order: پ چ ژ گ are shown after all the others.

Another example is this: go to Persian Wikipedia and find a category with lots of articles in it. You will notice that articles are sorted in the wrong order, with those starting with پ or چ or ژ or گ after all the others.
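To make the reported ordering concrete without a database, here is a minimal Python sketch. It assumes the comparison is done purely by Unicode code point, which is what Python's sorted() does on strings and what a binary collation does; other collations may order things differently. The names PERSIAN_ALPHABET, by_code_point and misplaced_tail are made up for this illustration. Letters are written as escapes so the exact code points are unambiguous.

```python
# The 32 Persian letters in their correct alphabetical order,
# written as code point escapes (alef, be, pe, te, ... he, ye).
PERSIAN_ALPHABET = (
    "\u0627\u0628\u067E\u062A\u062B\u062C\u0686\u062D"
    "\u062E\u062F\u0630\u0631\u0632\u0698\u0633\u0634"
    "\u0635\u0636\u0637\u0638\u0639\u063A\u0641\u0642"
    "\u06A9\u06AF\u0644\u0645\u0646\u0648\u0647\u06CC"
)

# sorted() on a string compares characters by code point,
# which is the assumption this sketch is built on.
by_code_point = sorted(PERSIAN_ALPHABET)

# The Persian-specific letters (pe, che, zhe, Persian kaf, gaf,
# Persian ye) have higher code points than the shared Arabic
# letters, so they end up at the tail.
misplaced_tail = by_code_point[-6:]

print("alphabetical :", " ".join(PERSIAN_ALPHABET))
print("by code point:", " ".join(by_code_point))
print("pushed to end:", " ".join(misplaced_tail))
```

Under this assumption the Persian forms of K and Y land at the end as well, not only پ چ ژ گ.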

Suggested fix:
I think there are a few approaches available:

- The order of the utf-8 charset (collation?) can be changed. This has no disadvantage in itself (for example, those working with Arabic content will see no change). The disadvantage (if we call it one) is that the utf-8 charset will no longer be standard. However, since the utf-8 standard itself cannot be changed (it is fixed), this is the only remedy. Or...

- Another charset could be created and bundled with MySQL, named utf-8-Persian-fix for example. This way the standard utf-8 is not touched, and people working with Persian content can easily switch collations at the start.

- The third solution is to have a page in the documentation/bugs/anywhere where the collation fix is made available for download, along with instructions for applying it. This way, at least those Persian developers who are wise enough will find the documentation page and apply the remedy.
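What a Persian-aware ordering would have to do can be sketched in a few lines of Python. This is only an illustration of the idea, not how MySQL implements collations: each letter gets a weight equal to its position in the Persian alphabet, and the Arabic forms of K and Y (U+0643, U+064A) are assumed to share the weight of their Persian counterparts. The names WEIGHTS and persian_sort_key and the sample words are hypothetical.

```python
# Persian alphabet in the correct order, as code point escapes.
PERSIAN_ALPHABET = (
    "\u0627\u0628\u067E\u062A\u062B\u062C\u0686\u062D"
    "\u062E\u062F\u0630\u0631\u0632\u0698\u0633\u0634"
    "\u0635\u0636\u0637\u0638\u0639\u063A\u0641\u0642"
    "\u06A9\u06AF\u0644\u0645\u0646\u0648\u0647\u06CC"
)

# Hypothetical weight table: weight = position in the alphabet.
WEIGHTS = {ch: i for i, ch in enumerate(PERSIAN_ALPHABET)}
# Assumption: Arabic kaf and yeh sort like the Persian forms.
WEIGHTS["\u0643"] = WEIGHTS["\u06A9"]
WEIGHTS["\u064A"] = WEIGHTS["\u06CC"]


def persian_sort_key(text):
    """Map a string to a list of weights usable as a sort key.

    Characters without a weight sort after every known letter,
    ordered among themselves by code point.
    """
    return [
        (WEIGHTS[ch], 0) if ch in WEIGHTS else (len(WEIGHTS), ord(ch))
        for ch in text
    ]


# Four sample words starting with gaf, lam, pe and te.
words = ["\u06AF\u0644", "\u0644\u0628", "\u067E\u0627", "\u062A\u0627\u0628"]

by_code_point = sorted(words)
persian_sorted = sorted(words, key=persian_sort_key)

print("by code point :", " ".join(by_code_point))
print("Persian order :", " ".join(persian_sorted))
```

With the weight table the words starting with پ and گ move to their alphabetical places, while plain code point sorting leaves them after the others.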

I have no better ideas for now.
[7 Aug 2007 18:22] MySQL Verification Team
QB select order by

Attachment: Persian-sort.png (image/png, text), 18.42 KiB.

[7 Aug 2007 18:31] MySQL Verification Team
Thank you for the bug report. I don't know anything about the Persian language;
could you please tell me if the attached figure shows what you are reporting?
Thanks in advance.
[7 Aug 2007 20:09] huji huji
Well, I noticed there is a collation named utf8_persian_ci, and there are still bugs in it (like http://bugs.mysql.com/bug.php?id=29977&files=1). Anyway, I consider my question answered and want to close it. The benefit is that the next person who searches for the same thing will find it easily.
[7 Aug 2007 20:57] MySQL Verification Team
Thank you for the feedback. I am marking this bug as a duplicate of:
http://bugs.mysql.com/bug.php?id=29977.