Bug #29977 MySQL Persian collation (utf8_persian_ci) incorrectly sorts Harakat
Submitted: 23 Jul 2007 13:28 Modified: 29 Sep 2007 18:14
Reporter: Roozbeh Pournader
Status: Verified
Category:MySQL Server: Charsets Severity:S4 (Feature request)
Version:5.0.22 OS:Linux (CentOS 5)
Assigned to: Assigned Account CPU Architecture:Any
Triage: Triaged: D5 (Feature request)

[23 Jul 2007 13:28] Roozbeh Pournader
Description:
The Persian collation implemented in MySQL as utf8_persian_ci incorrectly treats Harakat (combining Unicode characters from U+064B to U+0652 and more) as normal letters in collation. There are also several minor differences from the accepted standard, as implemented in glibc and ICU.

The error is reflected in both the collation rules set in "strings/ctype-uca.c", where everything is simply separated with a "<" (while there should be "<<"s and "<<<"s around), and the test data set in mysql-test/r/ctype_uca.result, specifically the strings 063306500631 and 064706500646064A064606AF, where the strings are sorted as if U+0650 were a letter by itself, instead of a mere diacritic on the preceding letter.
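To illustrate the difference between sorting a Harakat mark as a letter and treating it as a lower-level difference, here is a minimal Python sketch. It is not the MySQL, glibc, or ICU implementation: it uses raw code point order as a stand-in for primary weights, covers only U+064B to U+0652, and reduces the secondary level to a crude tuple of marks. The first string comes from the test data mentioned above; the other two are hypothetical comparison strings made up for this example.

```python
# Harakat covered by this sketch: U+064B..U+0652 only (the report says "and more").
HARAKAT = {chr(cp) for cp in range(0x064B, 0x0653)}


def from_hex(h):
    """Decode a run of 4-digit hex code points, as written in ctype_uca.result."""
    return "".join(chr(int(h[i:i + 4], 16)) for i in range(0, len(h), 4))


def two_level_key(s):
    """Assumed simplification: base letters decide first, marks only break ties."""
    base = "".join(ch for ch in s if ch not in HARAKAT)
    marks = tuple(ch for ch in s if ch in HARAKAT)
    return (base, marks)


with_kasra = from_hex("063306500631")  # string from the report (SEEN, KASRA, REH)
without_kasra = from_hex("06330631")   # hypothetical: same letters, no KASRA
other_word = from_hex("06330646")      # hypothetical: SEEN, NOON

words = [other_word, with_kasra, without_kasra]

# Treating U+0650 as a letter: the marked string is pushed past SEEN+NOON.
as_letters = sorted(words)

# Treating U+0650 as a diacritic: the marked string stays next to its base form.
two_level = sorted(words, key=two_level_key)

for label, result in (("as letters", as_letters), ("two-level", two_level)):
    print(label, [" ".join(f"{ord(c):04X}" for c in w) for w in result])
```

In the two-level ordering the string with KASRA sorts immediately after its unmarked form, which is the behaviour the "<<" relation is meant to express.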

(Random credits: I am the author of the collation data in the GNU C library, the ICU library, the only publicly available textual specification for computer collation of Persian strings, and also a contributor to the Unicode Standard and the Unicode Collation Algorithm, which MySQL is implementing.)

How to repeat:
Change mysql-test/r/ctype_uca.result to reflect the correct ordering for the Persian strings and run "make test"!

Suggested fix:
I am working on a patch for both of the files (the code and the test data) and an extended test data, trying to synchronize the sorting with glibc and ICU as much as possible.
[7 Aug 2007 20:59] Miguel Solorzano
Bug: http://bugs.mysql.com/bug.php?id=30277 was marked as duplicate of
this one.
[24 Aug 2007 13:40] Peter Gulutzan
Jody McIntyre, who submitted the Persian patch, did
check ICU. See
http://lists.mysql.com/internals/15841

These are the Unicode characters between 064B and 0652:
064B;ARABIC FATHATAN
064C;ARABIC DAMMATAN
064D;ARABIC KASRATAN
064E;ARABIC FATHA
064F;ARABIC DAMMA
0650;ARABIC KASRA
0651;ARABIC SHADDA
0652;ARABIC SUKUN
http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
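The same properties can be checked locally. The following is a small sketch using Python's standard unicodedata module (which ships its own copy of the Unicode Character Database, so its version may differ from the file linked above); it prints the name, general category, and canonical combining class for each code point in the range.

```python
import unicodedata

# Collect (code point, name, general category, combining class) for U+064B..U+0652.
rows = []
for cp in range(0x064B, 0x0653):
    ch = chr(cp)
    rows.append((cp, unicodedata.name(ch), unicodedata.category(ch),
                 unicodedata.combining(ch)))

for cp, name, category, ccc in rows:
    # A nonzero combining class marks a combining character.
    print(f"{cp:04X};{name};{category};{ccc}")
```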
 
Apparently those characters are Harakat, which are
vowel marks, and are therefore like Hebrew niqqud.
(There is a reference-manual comment about niqqud here:
http://dev.mysql.com/doc/refman/5.1/en/charset-unicode-sets.html
).

Harakat are combining characters, so Bug#29977 is actually
a feature request rather than a bug. Accordingly, I am changing
the severity to S4.

MySQL has two worklog tasks outstanding:
- WL#898 Primary, Secondary and Tertiary Sorts
- WL#3770 Unicode-compliant comparison and sorting of combining characters
Those are not marked as 'private', so they are probably visible on
forge.mysql.com, or they soon will be.

Without first working on those tasks, Harakat may be difficult,
but we wish the best of luck to patch submitters.

Incidentally, I think Bug#30277 "Collation for Persian letters"
is in fact not a duplicate; it is a non-bug. It appears
that the writer of Bug#30277 was merely unaware that
utf8_persian_ci exists.
[29 Sep 2007 18:14] Valeriy Kravchuk
Thank you for a reasonable feature request.