Bug #15913 | utf8 collation does not conform to Unicode Collating Sequence Standard | ||
---|---|---|---|
Submitted: | 21 Dec 2005 17:43 | Modified: | 6 Jan 2006 20:04 |
Reporter: | Rand Childs | Email Updates: | |
Status: | Not a Bug | Impact on me: | |
Category: | MySQL Server | Severity: | S2 (Serious) |
Version: | 5.0.17 | OS: | MacOS (Mac OS X 10.4.3) |
Assigned to: | Alexander Barkov | CPU Architecture: | Any |
[21 Dec 2005 17:43]
Rand Childs
[21 Dec 2005 17:45]
Rand Childs
Oops, The sql statment in the submission is obviously wrong. It should be select data from utf8data order by data collate utf8_general_ci;
[21 Dec 2005 17:56]
Aleksey Kishkin
Rand, some characters in the bug report (3rd char in de‐luge ) can't be read here. Could you write hex code of that word?
[21 Dec 2005 18:03]
Rand Childs
The data is in utf8 but there is only one multi-byte character which is the utf8 hyphen which is Unicode Character 2010, utf8 character E2 80 90. The other is the hyphen-plus which is Unicode Character 002D, utf8 character 2D. Hope this answers your question. Rand
[21 Dec 2005 18:08]
Rand Childs
This also might make it clearer: The data: demark deLuge de-Luge de‐Luge de Luge deluge de-luge de‐luge de luge death is this in hex: Typ=1 Len=6: 64,65,6d,61,72,6b Typ=1 Len=6: 64,65,4c,75,67,65 Typ=1 Len=7: 64,65,2d,4c,75,67,65 Typ=1 Len=9: 64,65,e2,80,90,4c,75,67,65 Typ=1 Len=7: 64,65,20,4c,75,67,65 Typ=1 Len=6: 64,65,6c,75,67,65 Typ=1 Len=7: 64,65,2d,6c,75,67,65 Typ=1 Len=9: 64,65,e2,80,90,6c,75,67,65 Typ=1 Len=7: 64,65,20,6c,75,67,65 Typ=1 Len=5: 64,65,61,74,68 using an Oracle "select dump(data,16) from temp" to output each value in hex. This should clearly illustrate each utf8 character. Rand
[21 Dec 2005 18:40]
Aleksey Kishkin
testcase
Attachment: b15913.sql (text/plain), 355 bytes.
[6 Jan 2006 15:38]
Alexander Barkov
This is not a bug. utf8_general_ci is not designed to follow the Unicode Collation Algorithm (UCA) with its standard weighs. It is a simplified collation which uses exactly one weight element for one character, which makes it much faster. It does not support neither contractions nor expansions, nor ignorable characters. Its main purpose is to be as fast as possible. It provides a reasonable sort order for many cases, but you're right it is not always good. You might want to try utf8_unicode_ci instead, which is a more sofisticated collation based on the UCA. It provides a better sort order but is a bit slower. It sorts this way: de luge de Luge de-luge de-Luge de‐luge de‐Luge death deluge deLuge demark The variable weighting characters are implemented to be non-ignorable in utf8_unicode_ci. Thus, space, hyphen and minus-hypen characters do have a primary weight, and so "de-luge" appears before "death", unlike Oracle. According to your Oracle's order example, Oracle seems to treat them differently, maybe ignorable on the primary and secondary levels and not-ignorable on the tertiary level, but I'm not sure. Also, please note, MySQL currently uses only primary level weights. So, sort order of a small letter and a capital letter is generally speaking unpredictable. Supporting secondary and tertiary weights is on our TODO. But as this is something people ask very very rarely, so it is under a low priority.
[6 Jan 2006 20:04]
Rand Childs
Thank you for your answer. I'll take a look at utf8_unicode_ci. Oracle uses primary, secondary, and tertiary levels, so that is why it treats them differently although even Oracle doesn't appear to exactly match the Unicode Collating Sequence/ISO 14651 standards. I would like to add my vote for MySql to fully supporting the UCS/ISO 14651 collating sequence standards. Our object is to be able to store and display data in multiple langauges using the UTF-8 character set and we would like to order the data from these combined languages according to a standard collating sequence standard. We also want to be able to create and use indexes that are ordered using this standard. It would also be very helpful to be able to easily define variations to this standard for users that want slight variations to the standard to provide better support for their particular primary language. Oracle for example has something called "locale builder" which provides this functionality. Rand H. Childs