Bug #15913 utf8 collation does not conform to Unicode Collating Sequence Standard
Submitted: 21 Dec 2005 17:43 Modified: 6 Jan 2006 20:04
Reporter: Rand Childs Email Updates:
Status: Not a Bug Impact on me:
None 
Category:MySQL Server Severity:S2 (Serious)
Version:5.0.17 OS:MacOS (Mac OS X 10.4.3)
Assigned to: Alexander Barkov CPU Architecture:Any

[21 Dec 2005 17:43] Rand Childs
Description:
Yesterday I installed mysql 5.0.17 on my Mac OS X 10.4.3 system and am trying to understand how utf8 collating works as defined by the UCS (Unicode Collating Sequence Standard) and how it compares to Oracle's GENERIC_M. 

I entered the following list from the UCS report http://www.unicode.org/reports/tr10/ in the section discussing variable weighting in both Oracle configured for utf8 and GENERIC_M and myql 5.0.17 configured for utf8 and utf8_general_ci. Oracle collates the data as follows: 

death 
deluge 
deLuge 
de luge 
de Luge 
de-luge 
de-Luge 
de‐luge 
de‐Luge 
demark 

whereas mysql collates the data as follows: 

+-----------+ 
| de luge | 
| de Luge | 
| de-Luge | 
| de-luge | 
| death | 
| deluge | 
| deLuge | 
| demark | 
| de‐Luge | 
| de‐luge | 
+-----------+ 

and neither collations agree with the table in "variable weighting" for any of the 4 possible variable weighting options. In fact mysql just looks wrong. For example the capital L collates before the lower case l if preceeded by a hyphen-minus or hyphen. In Oracle the hyphen-minus (002D) and the hyphen (2010) appear to collate together after the space in a tertiary position. mysql on the other hand make the space significant and doesn't collate the two hyphen's together. 

How to repeat:
I set up a database, table (type INNODB), and column to all have character set utf8 and collate utf8_general_ci.  Insert the following data in reverse or random order into a varchar(20) column (data) in a table(utf8data) in a database.  Then do an sql query of the column in the table with an order by clause, adding collate if you wish, e.g.,

select data from utf8data collate utf8_general_ci;

Compare the results to both the results produced by Oracle for GENERIC_M and documented in the Unicode Collating Sequence Standard section on "variable weighting" as described here http://www.unicode.org/reports/tr10/ which documents the data.

Note that the data has both a hyphen-minus (002D) and a hypen (2010).  Also note that in mysql the upper case L seems to proceed the lower case l for data that contains one of the hyphens which appears to me to be obviously incorrect for ascending ordering.

Would also suggest that all utf8 collating sequences be check for correct ordering according to the UCS/ISO 14651 unicode collating sequence standards modified for specific languages.

Suggested fix:
Fix utf8 collating to collate correctly based on the Unicode Collating Sequence/ISO 14651 standards.
[21 Dec 2005 17:45] Rand Childs
Oops,

The sql statment in the submission is obviously wrong.  It should be

select data from utf8data order by data collate utf8_general_ci;
[21 Dec 2005 17:56] Aleksey Kishkin
Rand, some characters in the bug report (3rd char in  de‐luge ) can't be read here. Could you write hex code of that word?
[21 Dec 2005 18:03] Rand Childs
The data is in utf8 but there is only one multi-byte character which is the utf8 hyphen which is Unicode Character 2010, utf8 character E2 80 90.  The other is the hyphen-plus which is Unicode Character 002D, utf8 character 2D.

Hope this answers your question.

Rand
[21 Dec 2005 18:08] Rand Childs
This also might make it clearer:

The data:

demark
deLuge
de-Luge
de‐Luge
de Luge
deluge
de-luge
de‐luge
de luge
death

is this in hex:

Typ=1 Len=6: 64,65,6d,61,72,6b
Typ=1 Len=6: 64,65,4c,75,67,65
Typ=1 Len=7: 64,65,2d,4c,75,67,65
Typ=1 Len=9: 64,65,e2,80,90,4c,75,67,65
Typ=1 Len=7: 64,65,20,4c,75,67,65
Typ=1 Len=6: 64,65,6c,75,67,65
Typ=1 Len=7: 64,65,2d,6c,75,67,65
Typ=1 Len=9: 64,65,e2,80,90,6c,75,67,65
Typ=1 Len=7: 64,65,20,6c,75,67,65
Typ=1 Len=5: 64,65,61,74,68

using an Oracle "select dump(data,16) from temp" to output each value in hex.

This should clearly illustrate each utf8 character.

Rand
[21 Dec 2005 18:40] Aleksey Kishkin
testcase

Attachment: b15913.sql (text/plain), 355 bytes.

[6 Jan 2006 15:38] Alexander Barkov
This is not a bug. utf8_general_ci is not designed to follow the Unicode
Collation Algorithm (UCA) with its standard weighs. It is a simplified collation
which uses exactly one weight element for one character, which makes it
much faster. It does not support neither contractions nor expansions, nor
ignorable characters. Its main purpose is to be as fast as possible. It provides
a reasonable sort order for many cases, but you're right it is not always good.

You might want to try utf8_unicode_ci instead, which is a more
sofisticated collation based on the UCA. It provides a better
sort order but is a bit slower. It sorts this way:

de luge
de Luge
de-luge
de-Luge
de‐luge
de‐Luge
death
deluge
deLuge
demark

The variable weighting characters are implemented to be non-ignorable in utf8_unicode_ci.
Thus, space, hyphen and minus-hypen characters do have a primary weight, and so 
"de-luge" appears before  "death", unlike Oracle.
According to your Oracle's order example, Oracle seems to treat them differently,
maybe ignorable on the primary and secondary levels and not-ignorable on
the tertiary level, but I'm not sure.

Also, please note, MySQL currently uses only primary level weights.
So, sort order of a small letter and a capital letter is generally speaking
unpredictable. Supporting secondary and tertiary weights is on our TODO.
But as this is something people ask very very rarely, so it is under a low priority.
[6 Jan 2006 20:04] Rand Childs
Thank you for your answer.  I'll take a look at utf8_unicode_ci.  Oracle uses primary, secondary, and tertiary levels, so that is why it treats them differently although even Oracle doesn't appear to exactly match the Unicode Collating Sequence/ISO 14651 standards.

I would like to add my vote for MySql to fully supporting the UCS/ISO 14651 collating sequence standards.  Our object is to be able to store and display data in multiple langauges using the UTF-8 character set and we would like to order the data from these combined languages according to a standard collating sequence standard.  We also want to be able to create and use indexes that are ordered using this standard. 

It would also be very helpful to be able to easily define variations to this standard for users that want slight variations to the standard to provide better support for their particular primary language.  Oracle for example has something called "locale builder" which provides this functionality.

Rand H. Childs