MySQL Bugs: #28216: sorting multi-word strings using utf8

Bug #28216	sorting multi-word strings using utf8
Submitted:	3 May 2007 8:56	Modified:	1 Jun 2007 18:37
Reporter:	Axel Schwenke	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Documentation	Severity:	S2 (Serious)
Version:	5.0.42, 5.1.18	OS:	Any
Assigned to:	Paul DuBois	CPU Architecture:	Any
Tags:	bfsm_2007_05_31, sorting, utf8

Description:
Sorting strings with multiple words yields wrong results with most utf8 collations. 

Example: Sorting "foo bar", "foo, bar" and "foo-bar" using different collations

latin1 (all collations)

+----------+
| c1       |
+----------+
| foo bar  | 
| foo, bar | 
| foo-bar  | 
+----------+

utf8_bin, utf8_general_ci

+----------+
| c1       |
+----------+
| foo bar  | 
| foo, bar | 
| foo-bar  | 
+----------+

other utf8 collations

+----------+
| c1       |
+----------+
| foo bar  | 
| foo-bar  | 
| foo, bar | 
+----------+

IMNSHO this last sort order is wrong. At least it is inconsistent with latin1 sort order.

How to repeat:
Run the attached testcases:

mysql> \. testcase.latin1
mysql> \. testcase.utf8

Suggested fix:
correct the sort order for the specialized utf8 collations

SQL statements to demonstrate latin1 behaviour

Attachment: testcase.latin1 (application/octet-stream, text), 575 bytes.

SQL statements to demonstrate utf8 behaviour

Attachment: testcase.utf8 (application/octet-stream, text), 1.24 KiB.

Thank you for taking the time to write to us, but this is not a bug. Please double-check the documentation available at http://dev.mysql.com/doc/.

The Default Unicode Collation Element Table ( http://www.unicode.org/Public/UCA/latest/allkeys.txt) of the Unicode Collation Algorithm (http://www.unicode.org/reports/tr10/) contains the following lines:

002C  ; [*0232.0020.0002.002C] # COMMA
002D  ; [*0222.0020.0002.002D] # HYPHEN-MINUS

As you can see,  the COMMA character (ASCII 0x2c, ',') has higher weight that HYPHEN-MINUS character (ASCII 0x2D, '-')

OK, but how that quote from the specification explains this difference, utf8_bin, utf8_general_ci vs. other utf8 collations?

utf8_bin, utf8_general_ci

+----------+
| c1       |
+----------+
| foo bar  | 
| foo, bar | 
| foo-bar  | 
+----------+

other utf8 collations

+----------+
| c1       |
+----------+
| foo bar  | 
| foo-bar  | 
| foo, bar | 
+----------+

Thank you for taking the time to write to us, but this is not a bug. Please double-check the documentation available at http://dev.mysql.com/doc/ and the instructions on how to report a bug at http://bugs.mysql.com/how-to-report.php

This quote doesn't. Our manual does.
utf8_unicode_* sollations implement [a subset of] UCA - the Unicode Collation Algorithm. See the quote above.

utf8_general_* collations do not implement UCA, the above quote from the standard does not apply to them.

Sorry, but I still do not see anything in http://dev.mysql.com/doc/refman/5.1/en/charset-unicode-sets.html that explains how utf8_bin should work, or how implementation (or no implementation) of UCA affects punctuation characters. So, I still think it is a valid documentation request, at least.

The manual says utf8_general_ci is a "legacy" collation.
Changing a collation could require index rebuilds,
there are no immediate plans.

The utf8_general_ci was produced before Bar added the
UCA-based collations in 2004. I have old emails with
denigratory quotes about utf8_general_ci ("useless for
Greek and bad for some other languages including German"
"useless for most people" "the flaws in utf8_general_ci
and ucs2_general_ci are noticeable" "utf8_general_ci is
bad for Ukrainian and bad for Serbian"). I sent an email
to dev-public on 2005-02-28, thread "Re: Need advice on bug 8608"
"I guess there are three opinions about the default Unicode collation.
1. utf8_general_ci, ucs2_general_ci
   Advantage: it's current default, it's faster than utf8_unicode_ci.
   Barkov votes for "1.", I believe.
2. utf8_swedish_ci, ucs2_swedish_ci
   Advantage: comparable to latin1_swedish_ci the 8-bit default
   Gulutzan votes for "2.".
3. utf8_unicode_ci, ucs2_unicode_ci
   Advantage: results are more correct than utf8_general_ci's
   Milivojevic votes for "3.", I believe.
I wonder whether anyone else has a strong opinion about this matter."
Nobody responded, so there was no consensus, so we stayed
with the utf8_general_ci default.

One of these days we should try again, I suppose. But it's no server
bug if different collations are different collations, and it's no
documentation bug if the manual doesn't list every Unicode character
for every collation.

Thank you for your bug report. This issue has been addressed in the documentation. The updated documentation will appear on our website shortly, and will be included in the next release of the relevant products.

As Peter says, collations differ. I will add a paragraph to the charset chapter to indicate that it's best if you test a collation to verify that it sorts as you expect.

      In cases where a character set has multiple collations, it might
      not be clear which collation is most suitable for a given
      application. To avoid choosing the wrong collation, it can be
      helpful to perform some comparisons with representative data
      values to make sure that a given collation sorts values the way
      you expect.