Bug #28216 sorting multi-word strings using utf8
Submitted: 3 May 2007 8:56 Modified: 1 Jun 2007 18:37
Reporter: Axel Schwenke Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Documentation Severity:S2 (Serious)
Version:5.0.42, 5.1.18 OS:Any
Assigned to: Paul DuBois CPU Architecture:Any
Tags: bfsm_2007_05_31, sorting, utf8

[3 May 2007 8:56] Axel Schwenke
Description:
Sorting strings with multiple words yields wrong results with most utf8 collations. 

Example: Sorting "foo bar", "foo, bar" and "foo-bar" using different collations

latin1 (all collations)

+----------+
| c1       |
+----------+
| foo bar  | 
| foo, bar | 
| foo-bar  | 
+----------+

utf8_bin, utf8_general_ci

+----------+
| c1       |
+----------+
| foo bar  | 
| foo, bar | 
| foo-bar  | 
+----------+

other utf8 collations

+----------+
| c1       |
+----------+
| foo bar  | 
| foo-bar  | 
| foo, bar | 
+----------+

IMNSHO this last sort order is wrong. At least it is inconsistent with latin1 sort order.

How to repeat:
Run the attached testcases:

mysql> \. testcase.latin1
mysql> \. testcase.utf8

Suggested fix:
correct the sort order for the specialized utf8 collations
[3 May 2007 8:57] Axel Schwenke
SQL statements to demonstrate latin1 behaviour

Attachment: testcase.latin1 (application/octet-stream, text), 575 bytes.

[3 May 2007 8:57] Axel Schwenke
SQL statements to demonstrate utf8 behaviour

Attachment: testcase.utf8 (application/octet-stream, text), 1.24 KiB.

[3 May 2007 8:57] Axel Schwenke
SQL statements to demonstrate utf8 behaviour

Attachment: testcase.utf8 (application/octet-stream, text), 1.24 KiB.

[10 May 2007 12:55] Sergei Golubchik
Thank you for taking the time to write to us, but this is not a bug. Please double-check the documentation available at http://dev.mysql.com/doc/.

The Default Unicode Collation Element Table ( http://www.unicode.org/Public/UCA/latest/allkeys.txt) of the Unicode Collation Algorithm (http://www.unicode.org/reports/tr10/) contains the following lines:

002C  ; [*0232.0020.0002.002C] # COMMA
002D  ; [*0222.0020.0002.002D] # HYPHEN-MINUS

As you can see,  the COMMA character (ASCII 0x2c, ',') has higher weight that HYPHEN-MINUS character (ASCII 0x2D, '-')
[29 May 2007 12:57] Valeriy Kravchuk
OK, but how that quote from the specification explains this difference, utf8_bin, utf8_general_ci vs. other utf8 collations?

utf8_bin, utf8_general_ci

+----------+
| c1       |
+----------+
| foo bar  | 
| foo, bar | 
| foo-bar  | 
+----------+

other utf8 collations

+----------+
| c1       |
+----------+
| foo bar  | 
| foo-bar  | 
| foo, bar | 
+----------+
[29 May 2007 13:31] Sergei Golubchik
Thank you for taking the time to write to us, but this is not a bug. Please double-check the documentation available at http://dev.mysql.com/doc/ and the instructions on how to report a bug at http://bugs.mysql.com/how-to-report.php

This quote doesn't. Our manual does.
utf8_unicode_* sollations implement [a subset of] UCA - the Unicode Collation Algorithm. See the quote above.

utf8_general_* collations do not implement UCA, the above quote from the standard does not apply to them.
[29 May 2007 15:09] Valeriy Kravchuk
Sorry, but I still do not see anything in http://dev.mysql.com/doc/refman/5.1/en/charset-unicode-sets.html that explains how utf8_bin should work, or how implementation (or no implementation) of UCA affects punctuation characters. So, I still think it is a valid documentation request, at least.
[31 May 2007 21:07] Peter Gulutzan
The manual says utf8_general_ci is a "legacy" collation.
Changing a collation could require index rebuilds,
there are no immediate plans.

The utf8_general_ci was produced before Bar added the
UCA-based collations in 2004. I have old emails with
denigratory quotes about utf8_general_ci ("useless for
Greek and bad for some other languages including German"
"useless for most people" "the flaws in utf8_general_ci
and ucs2_general_ci are noticeable" "utf8_general_ci is
bad for Ukrainian and bad for Serbian"). I sent an email
to dev-public on 2005-02-28, thread "Re: Need advice on bug 8608"
"I guess there are three opinions about the default Unicode collation.
1. utf8_general_ci, ucs2_general_ci
   Advantage: it's current default, it's faster than utf8_unicode_ci.
   Barkov votes for "1.", I believe.
2. utf8_swedish_ci, ucs2_swedish_ci
   Advantage: comparable to latin1_swedish_ci the 8-bit default
   Gulutzan votes for "2.".
3. utf8_unicode_ci, ucs2_unicode_ci
   Advantage: results are more correct than utf8_general_ci's
   Milivojevic votes for "3.", I believe.
I wonder whether anyone else has a strong opinion about this matter."
Nobody responded, so there was no consensus, so we stayed
with the utf8_general_ci default.

One of these days we should try again, I suppose. But it's no server
bug if different collations are different collations, and it's no
documentation bug if the manual doesn't list every Unicode character
for every collation.
[1 Jun 2007 18:37] Paul DuBois
Thank you for your bug report. This issue has been addressed in the documentation. The updated documentation will appear on our website shortly, and will be included in the next release of the relevant products.

As Peter says, collations differ. I will add a paragraph to the charset chapter to indicate that it's best if you test a collation to verify that it sorts as you expect.

      In cases where a character set has multiple collations, it might
      not be clear which collation is most suitable for a given
      application. To avoid choosing the wrong collation, it can be
      helpful to perform some comparisons with representative data
      values to make sure that a given collation sorts values the way
      you expect.