Bug #28216 | sorting multi-word strings using utf8 | ||
---|---|---|---|
Submitted: | 3 May 2007 8:56 | Modified: | 1 Jun 2007 18:37 |
Reporter: | Axel Schwenke | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Server: Documentation | Severity: | S2 (Serious) |
Version: | 5.0.42, 5.1.18 | OS: | Any |
Assigned to: | Paul DuBois | CPU Architecture: | Any |
Tags: | bfsm_2007_05_31, sorting, utf8 |
[3 May 2007 8:56]
Axel Schwenke
[3 May 2007 8:57]
Axel Schwenke
SQL statements to demonstrate latin1 behaviour
Attachment: testcase.latin1 (application/octet-stream, text), 575 bytes.
[3 May 2007 8:57]
Axel Schwenke
SQL statements to demonstrate utf8 behaviour
Attachment: testcase.utf8 (application/octet-stream, text), 1.24 KiB.
[3 May 2007 8:57]
Axel Schwenke
SQL statements to demonstrate utf8 behaviour
Attachment: testcase.utf8 (application/octet-stream, text), 1.24 KiB.
[10 May 2007 12:55]
Sergei Golubchik
Thank you for taking the time to write to us, but this is not a bug. Please double-check the documentation available at http://dev.mysql.com/doc/. The Default Unicode Collation Element Table ( http://www.unicode.org/Public/UCA/latest/allkeys.txt) of the Unicode Collation Algorithm (http://www.unicode.org/reports/tr10/) contains the following lines: 002C ; [*0232.0020.0002.002C] # COMMA 002D ; [*0222.0020.0002.002D] # HYPHEN-MINUS As you can see, the COMMA character (ASCII 0x2c, ',') has higher weight that HYPHEN-MINUS character (ASCII 0x2D, '-')
[29 May 2007 12:57]
Valeriy Kravchuk
OK, but how that quote from the specification explains this difference, utf8_bin, utf8_general_ci vs. other utf8 collations? utf8_bin, utf8_general_ci +----------+ | c1 | +----------+ | foo bar | | foo, bar | | foo-bar | +----------+ other utf8 collations +----------+ | c1 | +----------+ | foo bar | | foo-bar | | foo, bar | +----------+
[29 May 2007 13:31]
Sergei Golubchik
Thank you for taking the time to write to us, but this is not a bug. Please double-check the documentation available at http://dev.mysql.com/doc/ and the instructions on how to report a bug at http://bugs.mysql.com/how-to-report.php This quote doesn't. Our manual does. utf8_unicode_* sollations implement [a subset of] UCA - the Unicode Collation Algorithm. See the quote above. utf8_general_* collations do not implement UCA, the above quote from the standard does not apply to them.
[29 May 2007 15:09]
Valeriy Kravchuk
Sorry, but I still do not see anything in http://dev.mysql.com/doc/refman/5.1/en/charset-unicode-sets.html that explains how utf8_bin should work, or how implementation (or no implementation) of UCA affects punctuation characters. So, I still think it is a valid documentation request, at least.
[31 May 2007 21:07]
Peter Gulutzan
The manual says utf8_general_ci is a "legacy" collation. Changing a collation could require index rebuilds, there are no immediate plans. The utf8_general_ci was produced before Bar added the UCA-based collations in 2004. I have old emails with denigratory quotes about utf8_general_ci ("useless for Greek and bad for some other languages including German" "useless for most people" "the flaws in utf8_general_ci and ucs2_general_ci are noticeable" "utf8_general_ci is bad for Ukrainian and bad for Serbian"). I sent an email to dev-public on 2005-02-28, thread "Re: Need advice on bug 8608" "I guess there are three opinions about the default Unicode collation. 1. utf8_general_ci, ucs2_general_ci Advantage: it's current default, it's faster than utf8_unicode_ci. Barkov votes for "1.", I believe. 2. utf8_swedish_ci, ucs2_swedish_ci Advantage: comparable to latin1_swedish_ci the 8-bit default Gulutzan votes for "2.". 3. utf8_unicode_ci, ucs2_unicode_ci Advantage: results are more correct than utf8_general_ci's Milivojevic votes for "3.", I believe. I wonder whether anyone else has a strong opinion about this matter." Nobody responded, so there was no consensus, so we stayed with the utf8_general_ci default. One of these days we should try again, I suppose. But it's no server bug if different collations are different collations, and it's no documentation bug if the manual doesn't list every Unicode character for every collation.
[1 Jun 2007 18:37]
Paul DuBois
Thank you for your bug report. This issue has been addressed in the documentation. The updated documentation will appear on our website shortly, and will be included in the next release of the relevant products. As Peter says, collations differ. I will add a paragraph to the charset chapter to indicate that it's best if you test a collation to verify that it sorts as you expect. In cases where a character set has multiple collations, it might not be clear which collation is most suitable for a given application. To avoid choosing the wrong collation, it can be helpful to perform some comparisons with representative data values to make sure that a given collation sorts values the way you expect.