Bug #11369 Unicode ucs2_unicode_ci collation isn't following Unicode guideline
Submitted: 15 Jun 2005 23:35 Modified: 22 Jun 2005 8:18
Reporter: Peter Gulutzan Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server Severity:S3 (Non-critical)
Version:5.0.8-beta-debug OS:Any (*)
Assigned to: Alexander Barkov

[15 Jun 2005 23:35] Peter Gulutzan
Description:
The MySQL Reference Manual says that the utf8_unicode_ci and
ucs2_unicode_ci collations are "according to the Unicode Collation
Algorithm (UCA)" with weights as in http://www.unicode.org/Public/UCA/latest/allkeys.txt.

However, I see some characters that are not ordered according to
the primary weight designated in allkeys.txt. These characters include:
00c6 LATIN CAPITAL LETTER AE (should sort with A)
0141 LATIN CAPITAL LETTER L WITH STROKE (should sort with L)
00d8 LATIN CAPITAL LETTER O WITH STROKE (should sort with O)
00df LATIN SMALL LETTER SHARP S (should sort with S)
I did not check all code points.

How to repeat:
mysql> create table tuca (s1 char(2) character set ucs2);
Query OK, 0 rows affected (0.00 sec)

mysql> insert into tuca values ('L'),('LA'),(0x0141) /* 0141 = L with stroke */;
Query OK, 3 rows affected (0.00 sec)
Records: 3  Duplicates: 0  Warnings: 0

mysql> select s1, hex(s1) from tuca order by s1 collate ucs2_unicode_ci;
+------+----------+
| s1   | hex(s1)  |
+------+----------+
| L    | 004C     |
| LA   | 004C0041 |
| Ł   | 0141     |
+------+----------+
3 rows in set (0.00 sec)

Suggested fix:
Check all code points.
[20 Jun 2005 10:14] Alexander Barkov
Thank you for taking the time to write to us, but this is not
a bug. Please double-check the documentation available at
http://www.mysql.com/documentation/ and the instructions on
how to report a bug at http://bugs.mysql.com/how-to-report.php

Additional info:

This is not a bug. MySQL UCA-based collations were built
using allkeys.txt from Unicode version 4.0.0. In this version,
LATIN LETTER L WITH STROKE differs from basic LATIN LETTER L
on primary level:

006C  ; [.0F2E.0020.0002.006C] # LATIN SMALL LETTER L
004C  ; [.0F2E.0020.0008.004C] # LATIN CAPITAL LETTER L
0142  ; [.0F36.0020.0002.0142] # LATIN SMALL LETTER L WITH STROKE
0141  ; [.0F36.0020.0008.0141] # LATIN CAPITAL LETTER L WITH STROKE

In later versions Unicode has changed the weights. In 4.1.0 these letters
have the same weight on primary level.

006C  ; [.1077.0020.0002.006C] # LATIN SMALL LETTER L
004C  ; [.1077.0020.0008.004C] # LATIN CAPITAL LETTER L
0142  ; [.1077.0020.0002.006C][.0000.008D.0002.0335] # LATIN SMALL LETTER L WITH STROKE
0141  ; [.1077.0020.0008.004C][.0000.008D.0002.0335] # LATIN CAPITAL LETTER L WITH STROKE

However, we don't change our collations every time Unicode releases
next versions of the standard.

Probably we should mention in the manual that we use Unicode-4.0.0
for our UCA collations.
[22 Jun 2005 8:18] Jon Stephens
Thank you for your bug report. This issue has been addressed in the
documentation. The updated documentation will appear on our website
shortly, and will be included in the next release of the relevant
product(s).

Additional info:

This appears to have been fixed in http://dev.mysql.com/doc/mysql/en/charset-unicode-sets.html - marking as Documented/Closed.