MySQL Bugs: #11369: Unicode ucs2_unicode_ci collation isn't following Unicode guideline

Bug #11369	Unicode ucs2_unicode_ci collation isn't following Unicode guideline
Submitted:	15 Jun 2005 23:35	Modified:	22 Jun 2005 8:18
Reporter:	Peter Gulutzan	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server	Severity:	S3 (Non-critical)
Version:	5.0.8-beta-debug	OS:	Any (*)
Assigned to:	Alexander Barkov	CPU Architecture:	Any

Description:
The MySQL Reference Manual says that the utf8_unicode_ci and
ucs2_unicode_ci collations are "according to the Unicode Collation
Algorithm (UCA)" with weights as in http://www.unicode.org/Public/UCA/latest/allkeys.txt.

However, I see some characters that are not ordered according to
the primary weight designated in allkeys.txt. These characters include:
00c6 LATIN CAPITAL LETTER AE (should sort with A)
0141 LATIN CAPITAL LETTER L WITH STROKE (should sort with L)
00d8 LATIN CAPITAL LETTER O WITH STROKE (should sort with O)
00df LATIN SMALL LETTER SHARP S (should sort with S)
I did not check all code points.

How to repeat:
mysql> create table tuca (s1 char(2) character set ucs2);
Query OK, 0 rows affected (0.00 sec)

mysql> insert into tuca values ('L'),('LA'),(0x0141) /* 0141 = L with stroke */;
Query OK, 3 rows affected (0.00 sec)
Records: 3  Duplicates: 0  Warnings: 0

mysql> select s1, hex(s1) from tuca order by s1 collate ucs2_unicode_ci;
+------+----------+
| s1   | hex(s1)  |
+------+----------+
| L    | 004C     |
| LA   | 004C0041 |
| Ł   | 0141     |
+------+----------+
3 rows in set (0.00 sec)

Suggested fix:
Check all code points.

Thank you for taking the time to write to us, but this is not
a bug. Please double-check the documentation available at
http://www.mysql.com/documentation/ and the instructions on
how to report a bug at http://bugs.mysql.com/how-to-report.php

Additional info:

This is not a bug. MySQL UCA-based collations were built
using allkeys.txt from Unicode version 4.0.0. In this version,
LATIN LETTER L WITH STROKE differs from basic LATIN LETTER L
on primary level:

006C  ; [.0F2E.0020.0002.006C] # LATIN SMALL LETTER L
004C  ; [.0F2E.0020.0008.004C] # LATIN CAPITAL LETTER L
0142  ; [.0F36.0020.0002.0142] # LATIN SMALL LETTER L WITH STROKE
0141  ; [.0F36.0020.0008.0141] # LATIN CAPITAL LETTER L WITH STROKE

In later versions Unicode has changed the weights. In 4.1.0 these letters
have the same weight on primary level.

006C  ; [.1077.0020.0002.006C] # LATIN SMALL LETTER L
004C  ; [.1077.0020.0008.004C] # LATIN CAPITAL LETTER L
0142  ; [.1077.0020.0002.006C][.0000.008D.0002.0335] # LATIN SMALL LETTER L WITH STROKE
0141  ; [.1077.0020.0008.004C][.0000.008D.0002.0335] # LATIN CAPITAL LETTER L WITH STROKE

However, we don't change our collations every time Unicode releases
next versions of the standard.

Probably we should mention in the manual that we use Unicode-4.0.0
for our UCA collations.

Thank you for your bug report. This issue has been addressed in the
documentation. The updated documentation will appear on our website
shortly, and will be included in the next release of the relevant
product(s).

Additional info:

This appears to have been fixed in http://dev.mysql.com/doc/mysql/en/charset-unicode-sets.html - marking as Documented/Closed.