Bug #33077 Character sets: weight of supplementary characters is not 0xfffd
Submitted: 7 Dec 2007 20:34
Modified: 27 Mar 2008 19:00
Reporter: Peter Gulutzan
Status: Closed
Category: MySQL Server: Charsets
Severity: S3 (Non-critical)
Version: 6.0.5-alpha-debug
OS: Linux (SUSE 10 64-bit)
Assigned to: Alexander Barkov
CPU Architecture: Any

[7 Dec 2007 20:34] Peter Gulutzan
Description:
The specification for the character sets with supplementary characters,
WL#1213 Implement 4-byte UTF8, UTF16 and UTF32,
says
"
Until WL#2673 "Unicode Collation Algorithm new version"
is complete, supplementary characters will all have the
same value for collating weights. For "general"
collations they'll be treated as U+FFFD REPLACEMENT
CHARACTER. For "uca" collations their value for
weighting will be 0xfffd. Thus supplementary characters
will sort higher than BMP characters, and equal to
other supplementary characters.
"
But that is not what happens. For a UCA collation, e.g.
utf32_unicode_ci, I see 0x0dc6, which is the weight for
character U+FFFD. I want the specified weight, 0xfffd.

How to repeat:
mysql> /* The four characters in the INSERT string are
   /*>    00000041  # LATIN CAPITAL LETTER A
   /*>    0001218F # CUNEIFORM SIGN KAB
   /*>    000121A7 # CUNEIFORM SIGN KISH
   /*>    00000042  # LATIN CAPITAL LETTER B
   /*> */
mysql> CREATE TABLE t (s1 CHAR(4) CHARACTER SET utf32 COLLATE utf32_unicode_ci);
Query OK, 0 rows affected (0.01 sec)

mysql> INSERT INTO t VALUES (0x000000410001218f000121a700000042);
Query OK, 1 row affected (0.01 sec)

mysql> SELECT HEX(WEIGHT_STRING(s1)) FROM t;
+------------------------+
| HEX(WEIGHT_STRING(s1)) |
+------------------------+
| 0E330DC60DC60E4A       |
+------------------------+
1 row in set (0.00 sec)

... Wrong. The correct result would be:
+------------------------+
| HEX(WEIGHT_STRING(s1)) |
+------------------------+
| 0E33FFFDFFFD0E4A       |
+------------------------+
1 row in set (0.00 sec)
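
The difference between the two results above can be modeled with a short Python sketch. It is illustrative only and is not the server code: the three weights are the ones quoted in this report, and the names SAMPLE_WEIGHTS, decode_utf32_be and weight_hex are made up for the example.

```python
# Weights quoted in this report for utf32_unicode_ci.
# A real UCA weight table is far larger; this is only a sample.
SAMPLE_WEIGHTS = {
    0x0041: 0x0E33,  # LATIN CAPITAL LETTER A
    0x0042: 0x0E4A,  # LATIN CAPITAL LETTER B
    0xFFFD: 0x0DC6,  # REPLACEMENT CHARACTER
}


def decode_utf32_be(data):
    """Split big-endian UTF-32 bytes into a list of code points."""
    return [int.from_bytes(data[i:i + 4], "big") for i in range(0, len(data), 4)]


def weight_hex(code_points, supplementary_weight):
    """Return the concatenated 16-bit weights as an uppercase hex string.

    Every code point above U+FFFF receives `supplementary_weight`;
    BMP code points are looked up in SAMPLE_WEIGHTS.
    """
    parts = []
    for cp in code_points:
        weight = supplementary_weight if cp > 0xFFFF else SAMPLE_WEIGHTS[cp]
        parts.append(f"{weight:04X}")
    return "".join(parts)


row = bytes.fromhex("000000410001218f000121a700000042")
code_points = decode_utf32_be(row)

# Specified behavior: the weight itself is 0xFFFD.
specified = weight_hex(code_points, 0xFFFD)

# Observed behavior: 0xFFFD is treated as a character and its weight is looked up.
observed = weight_hex(code_points, SAMPLE_WEIGHTS[0xFFFD])

print(specified)  # 0E33FFFDFFFD0E4A
print(observed)   # 0E330DC60DC60E4A
```

In this model the two outputs differ only in whether 0xFFFD is used directly as a weight or first looked up as a character.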
[8 Dec 2007 19:21] Sveta Smirnova
Thank you for the report.

Verified as described.
[17 Mar 2008 13:19] Alexander Barkov
A shorter test case representing the same problem for UTF32:

mysql> select hex(weight_string(_utf32 0x00010000 collate utf32_unicode_ci));
+----------------------------------------------------------------+
| hex(weight_string(_utf32 0x00010000 collate utf32_unicode_ci)) |
+----------------------------------------------------------------+
| 0DC6                                                           |
+----------------------------------------------------------------+
1 row in set (0.00 sec)
[17 Mar 2008 13:21] Alexander Barkov
A simple test demonstrating the same problem for utf16:

mysql> select hex(weight_string(_utf16 0xD800DC00 collate utf16_unicode_ci));
+----------------------------------------------------------------+
| hex(weight_string(_utf16 0xD800DC00 collate utf16_unicode_ci)) |
+----------------------------------------------------------------+
| 0DC6                                                           |
+----------------------------------------------------------------+
1 row in set (0.00 sec)
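
For reference, the utf16 and utf32 test values encode the same supplementary character, U+10000. A minimal Python sketch of the standard surrogate-pair arithmetic shows this; the function name surrogate_pair_to_code_point is made up for the example and nothing here is taken from the server source.

```python
def surrogate_pair_to_code_point(high, low):
    """Combine a UTF-16 high/low surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid UTF-16 surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# The utf16 test value 0xD800DC00, split into its two 16-bit units.
utf16_bytes = bytes.fromhex("D800DC00")
high = int.from_bytes(utf16_bytes[:2], "big")
low = int.from_bytes(utf16_bytes[2:], "big")
from_utf16 = surrogate_pair_to_code_point(high, low)

# The utf32 test value 0x00010000 is the code point itself.
from_utf32 = int.from_bytes(bytes.fromhex("00010000"), "big")

print(hex(from_utf16), hex(from_utf32))  # 0x10000 0x10000
```

Both inputs are the first code point beyond the BMP, so both test cases exercise the same supplementary-character weight path.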
[17 Mar 2008 13:51] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/44107

ChangeSet@1.2606, 2008-03-17 17:45:57+04:00, bar@mysql.com +5 -0
  Bug#33077 Character sets: weight of supplementary characters is not 0xfffd
  Problem: interpretation of 0xFFFC was wrong.
  weight_string returned 0x0DC6 as weight for supplementary
  characters (which is weight for character U+FFFC)
  Fix: return 0xFFFC instead of 0x0DC6.
[18 Mar 2008 3:25] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/44162

ChangeSet@1.2606, 2008-03-18 07:20:22+04:00, bar@mysql.com +5 -0
  Bug#33077 Character sets: weight of supplementary characters is not 0xfffd
  Problem: interpretation of 0xFFFD was wrong.
  weight_string returned 0x0DC6 as weight for supplementary
  characters (which is weight for character U+FFFD)
  Fix: return 0xFFFD instead of 0x0DC6.
[18 Mar 2008 9:23] Alexey Kopytov
http://lists.mysql.com/commits/44162 looks good to me.
[18 Mar 2008 12:59] Alexander Barkov
Pushed into 6.0.5-engines
[27 Mar 2008 17:50] Bugs System
Pushed into 6.0.5-alpha
[27 Mar 2008 19:00] Paul DuBois
Noted in 6.0.5 changelog.

The weight for supplementary Unicode characters should be 0xFFFD, but
the WEIGHT_STRING() function returned 0x0DC6 instead.