Bug #15375 Unassigned multibyte codes are broken into parts when converting to Unicode
Submitted: 1 Dec 2005 7:36
Modified: 11 Apr 2006 13:19
Reporter: Alexander Barkov
Status: Closed
Category: MySQL Server
Severity: S3 (Non-critical)
Version:
OS:
Assigned to: Alexander Barkov
CPU Architecture: Any

[1 Dec 2005 7:36] Alexander Barkov
Description:
The Big5 code 0xC840 has no Unicode mapping,
so it clearly cannot be converted to ucs2.
However, 0xC840 is definitely a double-byte sequence.
It is broken into parts when converted to ucs2:

mysql> select hex(convert(_big5 0xc840 using ucs2));
+---------------------------------------+
| hex(convert(_big5 0xc840 using ucs2)) |
+---------------------------------------+
| 003F0040                              |
+---------------------------------------+
1 row in set (0.00 sec)

I.e. the character set conversion routine encounters an unknown
character, skips only the one byte 0xC8, and then continues
converting from the next byte, 0x40.

How to repeat:
Run the above query

Suggested fix:
Skip both parts of an unknown multibyte sequence,
so the above query returns 0x003F.
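
A minimal sketch of the difference between the current and the suggested
behavior, written in Python for illustration only. It is not the server
code: the function names, the lead/trail byte ranges, and the one-entry
mapping table are assumptions made for this example.

```python
# Illustrative only: a toy Big5 -> UCS2 converter, not MySQL's implementation.

def is_lead(b):
    # Assumed Big5 lead-byte range for this sketch.
    return 0xA1 <= b <= 0xF9

def is_trail(b):
    # Assumed Big5 trail-byte ranges for this sketch.
    return 0x40 <= b <= 0x7E or 0xA1 <= b <= 0xFE

# Toy mapping table with a single entry; every other code is "unassigned".
TOY_MAP = {0xA440: 0x4E00}

def to_ucs2_hex(data, skip_whole_sequence):
    """Convert bytes to a hex string of UCS2 code units.

    skip_whole_sequence=False mimics the reported behavior (consume one byte
    of an unmapped sequence); True mimics the suggested fix (consume both).
    """
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            out.append(b)              # single-byte ASCII passes through
            i += 1
        elif is_lead(b) and i + 1 < len(data) and is_trail(data[i + 1]):
            code = (b << 8) | data[i + 1]
            if code in TOY_MAP:
                out.append(TOY_MAP[code])
                i += 2
            else:
                out.append(0x3F)       # QUESTION MARK for an unmapped code
                i += 2 if skip_whole_sequence else 1
        else:
            out.append(0x3F)           # malformed byte
            i += 1
    return "".join(f"{u:04X}" for u in out)

print(to_ucs2_hex(bytes([0xC8, 0x40]), skip_whole_sequence=False))
print(to_ucs2_hex(bytes([0xC8, 0x40]), skip_whole_sequence=True))
```

With the flag off, the trailing byte 0x40 is re-scanned as a separate
single-byte character, which is the effect shown in the query output above.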
[1 Dec 2005 7:38] Alexander Barkov
An example of the same problem with GBK:

mysql> select hex(convert(_gbk 0xA140 using ucs2));
+--------------------------------------+
| hex(convert(_gbk 0xA140 using ucs2)) |
+--------------------------------------+
| 003F0040                             |
+--------------------------------------+
1 row in set (0.00 sec)
[12 Dec 2005 7:28] Alexander Barkov
See also #15376:

0x8FABF8 is a valid UJIS multibyte sequence (3 bytes) in this format:

[x8F][xA1-xFE][xA1-xFE]

corresponding to JIS-X-0212 code 0x2B78

(i.e. remove the 0x8F introducer, then subtract 0x8080 from 0xABF8).
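
The arithmetic above can be checked with a short Python sketch. The helper
name ujis3_to_jisx0212 is hypothetical and exists only for this example;
the byte ranges are the ones given in the format above.

```python
def ujis3_to_jisx0212(seq):
    """Return the JIS-X-0212 code for a 3-byte UJIS sequence.

    Hypothetical helper: validates [x8F][xA1-xFE][xA1-xFE], drops the 0x8F
    introducer, and subtracts 0x8080 from the remaining two bytes.
    """
    if len(seq) != 3 or seq[0] != 0x8F:
        raise ValueError("not a 3-byte UJIS sequence with the 0x8F introducer")
    if not all(0xA1 <= b <= 0xFE for b in seq[1:]):
        raise ValueError("trailing bytes outside xA1-xFE")
    return ((seq[1] << 8) | seq[2]) - 0x8080

print(hex(ujis3_to_jisx0212(bytes([0x8F, 0xAB, 0xF8]))))  # 0x2b78
```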

When this character is converted to UCS2, the result is 0x0000,
which is wrong.

It is true that this character has no Unicode mapping;
however, the expected result is 0x003F QUESTION MARK,
as is usual for an impossible conversion.
[12 Dec 2005 17:48] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/76
[23 Mar 2006 9:12] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/4055
[23 Mar 2006 16:11] Alexander Barkov
Fixed in 4.1.19, 5.0.20, 5.1.8.
[11 Apr 2006 13:19] Paul DuBois
Noted in 4.1.19, 5.0.20, 5.1.8 changelogs.

During conversion from one character set to
<literal>ucs2</literal>, multi-byte characters with no
<literal>ucs2</literal> equivalent were converted to multiple
characters, rather than to <literal>0x003F QUESTION
MARK</literal>. (Bug #15375)