Bug #10529 Irreversible conversion of 0x5C between sjis and other char sets
Submitted: 11 May 2005 7:04 Modified: 16 May 2005 18:14
Reporter: Hartmut Holzgraefe Email Updates:
Status: Not a Bug Impact on me:
None 
Category:MySQL Server Severity:S3 (Non-critical)
Version:4.1.11 OS:Any (any)
Assigned to: Alexander Barkov CPU Architecture:Any

[11 May 2005 7:04] Hartmut Holzgraefe
Description:
With sjis configured in the client side and other char
set than sjis in the server side, 0x5C is correctly stored as 0x5C in a server with any char set.

But if you select the stored data from the server,
0x815F(full-width backslash) is returned instead of 0x5C.

0x5C and 0x815F are diffrent characters.
Irreversible storage and selection like this behaviour seems a bug.

How to repeat:
[How to reproduce]
(1) Create table

CREATE TABLE tbl1(col1 TEXT CHARACTER SET sjis,
col2 TEXT CHARACTER SET ujis,
col3 TEXT CHARACTER SET utf8
);

(2) Set the client char set to sjis and store 0x5C

SET NAMES sjis;
INSERT INTO tbl1 VALUES('\\','\\','\\');

(3) Confirm stored character in hex

SELECT HEX(col1),HEX(col2),HEX(col3) from tbl1;
+-----------+-----------+-----------+
| HEX(col1) | HEX(col2) | HEX(col3) |
+-----------+-----------+-----------+
| 5C | 5C | 5C |
+-----------+-----------+-----------+

No problem so far.

(4) Confirm selected and converted character
(client char set is sjis)

SELECT col1,col2,col3 from tbl1;
+------+------+------+
| col1 | col2 | col3 |
+------+------+------+
| \ | \ | \ |
+------+------+------+

Characters converted from ujis and utf8 turn out
to be "full-width backslash".

(5) Confirm selected and converted character in hex

SELECT HEX(col1) AS sjis,
HEX(CONVERT(col2 USING sjis)) AS ujis,
HEX(CONVERT(col3 USING sjis)) AS utf8 FROM tbl1;
+------+------+------+
| sjis | ujis | utf8 |
+------+------+------+
| 5C | 815F | 815F |
+------+------+------+

0x5C stored in ujis and utf8 columns is displayed as
0x815F after conversion to sjis, instead of 0x5C.
[16 May 2005 18:14] Alexander Barkov
This is about round trip conversion issue we discussed with 
Shuichi, and expected behavior for sjis in MySQL.
 
If you want to convert 0x005C of Unicode back to 0x5C, you
should use cp932.
[20 May 2005 5:30] Shuichi Tamagawa
Hi Hartmut

When 'sjis' characterset is used, MySQL convert the characters based on the rule defined by Unicode Consortium.

For ASCII characters to which 0x5C character belong to:
http://www.unicode.org/Public/MAPPINGS/ISO8859/8859-1.TXT

For JIS0208 characters to which 0x815F character belong to:
http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/JIS0208.TXT

The problem is that both 0x5C and 0x815F is mapped to 0x005C of Unicode. So, when 0x005C is converte back to sjis, MySQL has to chose one of these characters. But there is no 'standard rule' for the conversion from Unicode to sjis. The rule of sjis character set of MySQL choses 0x815F. This is the expected behavior. MySQL could chose 0x5C, but in that case, 0x815C would be 
irreversible.

On the other hand, cp932 has slightly different Unicode conversion rule from sjis in additoin to extended character support. In cp932 characterset, the characters are converted to Unicode based on the rule defined by Microsoft http://www.microsoft.com/globaldev/reference/dbcs/932.mspx . In this rule 0x5C is mapped to 0x005C and 0x815F is mapped to FF3C. So for both of the characters, unicode conversion is reversible.

Hope this helps your understanding.