Bug #7607 Windows-31J / MS932 should be regarded as aliases for "sjis" character set
Submitted: 1 Jan 2005 17:18 Modified: 9 Aug 2005 22:43
Reporter: Moriyoshi Koizumi
Status: Closed
Category:Server Severity:S2 (Serious)
Version:3.0.16 OS:Any (*)
Assigned to: Mark Matthews Target Version:

[1 Jan 2005 17:18] Moriyoshi Koizumi
Description:
In a nutshell, three other encoding names, Windows-31J, Shift_JIS and MS932 are the
aliases for "SJIS" encoding. Connector/J should be able to handle those also when it
comes to "characterEncoding" property.

How to repeat:
Specify one of "Windows-31J", "Shift_JIS" and "MS932" to the "characterEncoding" property
in a JDBC URL instead of "SJIS", and execute a statement "SHOW VARIABLES" to see what
value has been set in "character_set_client" or "character_set_connection".

e.g. jdbc:mysql://localhost/blah?characterEncoding=MS932

Suggested fix:
I already made a patch for this issue.

http://www.voltex.jp/patches/mysql-connector-j-3.0.16-sjis-alias-20050102.patch.diff
[22 Jan 2005 11:30] Aleksey Kishkin
Please correct me if I wrong, but   "Shift_JIS" is different from "Windows-31J". As far as
I understand MS932 (and Windows-31J) contains some OEM extensions, doesn't it?
[22 Jan 2005 13:51] Moriyoshi Koizumi
Yes, you are right in a sense. It might sound so overly simplified that I would like to
make the description somewhat shorter.

The name "Shift_JIS" can be used to refer to two different things, and the usage and
semantics is often confusing. Its primary meaning is the name of a code set that is
designated to represent the superset of several JIS character sets, without OEM
extensions. The secondary meaning is the name of an encoding scheme, which specifies how
the implementation-specific codes are converted to a sequence of consecutive bytes so
they are ready either to transfer through a certain communication channel or to store
somewhere persistent.

In the secondary meaning of Shfit_JIS, a decoded result pretty much depends on the
original implementation-specific coding system, and is not necessarily the same as
another. However, the fact has never cause much disorder in practice because available
implementations share the same coding system that is based on the one illustrated in the
JIS specification.

Prior to MySQL 4.1.x the name "SJIS" was used in the latter sense, with which any
character encoded in Shift_JIS in the primary sense or OEM-extended Windows-31J would be
allowed to store in a backend, send queries and receive results.

Starting from MySQL 4.1.x the name "SJIS" continues to have its original meaning when the
table encoding is set to "SJIS" and no conversions are performed between the server and
the client; that is, at the time when the encoding for the client program, a connection
and result sets are all the same.

Nevertheless, that might rather be an exception; there are many other cases where the
meaning differs from the initial. When MySQL decides to perform some conversion between
the backend and the frontend, in the event the encoding of the specified table is
different than that of the client program or whatever belongs to the frontend, data are
converted through the system specific iconv(3) facility which takes parameters to be
expressed in the primary sense.

Although the behaviour described above is the worst implementation bug brought in the
newest MySQL series, Connector-J should be able to deal with any of those situations. To
do so, it can then report to the server the preferred character set is "SJIS" while the
actual character set (encoding) used in a Caffeinated[TM] client program is MS932.
[22 Jan 2005 14:00] Moriyoshi Koizumi
> To do so, it can then report to the server the preferred character
> set is "SJIS" while the actual character set (encoding) used in a
> Caffeinated[TM] client program is MS932.

In this sentense, I should have said "to do so, it then has to be able to report to the
server its preferred character set is "SJIS" while the actual character set (encoding)
used in a Caffeinated client program is MS932."
[26 Jan 2005 0:04] Mark Matthews
This is fixed for 3.0, 3.1 and 3.2. You should be able to checkout a nightly snapshot of
3.1 or 3.0 this evening after 00:00 GMT at http://downloads.mysql.com/snapshots.php if
you want to test before the release of 3.0.17 or 3.1.7.

Thanks for your bug report!
[27 Jan 2005 15:36] Moriyoshi Koizumi
new patch.

Attachment: mysql-connector-j-3.0.17-cp932-20050127.patch.diff.gz (application/x-gzip, text), 709 bytes.

[27 Jan 2005 15:38] Moriyoshi Koizumi
Already mentioned in the other bug report, the initial patch didn't completely covered the
entire issue I addressed. Please look through the new patch just submitted. Thanks.
[27 Jan 2005 15:48] Mark Matthews
I was actually testing against an older version of 4.1, so with a patch similar in
functionality to yours, and MySQL-4.1.9, this now appears to be fixed fully.

Would you mind testing tonight's snapshot build after 00:00 GMT?
[27 Jan 2005 16:31] Moriyoshi Koizumi
No problem. Thank you for looking into it.
[28 Jan 2005 5:44] Moriyoshi Koizumi
It surely worked. But I don't really like this kind of workaround:

    try {
        charsetName = this.indexToCharsetMapping[charsetIndex];

        if ("sjis".equalsIgnoreCase(charsetName)) {
            // Use our encoding so that code pages like Cp932 work
            charsetName = getEncoding();
        }
    } catch (ArrayIndexOutOfBoundsException outOfBoundsEx) {
        throw new SQLException(
            "Unknown character set index for field '" + charsetIndex
            + "' received from server.",
            SQLError.SQL_STATE_GENERAL_ERROR);
    }

I'd rather see my patch integrated to the codebase.
[28 Jan 2005 14:40] Mark Matthews
I'm not sure why you don't like the workaround? The only thing it does is use the JVM
encoding that was specified by the user, if the server encoding is SJIS. That means if
the user didn't specify anything and the driver auto-detected 'sjis', the JVM encoding
would be SJIS, if the user specified SJIS, then the JVM encoding used would be 'sjis', if
the user specified MS932 then the driver will use the JVM encoding "MS932", etc.
[28 Jan 2005 23:21] Moriyoshi Koizumi
Because there might be another case where the same encoding has double meanings like
CP932. Most likely CP949 (EUC_KR) and CP936 (GB2312), each of those are a MS extension of
the parenthesised one. I didn't fully confirmed yet...
[28 Jan 2005 23:34] Moriyoshi Koizumi
Here follow more:
CP949 (EUC-KR)
CP949c (EUC-KR)
CP970 (EUC-KR)
IBM949 (EUC-KR)
CP936 (GBK)
CP948 (GB2312 = EUC-CN)
IBM948 (GBK)
CP950 (BIG5)
CP964 (CNS11643)

For further info, please consult:
http://www.ingrid.org/java/i18n/encoding/table.html
[29 Jan 2005 0:17] Mark Matthews
However, the _real_ answer is to add those encodings to the server, not keep patching the
driver, which is why I'm not real willing to make this more complex in the driver.
[29 Jan 2005 2:35] Moriyoshi Koizumi
It seems you don't quite have a precise picture of how "encodee" and the corresponding
encoding names are handled in the server. As I noted above, CP949 is a "superset" of
EUC-KR when we see them as character sets while CP949 has exactly the same encoding
scheme as EUC-KR within the MySQL implementation (though actually they are different in
specification.)
[29 Jan 2005 2:46] Mark Matthews
> It seems you don't quite have a precise picture of how "encodee" and the
> corresponding encoding names are handled in the server. As I noted above, CP949
> is a "superset" of EUC-KR when we see them as character sets while CP949 has
> exactly the same encoding scheme as EUC-KR within the MySQL implementation
> (though actually they are different in specification.)

I'm going to defer to the person who actually implemented the server-side of things, as
we've already seen issues/bugs related to the fact that even though certain multi-byte
character sets are/were _supposed_ to be supersets, they confused the parser. Until I
have a clear picture of what is supported and what isn't, I'm hesitant to make _any_
further changes to the JDBC driver, as these changes will also have to be eventually
propagated to multiple drivers, and the decisions made need to be correct and
consistent.
[29 Jan 2005 2:56] Moriyoshi Koizumi
Okay, let's see...
[15 Feb 2005 18:27] Mark Matthews
This is from our charsets developer:

However, for Japanese it is not so simple. 
They are not supersets/subsets for each others. They differ is several
characters. For example, sjis is not a superset for cp932, and
cp932 is not a superset for sjis. And the different characters are
quite widely used, like TILDE and REVERSE SOLIDUS.

I'm not sure about EUC-KR and CP949, need to check if on of them
is a strict superset and need to check Chinese charsets too.

I think the best solution is to implement all contemporary Windows
character sets in MySQL asap, trying to eliminate subsets when it
is possible.