MySQL Bugs: #7607: Windows-31J / MS932 should be regarded as aliases for "sjis" character set

Bug #7607	Windows-31J / MS932 should be regarded as aliases for "sjis" character set
Submitted:	1 Jan 2005 16:18	Modified:	9 Aug 2005 20:43
Reporter:	Moriyoshi Koizumi	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server	Severity:	S2 (Serious)
Version:	3.0.16	OS:	Any (*)
Assigned to:	Mark Matthews	CPU Architecture:	Any

Description:
In a nutshell, three other encoding names, Windows-31J, Shift_JIS and MS932 are the aliases for "SJIS" encoding. Connector/J should be able to handle those also when it comes to "characterEncoding" property.

How to repeat:
Specify one of "Windows-31J", "Shift_JIS" and "MS932" to the "characterEncoding" property in a JDBC URL instead of "SJIS", and execute a statement "SHOW VARIABLES" to see what value has been set in "character_set_client" or "character_set_connection".

e.g. jdbc:mysql://localhost/blah?characterEncoding=MS932

Suggested fix:
I already made a patch for this issue.

http://www.voltex.jp/patches/mysql-connector-j-3.0.16-sjis-alias-20050102.patch.diff

Please correct me if I wrong, but   "Shift_JIS" is different from "Windows-31J". As far as I understand MS932 (and Windows-31J) contains some OEM extensions, doesn't it?

Yes, you are right in a sense. It might sound so overly simplified that I would like to make the description somewhat shorter.

The name "Shift_JIS" can be used to refer to two different things, and the usage and semantics is often confusing. Its primary meaning is the name of a code set that is designated to represent the superset of several JIS character sets, without OEM extensions. The secondary meaning is the name of an encoding scheme, which specifies how the implementation-specific codes are converted to a sequence of consecutive bytes so they are ready either to transfer through a certain communication channel or to store somewhere persistent.

In the secondary meaning of Shfit_JIS, a decoded result pretty much depends on the original implementation-specific coding system, and is not necessarily the same as another. However, the fact has never cause much disorder in practice because available implementations share the same coding system that is based on the one illustrated in the JIS specification.

Prior to MySQL 4.1.x the name "SJIS" was used in the latter sense, with which any character encoded in Shift_JIS in the primary sense or OEM-extended Windows-31J would be allowed to store in a backend, send queries and receive results.

Starting from MySQL 4.1.x the name "SJIS" continues to have its original meaning when the table encoding is set to "SJIS" and no conversions are performed between the server and the client; that is, at the time when the encoding for the client program, a connection and result sets are all the same.

Nevertheless, that might rather be an exception; there are many other cases where the meaning differs from the initial. When MySQL decides to perform some conversion between the backend and the frontend, in the event the encoding of the specified table is different than that of the client program or whatever belongs to the frontend, data are converted through the system specific iconv(3) facility which takes parameters to be expressed in the primary sense.

Although the behaviour described above is the worst implementation bug brought in the newest MySQL series, Connector-J should be able to deal with any of those situations. To do so, it can then report to the server the preferred character set is "SJIS" while the actual character set (encoding) used in a Caffeinated[TM] client program is MS932.

> To do so, it can then report to the server the preferred character
> set is "SJIS" while the actual character set (encoding) used in a
> Caffeinated[TM] client program is MS932.

In this sentense, I should have said "to do so, it then has to be able to report to the server its preferred character set is "SJIS" while the actual character set (encoding) used in a Caffeinated client program is MS932."

This is fixed for 3.0, 3.1 and 3.2. You should be able to checkout a nightly snapshot of 3.1 or 3.0 this evening after 00:00 GMT at http://downloads.mysql.com/snapshots.php if you want to test before the release of 3.0.17 or 3.1.7.

Thanks for your bug report!

new patch.

Attachment: mysql-connector-j-3.0.17-cp932-20050127.patch.diff.gz (application/x-gzip, text), 709 bytes.

Already mentioned in the other bug report, the initial patch didn't completely covered the entire issue I addressed. Please look through the new patch just submitted. Thanks.

I was actually testing against an older version of 4.1, so with a patch similar in functionality to yours, and MySQL-4.1.9, this now appears to be fixed fully.

Would you mind testing tonight's snapshot build after 00:00 GMT?

No problem. Thank you for looking into it.

It surely worked. But I don't really like this kind of workaround:

    try {
        charsetName = this.indexToCharsetMapping[charsetIndex];

        if ("sjis".equalsIgnoreCase(charsetName)) {
            // Use our encoding so that code pages like Cp932 work
            charsetName = getEncoding();
        }
    } catch (ArrayIndexOutOfBoundsException outOfBoundsEx) {
        throw new SQLException(
            "Unknown character set index for field '" + charsetIndex
            + "' received from server.",
            SQLError.SQL_STATE_GENERAL_ERROR);
    }

I'd rather see my patch integrated to the codebase.

I'm not sure why you don't like the workaround? The only thing it does is use the JVM encoding that was specified by the user, if the server encoding is SJIS. That means if the user didn't specify anything and the driver auto-detected 'sjis', the JVM encoding would be SJIS, if the user specified SJIS, then the JVM encoding used would be 'sjis', if the user specified MS932 then the driver will use the JVM encoding "MS932", etc.

Because there might be another case where the same encoding has double meanings like CP932. Most likely CP949 (EUC_KR) and CP936 (GB2312), each of those are a MS extension of the parenthesised one. I didn't fully confirmed yet...

Here follow more:
CP949 (EUC-KR)
CP949c (EUC-KR)
CP970 (EUC-KR)
IBM949 (EUC-KR)
CP936 (GBK)
CP948 (GB2312 = EUC-CN)
IBM948 (GBK)
CP950 (BIG5)
CP964 (CNS11643)

For further info, please consult:
http://www.ingrid.org/java/i18n/encoding/table.html

However, the _real_ answer is to add those encodings to the server, not keep patching the driver, which is why I'm not real willing to make this more complex in the driver.

It seems you don't quite have a precise picture of how "encodee" and the corresponding encoding names are handled in the server. As I noted above, CP949 is a "superset" of EUC-KR when we see them as character sets while CP949 has exactly the same encoding scheme as EUC-KR within the MySQL implementation (though actually they are different in specification.)

> It seems you don't quite have a precise picture of how "encodee" and the
> corresponding encoding names are handled in the server. As I noted above, CP949
> is a "superset" of EUC-KR when we see them as character sets while CP949 has
> exactly the same encoding scheme as EUC-KR within the MySQL implementation
> (though actually they are different in specification.)

I'm going to defer to the person who actually implemented the server-side of things, as we've already seen issues/bugs related to the fact that even though certain multi-byte character sets are/were _supposed_ to be supersets, they confused the parser. Until I have a clear picture of what is supported and what isn't, I'm hesitant to make _any_ further changes to the JDBC driver, as these changes will also have to be eventually propagated to multiple drivers, and the decisions made need to be correct and consistent.

Okay, let's see...

This is from our charsets developer:

However, for Japanese it is not so simple. 
They are not supersets/subsets for each others. They differ is several
characters. For example, sjis is not a superset for cp932, and
cp932 is not a superset for sjis. And the different characters are
quite widely used, like TILDE and REVERSE SOLIDUS.

I'm not sure about EUC-KR and CP949, need to check if on of them
is a strict superset and need to check Chinese charsets too.

I think the best solution is to implement all contemporary Windows
character sets in MySQL asap, trying to eliminate subsets when it
is possible.