Bug #6549 Connector/J doesn't retrieve some Japanese characters properly
Submitted: 10 Nov 2004 13:38 Modified: 1 Feb 2005 20:25
Reporter: Yoshinori Matsunobu Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server Severity:S4 (Feature request)
Version:3.0.x , 3.1.x OS:
Assigned to: Alexander Barkov CPU Architecture:Any

[10 Nov 2004 13:38] Yoshinori Matsunobu
Description:
Connector/J doesn't retrieve some Japanese characters(sjis) properly.

MySQL version: 4.1.7
Connector/J version: 3.0.x , 3.1.x

This happens when people use non jisx0201/0208 characters 
(such as NEC Row 13,IBM Selected Kanji, IBM Selected Non-Kanji).

Connector/J could always insert them properly.
But at select, it couldn't retrieve them properly (automatically converted to "?").

How to repeat:
Create SJIS table and insert NEC kanji
mysql> create table sjistable (col1 varchar(100) ) character set sjis;
Query OK, 0 rows affected (0.04 sec)
mysql> insert into sjistable values(0x878A);
Query OK, 1 row affected (0.03 sec)
mysql> show create table sjistable;
  `col1` varchar(100) default NULL
) ENGINE=MyISAM DEFAULT CHARSET=sjis 

The following is a sample JDBC-SELECT code.
---
import java.sql.*;
import java.util.*;

public class MySQL41Test1 {

	public static void main(String[] args) throws Exception{
		Connection conn = null;
		Statement stmt = null;
		ResultSet rs = null;
		
		String url = "jdbc:mysql://<hostname>:<port>/<db>";
		String query="SELECT * FROM sjistable";
			
		Properties props = new Properties();
		props.put("user","<username>");
		props.put("password","<password>");
		props.put("useUnicode","true");
	//	props.put("characterEncoding","MS932");
		props.put("mysqlEncoding","sjis");

		Class.forName("com.mysql.jdbc.Driver");
		conn = DriverManager.getConnection(url,props);
		stmt = conn.createStatement();

		rs = stmt.executeQuery(query);
		while(rs.next()){	
			System.out.println(rs.getString(1));
		}
		conn.close();
	}
}

Suggested fix:
When I modified and build souce code of Connector/J like the following, 
it could successfully retrieve non jisx0201/0208 characters.

com.mysql.jdbc.CharsetMapping.java
From: tempMap.put("sjis", "SJIS");
To: tempMap.put("sjis", "MS932");

Sun-JVM has character set mappings of SJIS,MS932 and CP943.
Some Japanese developers like me would like to use not only SJIS, 
but also MS932 and CP943, but in Connector/J, we could use only SJIS,
that's the problem.
We appreciate if we could set SJIS, MS932 or CP943 as 
JVM-side character set by configuration parameters of Connector/J.
[16 Nov 2004 4:41] Mark Matthews
The JDBC driver can only support character sets that the server does. I'm turning this into a feature request for the server team to support the character sets you request.
[14 Dec 2004 19:19] Shuichi Tamagawa
Hi Mark

In this case server returns expected byte sequence to Connecter/J. However, these characters get garbled during the conversion to JVM character set.
The problem is that the mapping is hard-coded and users cannot specify the JVM character set as a configuration parameter. If they can do that, the problem will be solve without having them touch the source code.

Of course supporting cp932 character set on server side, which is compatible with MS932 of JVM, will be the best way to solve the problem. However, it is expected to be added to the 5.0 tree. So for 4.1 users, could you accept this as a request to Connecter/J too?

Thanks
[25 Jan 2005 23:13] Mark Matthews
Fixed for 3.0.17, 3.1.7 and 3.2.1. You can test with a nightly snapshot build before the release of these versions after 00:00 GMT this evening using a snapshot build from http://downloads.mysql.com/snapshots.php.
[26 Jan 2005 2:38] Yoshinori Matsunobu
I have tested the newest snapshot.
Some Japanese characters still get truncated when I select them.

I read the source code of the newest Connector/J.
---
CharsetMapping.java
        ucMap.put("WINDOWS-31J", "sjis");
        ucMap.put("MS932", "sjis");
        ucMap.put("SHIFT_JIS", "sjis");
        ucMap.put("CP943", "sjis");
        JAVA_UC_TO_MYSQL_CHARSET_MAP = Collections.unmodifiableMap(ucMap);
---
As far as I understand, this regards JVM character set
(WINDOWS-31J/MS932/SHIFT_JIS/CP943) as MySQL character set(sjis) when 
I "insert".

When I "select" WINDOWS-31J/MS932/CP943 characters, 
they still get truncated.

Are there any new parameters to select WINDOWS-31J/MS932/CP943 characters properly?

How to repeat:

mysql> create table sjistable (col1 varchar(100) ) character set sjis;
mysql> insert into sjistable values(0x835C);  //standard sjis
mysql> insert into sjistable values(0x878A); //NEC kanji

----
import java.sql.*;
import java.util.*;
import sun.misc.HexDumpEncoder;
import sun.misc.CharacterEncoder;

public class MySQL41Test1 {

	public static void main(String[] args) throws Exception{
		Connection conn = null;
		Statement stmt = null;
		ResultSet rs = null;
		
		String url = "jdbc:mysql://<hostname>:<port>/<db>";

		Properties props = new Properties();
		props.put("user","<username>");
		props.put("password","<password>");
		props.put("characterEncoding","MS932");

		Class.forName("com.mysql.jdbc.Driver");
		conn = DriverManager.getConnection(url,props);
		stmt = conn.createStatement();

		rs = stmt.executeQuery("SELECT * FROM sjistable");
		String ret = rs.getString(1);
		CharacterEncoder encoder =  new HexDumpEncoder();
		encoder.encode(ret.getBytes(),System.out);   // 83 5C

		rs.next();
		ret = rs.getString(1);
		encoder.encode(ret.getBytes(),System.out);   // 3F 3F: NEC kanji got truncated!

		conn.close();
	}
}
[26 Jan 2005 16:59] Mark Matthews
Coming from server -> client, this can't be truly fixed until the server supports the codepages in question, as the driver is relying on the fact that the server is reporting the _true_ character encoding in use (i.e. MS932), right now the server is reporting 'sjis', so the JDBC driver is using the SJIS character converter.

As a workaround, until this is fixed in the server, one could use new String(rs.getBytes(), "MS932").
[26 Jan 2005 20:53] Mark Matthews
This is definitely a server issue as the following test shows (notice that the second row has the NEC SJIS character truncated to 0 bytes). There is no workaround we can use on the client side as far as I know:

mysql> create table testBug7607 (sortCol int, col1 varchar(100) ) character set sjis;
Query OK, 0 rows affected (0.01 sec)

mysql> insert into testBug7607 values(1, 0x835C);  -- standard sjis
Query OK, 1 row affected (0.00 sec)

mysql> insert into testBug7607 values(2, 0x878A); -- NEC kanji
Query OK, 1 row affected, 1 warning (0.00 sec)

mysql> select length(col1) from testBug7607; 
+--------------+
| length(col1) |
+--------------+
|            2 |
|            0 |
+--------------+
[26 Jan 2005 21:23] Moriyoshi Koizumi
Mark, which character set is specified for the table you are testing with? To correctly reproduce the situation, you most likely have to set @@character_set_client, @@character_set_connection and the character set of the table to "sjis".
[26 Jan 2005 21:26] Moriyoshi Koizumi
My bad, I just overlooked "character set sjis" there... Yet I'm wondering what about @@character_set_client and @@character_set_connection.
[26 Jan 2005 21:34] Mark Matthews
Here's the charset setup for both the MySQL client and the JDBC driver for the testcase, so I think the issue is just simply that the server is clobbering the character:

character_set_client = sjis
character_set_connection = sjis

I think this is the case, because the server issues the following warning:

mysql> insert into testBug7607 values(2, 0x878A); -- NEC kanji 
Query OK, 1 row affected, 1 warning (0.00 sec)

mysql> show warnings;
+---------+------+-------------------------------------------+
| Level   | Code | Message                                   |
+---------+------+-------------------------------------------+
| Warning | 1265 | Data truncated for column 'col1' at row 1 |
+---------+------+-------------------------------------------+
1 row in set (0.00 sec)
[27 Jan 2005 1:11] Yoshinori Matsunobu
Mark, which version of MySQL server are you testing ?
As far as I know, that bug was fixed from 4.1.7.
The followings are the result of 4.1.9.

mysql> create table testBug7607 (sortCol int, col1 varchar(100) )
    -> character set sjis;
Query OK, 0 rows affected (0.01 sec)

mysql> show create table testBug7607;
+-------------+-----------------------------------------------------------------
-----------------------------------------------------------------------+
| Table       | Create Table
                                                                       |
+-------------+-----------------------------------------------------------------
-----------------------------------------------------------------------+
| testBug7607 | CREATE TABLE `testBug7607` (
  `sortCol` int(11) default NULL,
  `col1` varchar(100) default NULL
) ENGINE=MyISAM DEFAULT CHARSET=sjis |
+-------------+-----------------------------------------------------------------
-----------------------------------------------------------------------+
1 row in set (0.00 sec)

mysql>  insert into testBug7607 values(1, 0x835C);
Query OK, 1 row affected (0.00 sec)

mysql> insert into testBug7607 values(2, 0x878A);
Query OK, 1 row affected (0.00 sec)

mysql> select length(col1) from testBug7607;
+--------------+
| length(col1) |
+--------------+
|            2 |
|            2 |
+--------------+
2 rows in set (0.00 sec)

mysql> select hex(col1) from testBug7607;
+-----------+
| hex(col1) |
+-----------+
| 835C      |
| 878A      |
+-----------+
2 rows in set (0.00 sec)

mysql> show variables like '%char%';
+--------------------------+-------------------------------------+
| Variable_name            | Value                               |
+--------------------------+-------------------------------------+
| character_set_client     | sjis                                |
| character_set_connection | sjis                                |
| character_set_database   | sjis                                |
| character_set_results    | sjis                                |
| character_set_server     | sjis                                |
| character_set_system     | utf8                                |
| character_sets_dir       | /usr/mysql419/share/mysql/charsets/ |
+--------------------------+-------------------------------------+
7 rows in set (0.00 sec)

Connector/J converts characters twice like this.
<server> -(1)-> <client> -(2)-> <JVM>

(1):Character conversion from "MySQL server character set" to "MySQL client(such as mysql) character set".If both are the same, no character conversion happens.
NEC kanji doesn't get truncated as identified above.

(2):Character conversion from "MySQL client character set" to "JVM character set". 
In the Connector/J source code(CharsetMapping.java), MySQL "sjis" is statically mapped with JVM "SJIS". JVM SJIS doesn't include NEC kanji(MS932 does),so NEC kanji get truncated at (2).
That's why I think this bug was caused by (2).
[27 Jan 2005 14:34] Moriyoshi Koizumi
It eventually turned out to be the case that the patch I posted on bug #7607 wouldn't suffice :(
I made a new one that also takes metadata into consideration.

You can fetch from
http://www.voltex.jp/patches/mysql-connector-j-3.0.17-cp932-20050127.patch.diff.gz
[1 Feb 2005 20:25] Mark Matthews
This is fixed in the nightly snapshot builds of 3.0, as the fix was the same as for BUG#7607 (at least the parts that could be fixed). The JDBC driver still can't support character sets that aren't subsets of character sets that exist on the server. The fix will also be available in 3.1 and 3.2 when the merge happens.

You can test nightly snapshot builds of 3.0 from http://downloads.mysql.com/snapshots.php