MySQL Bugs: #64071: CP1252 charset in java cause wrong data reading from latin1 tables

Bug #64071	CP1252 charset in java cause wrong data reading from latin1 tables
Submitted:	19 Jan 2012 11:13	Modified:	19 Jan 2012 17:43
Reporter:	Max Gao	Email Updates:
Status:	Not a Bug	Impact on me:	None
Category:	Connector / J	Severity:	S3 (Non-critical)
Version:	> 3.1.13	OS:	Any
Assigned to:		CPU Architecture:	Any
Tags:	cp1252 latin1 charset java

Description:
Server: mysql community server version 5.5.19 and 5.1.46
Connector/J : mysql java driver 3.1.14 and 5.1.18
JDK : Oracle JRockit 1.6.0_29-b11 and Sun Java 1.6.0_22-b04

we create table using charset latin1 and create a varchar column, using it to store raw bytes in latin1 string, we use this method to store multi-bytes string in the past on mysql server 3.x versions, and we can reconstruct the string from bytes in the program.

while we upgrade the server to 5.1 and 5.5 , this method still work prefect for us, but after then we update the driver to 3.1.14 or get the latest version 5.1.18, we can't get the right bytes from the server with connect parameter characterEncoding=latin1.

for example:
we store a multi-bytes string in the latin1 varchar column
the chinese char : 乄
the hex code in GBK: 0x8157
the bytes in latin1: 0x81 0x57

the data stored in the table is correct.
when we query from java, it looks strange.
if we use ResultSet.getBytes(), it will get the right bytes.
but if we use ResultSet.getString(), it will give me wrong string like:
0x3f 0x57 in bytes.
we use new String(rs.getString().getBytes("latin1"),"GBK") to reconstruct the string in the java code.

i look into the source code of the driver , i found that the driver using cp1252 stands for server encoding latin1 or ISO8859_1.
but cp1252 is not the prefect latin1 charset in java.

in the SingleByteCharsetConverter
using charset cp1252 to create the SingleByteCharsetConverter,
will cause 5 bytes wrong.
0x81 will change to 0x3f(?)
0x8d will change to 0x3f(?)
0x8f will change to 0x3f(?)
0x90 will change to 0x3f(?)
0x9d will change to 0x3f(?)

it means that this 5 character are not defined in java cp1252, but they are defined in java latin1/ISO8859_1

so i search in google and found mysql connector/j use cp1252 replace the latin1 from version 3.1.13:

-05-26-06 - Version 3.1.13
...
- - Map "latin1" on MySQL server to CP1252 for MySQL > 4.1.0.
...

i know that's a bad habit to store multi-byte string in latin1 charset, but for historical reasons we still use it. and our framework like iBatis or Hibernate use type String to represent this column.
so we can only use ResultSet.getString() if we don't change our code.

and i don't why the mysql connector/j will use cp1252 to filter the string.

How to repeat:
1. create table with one latin1 charset column

2. insert the record with char '乄' using command line or python api

3. select the result using mysql connector/J and use getString to retrieve the string.

Suggested fix:
don't use cp1252 in java , use latin1 instead.

Thank you for the report.

This is actually not c/J fault that it can not handle such issues. Please convert your data using method described at http://dev.mysql.com/doc/refman/5.0/en/charset-conversion.html, so you can use all features of current character set support.

Neither latin1 nor cp1252 is the universal choice, it depends on the data, which you should check first of all. Anyway, here is a related post about the charset cp1252, this problem can only be solved completely by the jdk-dev team, I guess.

http://lieeil.info/codec-problem-when-using-encoding-cp1252-with-jdbc-to-mysql/