Description:
Server: mysql community server version 5.5.19 and 5.1.46
Connector/J : mysql java driver 3.1.14 and 5.1.18
JDK : Oracle JRockit 1.6.0_29-b11 and Sun Java 1.6.0_22-b04
we create table using charset latin1 and create a varchar column, using it to store raw bytes in latin1 string, we use this method to store multi-bytes string in the past on mysql server 3.x versions, and we can reconstruct the string from bytes in the program.
while we upgrade the server to 5.1 and 5.5 , this method still work prefect for us, but after then we update the driver to 3.1.14 or get the latest version 5.1.18, we can't get the right bytes from the server with connect parameter characterEncoding=latin1.
for example:
we store a multi-bytes string in the latin1 varchar column
the chinese char : 乄
the hex code in GBK: 0x8157
the bytes in latin1: 0x81 0x57
the data stored in the table is correct.
when we query from java, it looks strange.
if we use ResultSet.getBytes(), it will get the right bytes.
but if we use ResultSet.getString(), it will give me wrong string like:
0x3f 0x57 in bytes.
we use new String(rs.getString().getBytes("latin1"),"GBK") to reconstruct the string in the java code.
i look into the source code of the driver , i found that the driver using cp1252 stands for server encoding latin1 or ISO8859_1.
but cp1252 is not the prefect latin1 charset in java.
in the SingleByteCharsetConverter
using charset cp1252 to create the SingleByteCharsetConverter,
will cause 5 bytes wrong.
0x81 will change to 0x3f(?)
0x8d will change to 0x3f(?)
0x8f will change to 0x3f(?)
0x90 will change to 0x3f(?)
0x9d will change to 0x3f(?)
it means that this 5 character are not defined in java cp1252, but they are defined in java latin1/ISO8859_1
so i search in google and found mysql connector/j use cp1252 replace the latin1 from version 3.1.13:
-05-26-06 - Version 3.1.13
...
- - Map "latin1" on MySQL server to CP1252 for MySQL > 4.1.0.
...
i know that's a bad habit to store multi-byte string in latin1 charset, but for historical reasons we still use it. and our framework like iBatis or Hibernate use type String to represent this column.
so we can only use ResultSet.getString() if we don't change our code.
and i don't why the mysql connector/j will use cp1252 to filter the string.
How to repeat:
1. create table with one latin1 charset column
2. insert the record with char '乄' using command line or python api
3. select the result using mysql connector/J and use getString to retrieve the string.
Suggested fix:
don't use cp1252 in java , use latin1 instead.