Bug #42494 latin1 over UTF-8 encoding fail with mysql-connector-java-5.1.7
Submitted: 30 Jan 2009 22:16 Modified: 11 Feb 2009 13:46
Reporter: Wilkin Cheung Email Updates:
Status: Not a Bug Impact on me:
None 
Category:Connector / J Severity:S2 (Serious)
Version:5.0.19 OS:Linux (RedHat 4 Enterprise)
Assigned to: CPU Architecture:Any
Tags: character encoding, driver, jdbc, Latin1, UTF-8

[30 Jan 2009 22:16] Wilkin Cheung
Description:
Introduction

Our code is running Mysql JDBC driver version 3.1.8 (mysql-connection-java-3.1.8). We would like to upgrade the driver to version 5.1.7 (mysql-connection-java-5.1.7) for faster connectivity and bug fixes.

 
Issues

   1. Generate SQL statement in Java 5
   2. convert statement into UTF-8
   3. convert UTF-8 to latin1 (ISO-8859-1)
   4. Mind sends statement over to mysql server in latin1
   5. mysql runs statement in latin1
   6. mysql sends ResultSet to Mind in latin1

This works fine for mysql-connector-java-3.1.8
For example, we throw characters ® © é ™ ½ at the program.
The program transcode the above string to to latin1 representation ® © é ⢠½ (in hex: C2AE C2A9 C3A9 E284A2 C2BD) and then insert these characters into mysql table via driver version 3.1.8. To verify the characters are saved correctly in mysql, i used select hex(columnname) from tablename, this works fine with driver 3.1.8

Repeat the above with mysql-connector-java-5.1.7 (without changing any setting), select hex(columnname) from tablename gives AE A9 E9 99 BD, which is wrong. 

Storage Engine is INNODB
We store all fields in latin1
The field type is text.

JDBC String

jdbc:mysql://127.0.0.1:40747/?user=root&noDatetimeStringSync=true&characterEncoding=latin1

Mysql Variables

 mysql> show variables like 'character%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | latin1                     |
| character_set_connection | latin1                     |
| character_set_database   | latin1                     |
| character_set_filesystem | binary                     |
| character_set_results    | latin1                     |
| character_set_server     | latin1                     |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.00 sec)

Note

* I tried different combinations of JDBC connection string options listed in the following links - but still not fruitful. http://dev.mysql.com/doc/refman/5.1/en/connector-j-reference-configuration-properties.html

 
References

My latin1 reference comes from wikipedia -
http://en.wikipedia.org/wiki/ISO-8859-1

How to repeat:
See Above
[30 Jan 2009 23:14] Wilkin Cheung
The Mind is just a java servlet that runs on Tomcat
[3 Feb 2009 13:19] Susanne Ebrecht
Hello Wilkin,

I fear that this is not a bug. It is just a problem with understanding encoding handling.

First of all you should know:
JDBC always is using UTF8. You can't change this.

Second:
We have had a bug in older MySQL server 5.0 and 5.1 version according to converting utf8 data. So if you use an older MySQL version please update to actual server version (5.0.75 or 5.1.30).

Now let me try to explain you encoding handling:

Usually we are talking about three different kinds of encoding:

Environment encoding: that is the encoding your environment is using. Means your browser, or your shell, or your terminal or whatever.

For using JDBC this should be utf8. Means your browser should be set to utf8. Otherwise you will get gibberish because your browser wasn't able to interpret your keyboard right.

Client encoding: this means in MySQL the variables character_set_client, character_set_result, character_set_connection.

The client encoding should match to your environment encoding. So if your environment encoding is utf8 these values should be utf8 too.
When you are using JDBC then the JDBC driver will handle this for you.
You should not change this variables in your Java code by your own. When you will do it, you will get gibberish.

Database encoding: here is meant the encoding of your database/table/column. In MySQL every single column is able to have a different encoding.

It doesn't matter which encoding the column has as far as you never try to insert/select values that aren't possible in the character set. 
I mean here for example your column is latin1 and you try to get in/out hebrew signs. This won't work because latin1 don't know about hebrew.

So when you column is latin1 and your environment/client is utf8 and you don't use non-latin1 signs then you just can type what you want and the system will fully automatic and transparent convert your input from utf8 into latin1.

Just and example by using the letter ä:
Environment=utf8
client encoding is set to utf8
Column is latin1
You make: insert into t values('ä');

Now you can test with select length() or hex() if the ä is stored right.

Your column is latin1 so length result should be 1. Is the result 2 you have gibberish and either environment or client encoding is set wrong.

For utf8 columns an ä should have length 2. Also here if length is 1 it is gibberish.

The advantage from JDBC is that you just have to make sure that your environment (browser) is using utf8.

JDBC will full automatic and transparent negotiate the client encoding with your mysql system.

What happened at your szenario my guess is: your browser wasn't set to utf8 or Tomcat/apache wasn't set to utf8.
[6 Feb 2009 17:41] Wilkin Cheung
Thanks for the comments - turns out the JDBC connection string characterEncoding should be set to "ISO8859_1". Even though "ISO8859_1" is equivalent to "latin1".

Please close
[11 Feb 2009 13:46] Tonci Grgin
Wilkin, I'm glad problem is solved.