Bug #64831 UTF-8 data interpreted as Latin1
Submitted: 1 Apr 2012 18:57 Modified: 3 Apr 2012 5:49
Reporter: K. Brooke Nelson Email Updates:
Status: Not a Bug Impact on me:
None 
Category:Connector / J Severity:S2 (Serious)
Version:5.2.38 CE OS:Windows (Windows 7 (64-bit))
Assigned to: CPU Architecture:Any

[1 Apr 2012 18:57] K. Brooke Nelson
Description:
I have a Netbeans/Java/MySQL application that I originally wrote in C#/.net/Access. I successfully ported the Access database to MySQL converting it from Unicode to utf-8 in the process. The data has a lot of European characters (umlauts, accents, etc). I can view the data in the MySQL Workbench and verify that the European characters were properly converted. I can also display my data correctly in my Java/Swing app, but when I update the any field that has a European character in it, it gets written incorrectly as a two-character sequence (ex.: ä get converted to ä. I connect to the d/b via a DSN which has utf8 set as the connection charset. Netbeans and my project are configured for utf-8 operation. MySQL is configured with utf-8 as the default charset. My database and every table in it has utf8 as the default charset and collation sequence. Some software entity is converting my utf-8 European characters to Latin1. For example: the Turkish dotless i ‘ı’ is converted to ‘ı’. The curios thing about this is that the UTF-8 hex code for the dotless i is 0xc4b1 which is identical to the Latin1 hex code for ‘ı’, which means that the only difference between the original string and the updated one is how the codes are interpreted by the software. If the field containing the ‘ı’ is updated and retrieved again, it now appears as ‘ı’, since the ‘Ä’ and the ‘±’ have each been erroneously interpreted as Latin1 characters.

How to repeat:
1. Create a simple UTF-8 database containing a text column. Store a non-ASCII character in that field via MySQL Workbench.
2. Create a DSN for the database; click on "Details <<"; and select "utf8" from the "Character Set" combobox on the "Connection" tab.
3. On Windows 7 with Netbeans 7.0.1, start a new Java project.
4. Configure the project for UTF-8 operation by adding " -encoding UTF-8" to "Additional compiler options" found on the "Project Properties" page under "Compiling". 
5. Write a simple Java program to retrieve, display and update the text field containing the non-ASCII character.
6. The text field should display correctly before it has been updated or rewritten to the database. When displayed following the first update, the non-ASCII character will have been replaced by two different non-ASCII characters. After the second update there will be 4 characters, after the third update 8, ad infinitem.

Suggested fix:
I have read that with earlier versions of MySQL this problem could be avoided by prepending every SQL UPDATE and INSERT command with “SET NAMES ‘utf8’;“, but that command is no longer available in version 5.2.38. Perhaps the "“SET NAMES" command needs to be reinstated.
[1 Apr 2012 19:03] Valeriy Kravchuk
Are you really sure that

set names utf8;

does not work any more?
[1 Apr 2012 19:22] K. Brooke Nelson
In response to the question: "Are you really sure that set names utf8; does not work any more?", I always get an exception thrown with the folloing message: "SET NAMES not allowed by driver". My driver string is "com.mysql.jdbc.Driver".
[2 Apr 2012 6:40] Valeriy Kravchuk
So, you use Connector/J (not Workbench) when you get the problem. What exact version of Connector/J do you use? What exact URL do you use?

Check http://dev.mysql.com/doc/refman/5.5/en/connector-j-reference-configuration-properties.html for characterSetResults and other connection properties that may help to get tutf-8 data properly.
[2 Apr 2012 19:02] K. Brooke Nelson
I am using MySQL ODBC 5.1 Driver. I keep my driver and url info in an external properties file:

mysql.url=jdbc:odbc:NelsonFamily_MySQL
mysql.driver=com.mysql.jdbc.Driver

'NelsonFamily_MySQL' is, of course, the name of my DSN. The only non-default setting I have in the DSN is 'Character set: utf8'.

I tried to connect via a normal url, but my queries fail giving me an error message which is a paragraph long. The DSN connection allows all my SQL to work.
[2 Apr 2012 22:07] K. Brooke Nelson
I figured out why I could not use the non-DSN connect string. It was due to the use of refreshRow() in my program which was preventing my queries from working. I comment out the refreshRow statements and everything works fine. I had tried to add options "?CharSet=utf8&useUnicode=true&characterEncoding=utf8" to the DSN version of the connection url, but they caused my program to terminate. When the same options are applied to the non-DSN url, my character corruption problem disappears: I can now read utf-8 data and update it correctly. Thanks for the assistance.
[3 Apr 2012 5:49] Valeriy Kravchuk
Looks like the problem was not related to any bug in MySQL software.