Bug #44045 UTF8 TEXT fields are being converted to CP1252
Submitted: 2 Apr 2009 14:17 Modified: 2 Apr 2009 15:16
Reporter: Rob Stone Email Updates:
Status: Not a Bug Impact on me:
None 
Category:Connector / J Severity:S2 (Serious)
Version:5.1.7 (Server version 5.0.34) OS:Windows
Assigned to: CPU Architecture:Any
Tags: CP1252, text, utf8

[2 Apr 2009 14:17] Rob Stone
Description:
We have a database with the following server variables set:
character_set_server=utf8
character_set_system=utf8
character_set_database=utf8
In addition we have created all the tables as UTF8.

Several of the table contain TEXT fields, and we are storing Japanese text (unicode) in these fields. If I try and retrieve a field via JDBC I end up with corrupted data. Tracing through the MySQL source reveals that the TEXT field character sets are being incorrectly selected as 'CP1252' instead of UTF8. If I set the connection property 'useOldUTF8Behavior' to true then the correct text is returned.

Again, tracing the code through I can see that the server is reporting a character set index of 8 for TEXT fields, whereas for CHAR fields it is reporting an index of 33.

How to repeat:
create a utf8 table containing a TEXT field, insert unicode characters into this field and then try and retrieve via JDBC, the characters are corrupted.

Suggested fix:
Setting the 'useOldUTF8Behavior' flag to true fixes the problem, but I'm a bit wary about using 'Old' behavior that looks like it was only intended for use with 4.0 servers.
[2 Apr 2009 14:33] Tonci Grgin
Hi Rob and thanks for your report.

I do not think there's a bug here but rather misconfiguration. What does your character_set_results variable say? What is the exact structure of table in question? Mind you, if c-set is not set fully it will be inherited first from table then from database and so on.

So please attach *full* test case, *with* connection string.
[2 Apr 2009 14:34] Tonci Grgin
Forgot to say about useOldUTF8Behavior:
  Using the UTF-8 Character Encoding - Prior to MySQL server version 4.1, the UTF-8 character encoding was not supported by the server, however the JDBC driver could use it, allowing storage of multiple character sets in latin1 tables on the server.

So probably thing is in table structure...
[2 Apr 2009 14:46] Rob Stone
Hello Tonci,

Here is my table structure:
CREATE TABLE `dlr_accounts_ext_1` (

  `ACCOUNT_ID` varchar(32) character set latin1 NOT NULL default '',

  `TIMESTAMP` datetime NOT NULL,

  `LAST_USER` varchar(32) character set latin1 NOT NULL default '',

  `TEXT_1` text character set latin1 NOT NULL,

  PRIMARY KEY  (`ACCOUNT_ID`),

) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci

The 'character_set_results' variable is reported as 'latin1', however this variable is not one that you can configure on the server. If I set the 'characterSetResults' property then the text is corrupted in exactly the same way.

If I pass the following connection properties in then everything works fine:
useOldUTF8Behavior=true
useUnicode=true
characterEncoding=UTF-8"

Rob
[2 Apr 2009 14:48] Rob Stone
The last line should of course read:
characterEncoding=UTF-8
[2 Apr 2009 14:56] Rob Stone
OK, looking at that I can see that the columns are set as 'character set latin1', we had used an alter table (via MySQL query browser) and had assumed that setting the table character set would also set the column character sets - I'm guessing this isn't the case and will probably (hopefully) explain our problems !
[2 Apr 2009 15:03] Tonci Grgin
Rob, unfortunately I think this is not a bug and here's why:
...
  `TEXT_1` text character set latin1 NOT NULL,   <<<
...
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci

So you are trying to fetch UTF8 stream from table containing latin1 bytes, won't work. Of course, being that field in question is latin1, autodetection in c/J sets character_set_results to latin1 too. This is *exact* case why useOldUTF8Behavior was introduced.

Now I wonder what is "CHARSET=utf8 COLLATE=utf8_unicode_ci" doing on table level as *all* fields are latin1...
[2 Apr 2009 15:16] Rob Stone
Thank you Tonci, it turns out we have been making some very big incorrect assumptions about how the data is stored in the database. We have now figured out how to set the correct character sets on the table and once we re-insert the data it can then be retrieved correctly via JDBC. Do you know why the default charset is set to latin1 rather than utf-8 as this would have saved us so many problems.

Thanks once again
Rob
[3 Apr 2009 5:51] Tonci Grgin
Yes Rob I know. Protocol is done that way, I guess, for historical reasons. So, for example, when connects libmysqlclient say "I R latin1" and does that using binary protocol on the wire. Thus you'll usually see 'SET NAMES something' command issued by any connector just to avoid default latin1 behavior of libmysqlclient.

Glad the problem is solved.