Bug #44045 | UTF8 TEXT fields are being converted to CP1252 | ||
---|---|---|---|
Submitted: | 2 Apr 2009 14:17 | Modified: | 2 Apr 2009 15:16 |
Reporter: | Rob Stone | Email Updates: | |
Status: | Not a Bug | Impact on me: | |
Category: | Connector / J | Severity: | S2 (Serious) |
Version: | 5.1.7 (Server version 5.0.34) | OS: | Windows |
Assigned to: | CPU Architecture: | Any | |
Tags: | CP1252, text, utf8 |
[2 Apr 2009 14:17]
Rob Stone
[2 Apr 2009 14:33]
Tonci Grgin
Hi Rob and thanks for your report. I do not think there's a bug here but rather misconfiguration. What does your character_set_results variable say? What is the exact structure of table in question? Mind you, if c-set is not set fully it will be inherited first from table then from database and so on. So please attach *full* test case, *with* connection string.
[2 Apr 2009 14:34]
Tonci Grgin
Forgot to say about useOldUTF8Behavior: Using the UTF-8 Character Encoding - Prior to MySQL server version 4.1, the UTF-8 character encoding was not supported by the server, however the JDBC driver could use it, allowing storage of multiple character sets in latin1 tables on the server. So probably thing is in table structure...
[2 Apr 2009 14:46]
Rob Stone
Hello Tonci, Here is my table structure: CREATE TABLE `dlr_accounts_ext_1` ( `ACCOUNT_ID` varchar(32) character set latin1 NOT NULL default '', `TIMESTAMP` datetime NOT NULL, `LAST_USER` varchar(32) character set latin1 NOT NULL default '', `TEXT_1` text character set latin1 NOT NULL, PRIMARY KEY (`ACCOUNT_ID`), ) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci The 'character_set_results' variable is reported as 'latin1', however this variable is not one that you can configure on the server. If I set the 'characterSetResults' property then the text is corrupted in exactly the same way. If I pass the following connection properties in then everything works fine: useOldUTF8Behavior=true useUnicode=true characterEncoding=UTF-8" Rob
[2 Apr 2009 14:48]
Rob Stone
The last line should of course read: characterEncoding=UTF-8
[2 Apr 2009 14:56]
Rob Stone
OK, looking at that I can see that the columns are set as 'character set latin1', we had used an alter table (via MySQL query browser) and had assumed that setting the table character set would also set the column character sets - I'm guessing this isn't the case and will probably (hopefully) explain our problems !
[2 Apr 2009 15:03]
Tonci Grgin
Rob, unfortunately I think this is not a bug and here's why: ... `TEXT_1` text character set latin1 NOT NULL, <<< ... ) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci So you are trying to fetch UTF8 stream from table containing latin1 bytes, won't work. Of course, being that field in question is latin1, autodetection in c/J sets character_set_results to latin1 too. This is *exact* case why useOldUTF8Behavior was introduced. Now I wonder what is "CHARSET=utf8 COLLATE=utf8_unicode_ci" doing on table level as *all* fields are latin1...
[2 Apr 2009 15:16]
Rob Stone
Thank you Tonci, it turns out we have been making some very big incorrect assumptions about how the data is stored in the database. We have now figured out how to set the correct character sets on the table and once we re-insert the data it can then be retrieved correctly via JDBC. Do you know why the default charset is set to latin1 rather than utf-8 as this would have saved us so many problems. Thanks once again Rob
[3 Apr 2009 5:51]
Tonci Grgin
Yes Rob I know. Protocol is done that way, I guess, for historical reasons. So, for example, when connects libmysqlclient say "I R latin1" and does that using binary protocol on the wire. Thus you'll usually see 'SET NAMES something' command issued by any connector just to avoid default latin1 behavior of libmysqlclient. Glad the problem is solved.