MySQL Bugs: #44045: UTF8 TEXT fields are being converted to CP1252

Bug #44045	UTF8 TEXT fields are being converted to CP1252
Submitted:	2 Apr 2009 14:17	Modified:	2 Apr 2009 15:16
Reporter:	Rob Stone	Email Updates:
Status:	Not a Bug	Impact on me:	None
Category:	Connector / J	Severity:	S2 (Serious)
Version:	5.1.7 (Server version 5.0.34)	OS:	Windows
Assigned to:		CPU Architecture:	Any
Tags:	CP1252, text, utf8

Description:
We have a database with the following server variables set:
character_set_server=utf8
character_set_system=utf8
character_set_database=utf8
In addition we have created all the tables as UTF8.

Several of the table contain TEXT fields, and we are storing Japanese text (unicode) in these fields. If I try and retrieve a field via JDBC I end up with corrupted data. Tracing through the MySQL source reveals that the TEXT field character sets are being incorrectly selected as 'CP1252' instead of UTF8. If I set the connection property 'useOldUTF8Behavior' to true then the correct text is returned.

Again, tracing the code through I can see that the server is reporting a character set index of 8 for TEXT fields, whereas for CHAR fields it is reporting an index of 33.

How to repeat:
create a utf8 table containing a TEXT field, insert unicode characters into this field and then try and retrieve via JDBC, the characters are corrupted.

Suggested fix:
Setting the 'useOldUTF8Behavior' flag to true fixes the problem, but I'm a bit wary about using 'Old' behavior that looks like it was only intended for use with 4.0 servers.

Hi Rob and thanks for your report.

I do not think there's a bug here but rather misconfiguration. What does your character_set_results variable say? What is the exact structure of table in question? Mind you, if c-set is not set fully it will be inherited first from table then from database and so on.

So please attach *full* test case, *with* connection string.

Forgot to say about useOldUTF8Behavior:
  Using the UTF-8 Character Encoding - Prior to MySQL server version 4.1, the UTF-8 character encoding was not supported by the server, however the JDBC driver could use it, allowing storage of multiple character sets in latin1 tables on the server.

So probably thing is in table structure...

Hello Tonci,

Here is my table structure:
CREATE TABLE `dlr_accounts_ext_1` (

  `ACCOUNT_ID` varchar(32) character set latin1 NOT NULL default '',

  `TIMESTAMP` datetime NOT NULL,

  `LAST_USER` varchar(32) character set latin1 NOT NULL default '',

  `TEXT_1` text character set latin1 NOT NULL,

  PRIMARY KEY  (`ACCOUNT_ID`),

) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci

The 'character_set_results' variable is reported as 'latin1', however this variable is not one that you can configure on the server. If I set the 'characterSetResults' property then the text is corrupted in exactly the same way.

If I pass the following connection properties in then everything works fine:
useOldUTF8Behavior=true
useUnicode=true
characterEncoding=UTF-8"

Rob

The last line should of course read:
characterEncoding=UTF-8

OK, looking at that I can see that the columns are set as 'character set latin1', we had used an alter table (via MySQL query browser) and had assumed that setting the table character set would also set the column character sets - I'm guessing this isn't the case and will probably (hopefully) explain our problems !

Rob, unfortunately I think this is not a bug and here's why:
...
  `TEXT_1` text character set latin1 NOT NULL,   <<<
...
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci

So you are trying to fetch UTF8 stream from table containing latin1 bytes, won't work. Of course, being that field in question is latin1, autodetection in c/J sets character_set_results to latin1 too. This is *exact* case why useOldUTF8Behavior was introduced.

Now I wonder what is "CHARSET=utf8 COLLATE=utf8_unicode_ci" doing on table level as *all* fields are latin1...

Thank you Tonci, it turns out we have been making some very big incorrect assumptions about how the data is stored in the database. We have now figured out how to set the correct character sets on the table and once we re-insert the data it can then be retrieved correctly via JDBC. Do you know why the default charset is set to latin1 rather than utf-8 as this would have saved us so many problems.

Thanks once again
Rob

Yes Rob I know. Protocol is done that way, I guess, for historical reasons. So, for example, when connects libmysqlclient say "I R latin1" and does that using binary protocol on the wire. Thus you'll usually see 'SET NAMES something' command issued by any connector just to avoid default latin1 behavior of libmysqlclient.

Glad the problem is solved.