MySQL Bugs: #61139: Introduce a method to handle tables storing data with incorrect encoding

Bug #61139	Introduce a method to handle tables storing data with incorrect encoding
Submitted:	12 May 2011 0:02	Modified:	12 May 2011 1:19
Reporter:	Daniel Webster	Email Updates:
Status:	Not a Bug	Impact on me:	None
Category:	Connector / J	Severity:	S3 (Non-critical)
Version:	5.1.15	OS:	Any
Assigned to:		CPU Architecture:	Any
Tags:	character_encoding

Description:
Our system has a database schema which has tables which are marked as using Latin-1 character encoding. However, the data in those tables is actually UTF-8 data entered by a Perl system.

We cannot change the tables to UTF-8 tables in our MySql DB, due to other organizational/historical impacts.

We are using the 5.1.15 Connector/J jar to retrieve that data, and we need to be able to override the database driver's detection of the character encoding. In effect, we need to be able to tell the database driver "you are detecting the encoding from the table meta-data, but use this encoding instead."

We have patched our version of the connector with a hack to fix this in our system, but it would be great if there was a way to configure the driver such that the patch's hack was not required. Ideally, this override configuration could be done globally, per table, or even per table column.

How to repeat:
1) Create a table in MySql with Latin-1 encoding.

2) Insert UTF-8 data into that table using some method that is not sensitive to character encoding. Make sure to include special characters such as theTrademark, Registered Trademark, and Copyright symbols.

3) Retrieve that data from the table using Connector/J. Notice that the special symbols have been replaced by the ? character.

Suggested fix:
Create a configuration for the driver that allows character-encoding override at every scope (by DB, by table, by schema, by column).

This could be a connection string parameter, or properties file.

This feature already exists connection-wide, see the property "useOldUTF8Behavior" in the docs at http://dev.mysql.com/doc/refman/5.5/en/connector-j-reference-configuration-properties.html

Doing this on any other scope is really tough, as the driver doesn't actually ever fully parse *any* SQL, so it doesn't really know what database, table or column is "in scope".

useOldUTF8Behavior: 	Use the UTF-8 behavior the driver did when communicating with 4.0 and older servers

That is what is in the documentation. Does this mean "override character encoding found in the database"?

Also, this "the driver doesn't actually ever fully parse *any* SQL" makes me think you are talking about the encoding of values in the SQL statement, which would affect the values in SQL where clauses, but I was mostly referring to the character encodings of values returned in a result set... that is, when I do a rs.getString(1); what character encoding is used to turn the bytes into Strings

Daniel,

The issue as I see it as a solution that scopes by database and table name needs to be *complete*, i.e. for reading-and-writing.

In any case, the "old" way is to do exactly what you state. The driver tells the database that it's speaking "latin1", but really sending and reading UTF-8.