MySQL Bugs: #4512: Wrong Character Set handling

Bug #4512	Wrong Character Set handling
Submitted:	12 Jul 2004 12:30	Modified:	28 Mar 2014 14:13
Reporter:	Heinz Doerr
Status:	Can't repeat
Category:	Connector / J	Severity:	S2 (Serious)
Version:	3.0.14	OS:	Any (UCS or UTF enabled environment)
Assigned to:	Alexander Soklakov	CPU Architecture:	Any

Description:
The character set conversion (with a 4.0.x db) for a LOAD DATA statement is handled
differently than for, e.g., SELECT statements. The result: column names in the LOAD
DATA statement are translated incorrectly for all non-ASCII characters unless the
client character set happens to match the server's.
File names and Strings passed into the Connector/J API SHOULD HAVE NOTHING to do
with the local settings. Strings and char[] in Java are UCS2 and totally independent of
any local settings (like default locale, file.encoding, ...). Therefore the real translation,
leaving aside broken and obsolete misuse of chars, should always be

sqlString.getBytes(charset-of-the-server)

and NOT
-> sqlString.getBytes() (the default charset has NO MEANING here, neither for file names
nor for SQL statements, for Java >= 1.2 !!!)

I firmly believe this should be:

sqlString.getBytes(servers-character-set) // this is the correct implementation

or, if that breaks important existing Java 1.1 code, it would be tolerable for me to use
the "normal" Connector/J translation.
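
To make the difference concrete, here is a minimal sketch (not Connector/J code) comparing the bytes produced by an explicit charset with those produced by the JVM default. The class and helper names are made up for illustration, and StandardCharsets assumes Java 7 or later.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative only: class and method names are hypothetical, not part of Connector/J.
final class ExplicitCharsetDemo {

    // Encode SQL text with an explicitly named charset instead of the JVM default.
    static byte[] encodeForServer(String sql, Charset serverCharset) {
        return sql.getBytes(serverCharset);
    }

    // Render bytes as lowercase hex so the encodings can be compared by eye.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String column = "col\u00a7"; // "col" followed by the section sign

        // Explicit charsets give the same bytes on every machine.
        System.out.println("latin1: " + toHex(encodeForServer(column, StandardCharsets.ISO_8859_1)));
        System.out.println("utf-8:  " + toHex(encodeForServer(column, StandardCharsets.UTF_8)));

        // The no-argument getBytes() depends on the JVM default charset.
        System.out.println("default (" + Charset.defaultCharset() + "): " + toHex(column.getBytes()));
    }
}
```

With latin1 the section sign is the single byte a7; with UTF-8 it is the two bytes c2 a7. The third line varies from machine to machine, which is the point of the complaint.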

By the way - I really like your speed-improved charset translation. I have had the same
experience: providing one's "own" methods can speed up the translation considerably.
I've implemented a "performance-improved" UTF-8 translation, too. If you think
that would be useful for Connector/J, I could provide the code.

For non-LOAD DATA statements, Connector/J does the following translation:
sqlString.getBytes(the meaningless-local-charset) -> getBytes(charset-of-the-server)

This is wrong in my opinion, but not deadly. It basically limits the usable characters to the
local machine's charset, e.g. you can't access a special char like the German umlaut Ä
on a client machine which does not support umlauts, even if the server does support it.
For interactive stuff - I guess - this is not a big limitation. For programmatic code, like
database backup and restore, it's still an unnecessary limitation.
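
The limitation can be sketched as follows. This is only a model of the two-step path described above, not the driver's actual code. The names are hypothetical, and US-ASCII stands in for a client charset without umlauts.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative model only: names are hypothetical, not taken from Connector/J.
final class TwoStepTranslationDemo {

    // Two-step path: encode with the local charset first, then re-encode for the server.
    static byte[] viaLocalCharset(String sql, Charset local, Charset server) {
        byte[] localBytes = sql.getBytes(local);            // unmappable chars are replaced here
        String reDecoded = new String(localBytes, local);   // the replacement is now permanent
        return reDecoded.getBytes(server);
    }

    // Direct path: encode once, straight into the server charset.
    static byte[] direct(String sql, Charset server) {
        return sql.getBytes(server);
    }

    public static void main(String[] args) {
        String umlaut = "\u00c4"; // A with diaeresis

        byte[] lossy = viaLocalCharset(umlaut, StandardCharsets.US_ASCII, StandardCharsets.ISO_8859_1);
        byte[] exact = direct(umlaut, StandardCharsets.ISO_8859_1);

        System.out.println("via local charset: 0x" + Integer.toHexString(lossy[0] & 0xff));
        System.out.println("direct:            0x" + Integer.toHexString(exact[0] & 0xff));
    }
}
```

The two-step path yields 0x3f (a question mark), while the direct path yields 0xc4, the latin1 byte for the umlaut, even though the server charset could represent it all along.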

Actually, file content (bytes) as referenced in LOAD DATA LOCAL would in
theory need translation - but that's a totally different story.

The bigger problem is that the JDBC connector handles characters the way Sun
introduced char back in the '90s with Java 1.1. Starting with 1.2 (I guess) this
getBytes() and new String(bytes) stuff got deprecated, but unfortunately Sun never
removed these methods from the API. Now we have software out there which handles
chars as being unsigned bytes. Which they are NOT. This has not been Java compliant since
1.2 !!! The Connector/J has done a lot to patch this situation which works sometimes,
but basically generates quite some trouble if you use char[] and Strings as intended
by the API from Sun version >= 1.2 .

How to repeat:
- 4.0.x MySQL db, latin1 charset
- Connector/J 3.0.14 on a UTF-8 or UCS2 enabled client (in my case Linux RH9.0 or
SuSE 9.1)
- "select `col§` from table_with_strange_column_name;"
works as expected, Connector/J converts the '§' to the correct latin1
representation
- the same with LOAD DATA [LOCAL] ... `col§`;
will not work,
because the '§' char is converted incorrectly
due to the use of the deprecated String.getBytes() call in MySqlIO.java
Actually there is no workaround except patching the connector.
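
What the server ends up seeing in the failing case can be approximated like this. It is a rough sketch under the assumption that the client's default charset is UTF-8 and the server interprets the bytes as latin1. The class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative only: simulates a charset mismatch, does not talk to a real server.
final class LoadDataMismatchDemo {

    // Encode with the client's charset, then decode the same bytes as the server would.
    static String asSeenByServer(String sql, Charset clientCharset, Charset serverCharset) {
        return new String(sql.getBytes(clientCharset), serverCharset);
    }

    public static void main(String[] args) {
        String columnRef = "`col\u00a7`"; // backquoted column name ending in the section sign

        // Mismatch: UTF-8 bytes read as latin1 turn one character into two.
        String garbled = asSeenByServer(columnRef, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);

        // Match: latin1 on both sides round-trips unchanged.
        String intact = asSeenByServer(columnRef, StandardCharsets.ISO_8859_1, StandardCharsets.ISO_8859_1);

        System.out.println("mismatch: length " + garbled.length() + ", equal to original: " + garbled.equals(columnRef));
        System.out.println("match:    length " + intact.length() + ", equal to original: " + intact.equals(columnRef));
    }
}
```

In the mismatched case the column name arrives one character longer than it was sent, so the server cannot find a column with that name.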

Suggested fix:
Just remove the if/else special handling for the LOAD DATA stuff in
MySqlIO.writeBytes(...).
As far as I understand, Connector/J does not support JVMs < 1.2 ???

[snip]
> File names and Strings passed into the Connector/J API SHOULD HAVE NOTHING to do
> with the local settings. Strings and char[] in Java are UCS2 and totally independent of
> any local settings (like default locale, file.encoding, ...). Therefore the real
> translation, leaving aside broken and obsolete misuse of chars, should always be
[snip]

Unfortunately, the string passed in LOAD DATA LOCAL INFILE _does_ have something to do with the local encoding in the case when the MySQL client and the server do not have 'matching' character sets. 

In that case, when the driver transforms the strings to bytes and the server's character set doesn't match the client's, the string sent to the server is corrupted, which causes the server to return the name of the file to load to the client in a corrupted state.

The reason the 'default' JVM character set is used in this case is that even the LOAD DATA LOCAL INFILE statement doesn't respect character sets, so we can't use an encoding like UCS-2 and send the Java string 'opaquely'. The 'default' is a 'best guess' and works for most situations: the 'default' character set of the JVM almost always allows the filename to be parsed correctly.

If you are going to be 'mixing' character sets (i.e. JVM is different than MySQL server and/or the characters you place in your 'LOAD DATA LOCAL' statement), then we will have to expand the 'bugfix' to let you specify a character set to send LOAD DATA LOCAL queries to the server as a connection property. 
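
A connection property of that kind might be resolved roughly as follows. This is only a sketch of the idea. The property name "loadDataLocalEncoding" and the class are invented for illustration and are not an existing Connector/J option.

```java
import java.nio.charset.Charset;
import java.util.Properties;

// Illustrative only: the property name below is hypothetical, not a real Connector/J setting.
final class LoadDataCharsetOption {

    static final String PROP = "loadDataLocalEncoding";

    // Use the configured charset if present, otherwise fall back to the JVM default ("best guess").
    static Charset resolve(Properties connectionProps) {
        String name = connectionProps.getProperty(PROP);
        return name != null ? Charset.forName(name) : Charset.defaultCharset();
    }

    // Encode a LOAD DATA LOCAL statement with the resolved charset.
    static byte[] encode(String sql, Properties connectionProps) {
        return sql.getBytes(resolve(connectionProps));
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty(PROP, "ISO-8859-1");

        byte[] bytes = encode("col\u00a7", props);
        System.out.println("charset: " + resolve(props) + ", bytes: " + bytes.length);
    }
}
```

Falling back to the JVM default keeps the current behaviour for users who never set the property.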

[snip]

> The bigger problem is that the JDBC connector handles characters the way Sun
> introduced char back in the '90s with Java 1.1. Starting with 1.2 (I guess) this
> getBytes() and new String(bytes) stuff got deprecated, but unfortunately Sun never
> removed these methods from the API. Now we have software out there which handles
> chars as being unsigned bytes. Which they are NOT. This has not been Java compliant since
> 1.2 !!! The Connector/J has done a lot to patch this situation which works sometimes,
> but basically generates quite some trouble if you use char[] and Strings as intended
> by the API from Sun version >= 1.2 .

Neither String.getBytes() nor new String(byte[]) is deprecated, at least not in any documentation from Sun that I have access to.

Could you please clarify your statement "The Connector/J has done a lot to patch this situation which works sometimes, but basically generates quite some trouble if you use char[] and Strings as intended by the API from Sun version >= 1.2", as I'm not sure whether this is a comment or actually part of the bug report.

I am closing this report as "Can't repeat" because there has been no feedback for a long time and the codebase is too old. Please feel free to reopen it if the problem still exists in the current driver.