Bug #4512 Wrong Character Set handling
Submitted: 12 Jul 2004 12:30 Modified: 28 Mar 2014 14:13
Reporter: Heinz Doerr
Status: Can't repeat
Category:Connector / J Severity:S2 (Serious)
Version:3.0.14 OS:Any (UCS or UTF enabled environment)
Assigned to: Alexander Soklakov CPU Architecture:Any

[12 Jul 2004 12:30] Heinz Doerr
Description:
The character set conversion (with a 4.0.x db) for a LOAD DATA statement is handled 
differently than for e.g. SELECT statements. The result is that column names in the LOAD 
DATA statement are translated wrongly for all non-ASCII characters, unless the client 
character set happens to be the same as the server's. 
File names and Strings passed into the Connector/J API SHOULD HAVE NOTHING to do 
with the local settings. Strings and char[] in Java are UCS2 and totally independent of 
any local settings (like default locale, file.encoding, ...). Therefore the real translation, 
without taking care of broken and obsolete misuse of chars, should always be 
 
	sqlString.getBytes(charset-of-the-server)  
 
and NOT 
-> sqlString.getBytes()  (the default charset has NO MEANING here, neither for file names 
nor for SQL statements, for Java >= 1.2 !!!) 
 
I strongly believe this should be: 
 
	sqlString.getBytes(servers-character-set)  // this is the correct implementation 
 
or, if that breaks important existing Java 1.1 code, it would be tolerable for me to use 
the "normal" Connector/J translation.  
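
The difference between the two calls can be illustrated with a minimal sketch on a current JDK. The class and method names below are made up for illustration and are not part of Connector/J; the only assumption is that the server charset is known and passed in explicitly.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ExplicitCharsetDemo {
    // Hypothetical helper: encode with an explicitly named charset, so the
    // result does not depend on file.encoding or the default locale.
    static byte[] encodeFor(String sql, Charset serverCharset) {
        return sql.getBytes(serverCharset);
    }

    public static void main(String[] args) {
        String sql = "select `col\u00A7` from t"; // \u00A7 is the section sign
        byte[] latin1 = encodeFor(sql, StandardCharsets.ISO_8859_1);
        byte[] utf8 = encodeFor(sql, StandardCharsets.UTF_8);
        byte[] platform = sql.getBytes(); // depends on the JVM default charset

        System.out.println("latin1 bytes: " + latin1.length);
        System.out.println("utf8 bytes:   " + utf8.length);
        System.out.println("default charset " + Charset.defaultCharset()
                + " matches latin1: " + Arrays.equals(platform, latin1));
    }
}
```

The last line of output varies from machine to machine, which is exactly the dependency on local settings that the report objects to.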
 
By the way - I really like your speed-improved charset translation. I had the same 
experience: providing "own" methods can speed up the translation by quite 
a bit. I've implemented a UTF-8 "performance-improved" translation, too. If you think 
that would be useful for Connector/J, I could provide the code. 
 
For non-LOAD DATA statements, Connector/J does the following translation: 
sqlString.getBytes(the meaningless-local-charset) -> getBytes(charset-of-the-server) 
 
This is in my opinion wrong but not deadly. It basically limits the charset to the local 
machine's charset, e.g. you can't access a special char like the German umlaut Ä 
on a client machine which does not support umlauts, even if the server would support it. 
For interactive stuff - I guess - this is not a big limitation. For programmatic code, like 
database backup and restore, it's still an unnecessary limitation. 
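
The limitation can be sketched as follows. This is not Connector/J's actual code; it only mimics the two-step path described above, with hypothetical method names, using US-ASCII as a stand-in for a local charset that has no umlauts.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DoubleTranslationDemo {
    // Mimics the described path: encode in the local charset first,
    // decode again, then encode for the server.
    static byte[] viaLocalCharset(String sql, Charset local, Charset server) {
        byte[] localBytes = sql.getBytes(local); // unmappable chars are replaced
        String decoded = new String(localBytes, local);
        return decoded.getBytes(server);
    }

    // Encodes directly for the server, bypassing the local charset.
    static byte[] direct(String sql, Charset server) {
        return sql.getBytes(server);
    }

    public static void main(String[] args) {
        String umlaut = "\u00C4"; // German umlaut A
        byte[] lossy = viaLocalCharset(umlaut, StandardCharsets.US_ASCII,
                StandardCharsets.ISO_8859_1);
        byte[] exact = direct(umlaut, StandardCharsets.ISO_8859_1);
        System.out.printf("via local charset: 0x%02X%n", lossy[0] & 0xFF);
        System.out.printf("direct:            0x%02X%n", exact[0] & 0xFF);
    }
}
```

Once the character has been replaced in the first step, the server charset can no longer recover it, even though latin1 itself can represent it.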
 
Actually, file content (bytes) as referenced in LOAD DATA LOCAL would in 
theory need translation - but that's a totally different story. 
 
The bigger problem is that the JDBC connector handles characters the way Sun 
introduced char back in the '90s with Java 1.1. Starting with 1.2 (I guess) this 
getBytes() and new String(bytes) stuff got deprecated, but unfortunately Sun never 
removed these methods from the API. Now we have SW out there which handles 
chars as being unsigned bytes. Which they are NOT. This has not been Java compliant since 
1.2 !!! Connector/J has done a lot to patch this situation, which works sometimes, 
but basically generates quite some trouble if you use char[] and Strings as intended 
by the API from Sun version >= 1.2 . 
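
Why a char cannot be treated as an unsigned byte can be shown with a small sketch. The class and method names are hypothetical and the code is unrelated to the driver's sources; it only contrasts narrowing a char to a byte with encoding it through a charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharIsNotByteDemo {
    // Narrowing each 16-bit char to a byte keeps only its low 8 bits.
    static byte[] truncate(String s) {
        byte[] out = new byte[s.length()];
        for (int i = 0; i < s.length(); i++) {
            out[i] = (byte) s.charAt(i);
        }
        return out;
    }

    // Encoding maps each character to the byte sequence the charset defines.
    static byte[] encode(String s, Charset cs) {
        return s.getBytes(cs);
    }

    public static void main(String[] args) {
        String euro = "\u20AC"; // euro sign, outside the 8-bit range
        System.out.println("truncated bytes: " + truncate(euro).length);
        System.out.println("UTF-8 bytes:     "
                + encode(euro, StandardCharsets.UTF_8).length);
    }
}
```

Truncation silently drops the high byte, so any character above U+00FF ends up as a different character on the wire.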
 

How to repeat:
- 4.0.x MySQL db, latin1 charset 
- Connector/J 3.0.14 on a UTF-8 or UCS2 enabled client (in my case Linux RH9.0 or 
SuSE 9.1) 
- "select `col§` from table_with_strange_column_name;" 
	works as expected; Connector/J converts the '§' to the correct latin1 
	representation 
- the same with LOAD DATA [LOCAL] ... `col§`; 
	will not work, 
	because the '§' char is translated wrongly 
	due to the use of the deprecated String.getBytes() call in MySqlIO.java 
	Actually there is no workaround except patching the connector. 
 
 

Suggested fix:
Just remove the if/else special handling for the LOAD DATA stuff in 
MySqlIO.writeBytes(...). 
As far as I understand, Connector/J does not support JVMs < 1.2 ???
[12 Jul 2004 16:29] Mark Matthews
[snip]
> File names and Strings passed into the Connector/J API SHOULD HAVE NOTHING
> to do with the local settings. Strings and char[] in Java are UCS2 and
> totally independent of any local settings (like default locale,
> file.encoding, ...). Therefore the real translation, without taking care
> of broken and obsolete misuse of chars, should always be
[snip]

Unfortunately, the string passed in LOAD DATA LOCAL INFILE _does_ have something to do with the local encoding in the case when the MySQL client and the server do not have 'matching' character sets. 

When the server's character set doesn't match the client's and the driver transforms the string to bytes, the string sent to the server is corrupted, so the server returns the filename to load to the client in a corrupted state.

The 'default' JVM character set is used in this case because even the LOAD DATA LOCAL INFILE statement doesn't respect character sets, so we can't use an encoding like UCS-2 and send the Java string 'opaquely'. The 'default' is a 'best guess' and works for most situations: the 'default' character set of the JVM almost always allows the filename to be parsed correctly.

If you are going to be 'mixing' character sets (i.e. JVM is different than MySQL server and/or the characters you place in your 'LOAD DATA LOCAL' statement), then we will have to expand the 'bugfix' to let you specify a character set to send LOAD DATA LOCAL queries to the server as a connection property. 
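
A rough sketch of what such an option might look like follows. The property name `exampleLoadDataCharset` and all class and method names are invented for this illustration; they are not existing Connector/J options or APIs. The sketch falls back to the JVM default when the property is absent, matching the 'best guess' behaviour described above.

```java
import java.nio.charset.Charset;
import java.util.Properties;

public class LoadDataCharsetOption {
    // Hypothetical property name, invented for this sketch only.
    static final String PROP = "exampleLoadDataCharset";

    // Use the configured charset if present, else the JVM default.
    static Charset resolve(Properties props) {
        String name = props.getProperty(PROP);
        if (name == null || name.isEmpty()) {
            return Charset.defaultCharset();
        }
        return Charset.forName(name);
    }

    // Encode a LOAD DATA LOCAL statement with the resolved charset.
    static byte[] encodeLoadData(String sql, Properties props) {
        return sql.getBytes(resolve(props));
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty(PROP, "ISO-8859-1");
        System.out.println("charset used: " + resolve(props));
    }
}
```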

[snip]

> The bigger problem is that the JDBC connector handles characters the way
> Sun introduced char back in the '90s with Java 1.1. Starting with 1.2 (I
> guess) this getBytes() and new String(bytes) stuff got deprecated, but
> unfortunately Sun never removed these methods from the API. Now we have SW
> out there which handles chars as being unsigned bytes. Which they are NOT.
> This has not been Java compliant since 1.2 !!! Connector/J has done a lot
> to patch this situation, which works sometimes, but basically generates
> quite some trouble if you use char[] and Strings as intended by the API
> from Sun version >= 1.2 .

Neither String.getBytes() nor new String(byte[]) is deprecated, at least not in any documentation from Sun that I have access to.
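
The deprecation status can also be checked by reflection on a current JDK. This is a minimal sketch with hypothetical helper names; it assumes a JDK in which deprecated members carry the `@Deprecated` annotation, which postdates the JVMs discussed in this report. For contrast it also inspects the four-argument getBytes(int, int, byte[], int) overload, which copies only the low byte of each char.

```java
import java.lang.reflect.Constructor;
import java.lang.reflect.Method;

public class DeprecationCheck {
    // True if the String.getBytes overload with these parameter types
    // carries the @Deprecated annotation.
    static boolean getBytesDeprecated(Class<?>... params) {
        try {
            Method m = String.class.getMethod("getBytes", params);
            return m.isAnnotationPresent(Deprecated.class);
        } catch (NoSuchMethodException e) {
            throw new IllegalStateException(e);
        }
    }

    // True if the String constructor with these parameter types is deprecated.
    static boolean constructorDeprecated(Class<?>... params) {
        try {
            Constructor<String> c = String.class.getConstructor(params);
            return c.isAnnotationPresent(Deprecated.class);
        } catch (NoSuchMethodException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("getBytes(): " + getBytesDeprecated());
        System.out.println("getBytes(int,int,byte[],int): "
                + getBytesDeprecated(int.class, int.class, byte[].class, int.class));
        System.out.println("new String(byte[]): "
                + constructorDeprecated(byte[].class));
    }
}
```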

Could you please clarify your statement "The Connector/J has done a lot to patch this situation which works sometimes, but basically generates quite some trouble if you use char[] and String's as intended  by the API from Sun version >= 1.2" as I'm not sure if this is a comment, or actually part of the bug report.
[28 Mar 2014 14:13] Alexander Soklakov
I am closing this report as "Can't repeat" because there has been no feedback for a long time and the codebase is too old. Please feel free to reopen it if the problem still exists in the current driver.