Bug #36780 | incorrect utf8 behavior with mysql 5.1.22 and jdbc driver 5.1.5 | ||
---|---|---|---|
Submitted: | 17 May 2008 21:25 | Modified: | 20 Jul 2009 9:52 |
Reporter: | Chris Kiernan | Email Updates: | |
Status: | Not a Bug | Impact on me: | |
Category: | Connector / J | Severity: | S2 (Serious) |
Version: | 5.1.5 | OS: | Any |
Assigned to: | | CPU Architecture: | Any |
[17 May 2008 21:25]
Chris Kiernan
[17 May 2008 21:26]
Chris Kiernan
test / profiler output
Attachment: unicode-test.log (application/octet-stream, text), 16.52 KiB.
[17 May 2008 21:26]
Chris Kiernan
server query log output
Attachment: unicode-db.log (application/octet-stream, text), 2.86 KiB.
[17 May 2008 21:28]
Chris Kiernan
java test file. You'll probably have to mod this to get your connection
Attachment: UnicodeTest.java (text/plain), 2.86 KiB.
[17 May 2008 21:31]
Chris Kiernan
also tried with connector version 5.1.6, same behavior.
[18 May 2008 1:43]
Mark Matthews
The JDBC driver uses the UTF-8 encoding built into Java. Do you happen to know if any of the characters in that string use more than 3 bytes in UTF-8? If so, MySQL can't handle them correctly.
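Mark's question amounts to asking whether the string contains any code points outside the Basic Multilingual Plane (those above U+FFFF are the ones that need 4 bytes in UTF-8). A minimal Java sketch of that check, independent of the driver:

```java
public class Utf8WidthCheck {
    /**
     * Returns true if any code point in s needs 4 bytes in UTF-8,
     * i.e. lies outside the BMP (above U+FFFF). Surrogate pairs are
     * combined into single code points by codePoints().
     */
    public static boolean hasFourByteUtf8(String s) {
        return s.codePoints().anyMatch(cp -> cp > 0xFFFF);
    }

    public static void main(String[] args) {
        // CJK characters like these are BMP: 3 bytes each in UTF-8
        System.out.println(hasFourByteUtf8("不一样的星期五")); // false
        // U+1F600 (an emoji, written as a surrogate pair) needs 4 bytes
        System.out.println(hasFourByteUtf8("\uD83D\uDE00"));   // true
    }
}
```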
[18 May 2008 2:51]
Chris Kiernan
wow, huh, I hadn't thought about that. But it doesn't look like that's the case. I'll enclose a little beanshell script that shows the bytes that Java generates for each character in the string in UTF-8, at least. That's the best way I know to check for something like that. Here's the output of the script:

不一样的星期五: length: 21
[0] char: ? length: 3: bytes: [-30, -128, -80]
[1] char: ? length: 3: bytes: [-30, -120, -113]
[2] char: ? length: 2: bytes: [-61, -89]
[3] char: ? length: 3: bytes: [-30, -128, -80]
[4] char: ? length: 3: bytes: [-30, -120, -113]
[5] char: ? length: 2: bytes: [-61, -124]
[6] char: ? length: 2: bytes: [-61, -118]
[7] char: ? length: 3: bytes: [-30, -128, -96]
[8] char: ? length: 3: bytes: [-30, -120, -111]
[9] char: ? length: 2: bytes: [-61, -127]
[10] char: ? length: 2: bytes: [-61, -74]
[11] char: ? length: 2: bytes: [-61, -111]
[12] char: ? length: 2: bytes: [-61, -118]
[13] char: ? length: 2: bytes: [-61, -78]
[14] char: ? length: 2: bytes: [-61, -68]
[15] char: ? length: 2: bytes: [-61, -118]
[16] char: ? length: 2: bytes: [-61, -70]
[17] char: ? length: 2: bytes: [-61, -68]
[18] char: ? length: 3: bytes: [-30, -128, -80]
[19] char: ? length: 3: bytes: [-30, -120, -85]
[20] char: ? length: 2: bytes: [-61, -82]

Also, I did some more testing and wrote a perl script which seems to handle these characters correctly. So the full test plan would be:

1. Run the java test I enclosed earlier, which inserts a row with test_id 100.
2. Log into the mysql command line and do:
   set names utf8;
   insert into test_model (test_id, name) values (101, '不一样的星期五');
3. Run the perl script, which inserts a row with test_id 102, then prints out all the rows in the table. 100 and 102 look fine from perl.
4. Log back into the mysql command line, do:
   select * from test_model where test_id in (100, 101, 102);
   and you get:

+---------+------------------------------------------+------+
| test_id | name                                     | len  |
+---------+------------------------------------------+------+
|     100 | ‰∏ç‰∏Äʆ∑ÁöÑÊòüÊúü‰∫î                   |   50 |
|     101 | 不一样的星期五                           |   21 |
|     102 | 不一样的星期五                           |   21 |
+---------+------------------------------------------+------+
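The per-character byte dump above can be reproduced in plain Java without beanshell (a sketch; it encodes each char individually, so it would split surrogate pairs, which is fine for this BMP-only string):

```java
import java.nio.charset.StandardCharsets;

public class CharBytes {
    /** Prints each char of s with its UTF-8 byte sequence, like the beanshell script. */
    public static void dump(String s) {
        System.out.println(s + ": length: " + s.length());
        for (int i = 0; i < s.length(); i++) {
            byte[] b = String.valueOf(s.charAt(i)).getBytes(StandardCharsets.UTF_8);
            StringBuilder bytes = new StringBuilder();
            for (byte x : b) {
                if (bytes.length() > 0) bytes.append(", ");
                bytes.append(x); // Java bytes are signed, hence values like -30
            }
            System.out.println("[" + i + "] char: " + s.charAt(i)
                    + " length: " + b.length + ": bytes: [" + bytes + "]");
        }
    }

    public static void main(String[] args) {
        dump("不一样的星期五"); // 7 chars, 3 UTF-8 bytes each
    }
}
```

Note that 不 (U+4E0D) encodes as [-28, -72, -115] (0xE4 0xB8 0x8D); the dump above showing 21 two- and three-byte sequences is what the string looks like after it has already been mangled once.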
[18 May 2008 2:52]
Chris Kiernan
beanshell script to show character bytes
Attachment: unicode_test.bsh (application/octet-stream, text), 401 bytes.
[18 May 2008 2:52]
Chris Kiernan
perl script which inserts characters correctly
Attachment: test.pl (text/plain), 647 bytes.
[18 May 2008 2:53]
Chris Kiernan
oops, that very last sql statement should have been:

select test_id, name, length(name) as len from test_model where test_id in (100, 101, 102);
[18 May 2008 17:22]
Peter Laursen
Just a comment: "The result I get from the mysql command line client is '不一样的星期五'." It matters a great deal what platform you are running the command line client on. Do not expect that to print as CJK characters on Windows.
[18 May 2008 17:25]
Peter Laursen
Try Query Browser or (better!) SQLyog for a client interface that supports the full unicode range client-side, if the client machine is Windows!
[19 May 2008 3:47]
Chris Kiernan
I'm testing on a Mac using the mysql command line client in the Terminal app, which supports UTF-8 without trouble. I also tried the MySQL Query Browser; it produced the same results as the command line client, i.e. the Java insert seems wrong. Your result is the same wrong result I get when I insert from Java. When I insert from perl or the mysql command line client I get the correct result, as shown in the above logs and examples.
[19 May 2008 8:54]
Tonci Grgin
Hi Chris, and thanks for your report. I am not a Java nor a charset guru, but my job is to be the first one checking on connectors bug reports. It is my opinion that there is no bug here, and here's why.

In your how-to I see this:

- Server variables: character_set_results: << character_set_server: latin1 <<

Then you set the session charset in the command line client:

2. log into mysql command line and do: set names utf8;

which is something C/J doesn't do (and setting the session encoding this way is the no. 1 charset mistake)! Let's look at the general query log C/J produces:

080512 10:08:32
 7 Connect root@localhost on test
 7 Query /* @MYSQL_CJ_FULL_PROD_NAME@ ( Revision: @MYSQL_CJ_REVISION@ ) */SHOW VARIABLES WHERE Variable_name ='language' OR Variable_name = 'net_write_timeout' OR Variable_name = 'interactive_timeout' OR Variable_name = 'wait_timeout' OR Variable_name = 'character_set_client' OR Variable_name = 'character_set_connection' OR Variable_name = 'character_set' OR Variable_name = 'character_set_server' OR Variable_name = 'tx_isolation' OR Variable_name = 'transaction_isolation' OR Variable_name = 'character_set_results' OR Variable_name = 'timezone' OR Variable_name = 'time_zone' OR Variable_name = 'system_time_zone' OR Variable_name = 'lower_case_table_names' OR Variable_name = 'max_allowed_packet' OR Variable_name = 'net_buffer_length' OR Variable_name = 'sql_mode' OR Variable_name = 'query_cache_type' OR Variable_name = 'query_cache_size' OR Variable_name = 'init_connect'
 7 Query SHOW COLLATION
 7 Query SET character_set_results = NULL
 7 Query SET autocommit=1
 7 Query SELECT VERSION()

There is absolutely no "SET NAMES..." there. All of the connectors try to

What I believe is happening is that you send UTF-8 bytes which are seen as latin1 characters, and when the field encoding and session encoding mismatch, a conversion is done. In short, it appears to me you're treating data as utf8 without marking it as such.

What I would like you to do is examine the value of "characterEncoding", which is "autodetect" by default, and see if it's correct. Or even force it in the connection string with "characterEncoding=UTF-8" and retest.
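The double-encoding Tonci describes can be reproduced in isolation with standard Java charsets (a sketch of the mechanism, not of the driver's actual code path): UTF-8 bytes mislabeled as latin1 survive the round trip as three latin1 characters per CJK character, and re-encoding them to UTF-8 doubles the byte count.

```java
import java.nio.charset.StandardCharsets;

public class DoubleEncodeDemo {
    public static void main(String[] args) {
        String original = "不一样的星期五";                       // 7 BMP chars
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);  // 3 bytes per char

        // If the session believes the bytes are latin1, each of the 21
        // UTF-8 bytes is decoded as its own latin1 character...
        String mislabeled = new String(utf8, StandardCharsets.ISO_8859_1);

        // ...and storing that into a utf8 column re-encodes every one of
        // those characters (all >= 0x80) as 2 bytes: double-encoded data.
        byte[] doubled = mislabeled.getBytes(StandardCharsets.UTF_8);

        System.out.println(original.length());   // 7
        System.out.println(utf8.length);         // 21
        System.out.println(mislabeled.length()); // 21
        System.out.println(doubled.length);      // 42
    }
}
```

This matches the symptom in the thread: a 7-character string that reads back as ~21 mojibake characters when inserted through the misconfigured path.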
[19 May 2008 8:56]
Tonci Grgin
Please ignore "All of the connectors try to "... I meant to put in a word or two on this but decided not to over-complicate.
[19 May 2008 9:36]
Tonci Grgin
Umm, I see you do use "characterEncoding=UTF-8" but that does not change my opinion, double encoding happens here...
[19 May 2008 12:09]
Chris Kiernan
ok, so what would you like me to test then? I've tried the following above where I do the insert; neither makes a difference:

stmt.executeUpdate("set names utf8");
stmt.executeUpdate("set character_set_results=utf8");

Both will turn off character set results autonegotiation, but my insert still ends up with the incorrect string, i.e. '不一样的星期五'
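For reference, the connection-properties route Tonci suggested is declared in the JDBC URL rather than with SET NAMES statements; a sketch, where host, port, and database name are placeholders:

```
jdbc:mysql://localhost:3306/test?useUnicode=true&characterEncoding=UTF-8
```

Both `useUnicode` and `characterEncoding` are documented Connector/J connection properties; setting them tells the driver which encoding the client side uses, which is the part a server-side SET NAMES cannot communicate to the driver.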
[19 May 2008 12:53]
Tonci Grgin
Chris, we are currently analyzing the possibility that there is a bug here as described.
[19 May 2008 13:34]
Chris Kiernan
oh ok sorry, I saw the needs feedback and thought you wanted something else from me. Thanks Tonci...
[19 May 2008 17:31]
Tonci Grgin
Chris, yes there was a request, thanks for being attentive. But for now, we would first like to check this problem from three different angles that could have caused it.
[20 Jul 2009 9:52]
Susanne Ebrecht
Many thanks for writing a bug report. Unfortunately, this is not a bug. At the moment the MySQL server doesn't support 4-byte UTF-8; this is planned for the MySQL 6.x version. Additionally, Windows has a problem with displaying UTF-8. It seems you ran into this too.