MySQL Bugs: #36780: incorrect utf8 behavior with mysql 5.1.22 and jdbc driver 5.1.5

Bug #36780	incorrect utf8 behavior with mysql 5.1.22 and jdbc driver 5.1.5
Submitted:	17 May 2008 21:25	Modified:	20 Jul 2009 9:52
Reporter:	Chris Kiernan	Email Updates:
Status:	Not a Bug	Impact on me:	None
Category:	Connector / J	Severity:	S2 (Serious)
Version:	5.1.5	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
I'm seeing strange behavior with utf-8 characters coming from java into my database.  The jdbc driver can read and write utf-8 strings, however nobody else can read them from the database.  In other words it looks like the jdbc driver is using a string to byte conversion algorithm that works is not compatible with the native database format.

My url parameters are as follows:

rewriteBatchedStatements=true&useServerPrepStmts=false&zeroDateTimeBehavior=round&jdbcCompliantTruncation=true&useUnicode=true&characterEncoding=UTF-8&profileSQL=true

I see this both on a mac and on linux, both at 5.1.22.  I tried this on 5.1.24 on the mac as well, same result.  

The string I'm using from within java is '不一样的星期五'.  The result I get from the mysql command line client is '‰∏ç‰∏ÄÊ†∑ÁöÑÊòüÊúü‰∫î'.  From this latter string, java can evidently decode it forwards and backwards to the former.

How to repeat:
Java Test code as follows (which I'll also include in an enclosure):

String name = "不一样的星期五";
Connection conn = null;
PreparedStatement pstmt = null;
Statement stmt = null;
ResultSet rs = null;
try {
    // swap this for however you need to create your connection
    conn = dataSource.getConnection();
    stmt = conn.createStatement();
    rs = stmt.executeQuery("show session variables like '%char%'");
    System.out.println("----------------------");
    while (rs.next()) {
        System.out.println(rs.getString(1) + ": " + rs.getString(2));
    }
    System.out.println("----------------------");
    stmt.executeUpdate("drop table if exists test_model");
    stmt.executeUpdate("create table test_model ( "
    + "test_id integer not null auto_increment, "
    + "name varchar(32), "
    + "primary key (test_id)"
    + " ) charset=utf8");
        
    pstmt = conn.prepareStatement("insert into test_model (test_id, name) values (?, ?)");
    pstmt.setInt(1, 100);
    pstmt.setString(2, name);
    pstmt.executeUpdate();
    conn.commit();
    rs = stmt.executeQuery("select name from test_model where test_id = 100");
    rs.next();
    String s = rs.getString(1);
    System.out.println("----- name: " + s);
    // works fine
    assertEquals("name", name, s);
} finally {
    try { if (rs != null) rs.close(); } catch (SQLException e) { /* na */ }
    try { if (stmt != null) stmt.close(); } catch (SQLException e) { /* na */ }
    try { if (pstmt != null) pstmt.close(); } catch (SQLException e) { /* na */ }
    try { if (conn != null) conn.close(); } catch (SQLException e) { /* na */ }
}

Output from the test is as follows:

----------------------
character_set_client: utf8
character_set_connection: utf8
character_set_database: utf8
character_set_filesystem: binary
character_set_results: 
character_set_server: latin1
character_set_system: utf8
character_sets_dir: /usr/local/mysql-5.1.22-rc-osx10.4-i686/share/mysql/charsets/
----------------------
----- name: 不一样的星期五

Meaning that java can read and write the data correctly.  However, when I log in via the mysql 
command line client, I see the following:

ck@localhost riskiq_test> set names utf8;
Query OK, 0 rows affected (0.00 sec)

ck@localhost riskiq_test> select * from test_model;
+---------+----------------------------------------------------+
| test_id | name                                               |
+---------+----------------------------------------------------+
|     100 | ‰∏ç‰∏ÄÊ†∑ÁöÑÊòüÊúü‰∫î | 
+---------+----------------------------------------------------+
1 row in set (0.01 sec)

ck@localhost riskiq_test> insert into test_model (test_id, name) values (101, '不一样的星期五');
Query OK, 1 row affected (0.00 sec)

ck@localhost riskiq_test> select * from test_model;
+---------+----------------------------------------------------+
| test_id | name                                               |
+---------+----------------------------------------------------+
|     100 | ‰∏ç‰∏ÄÊ†∑ÁöÑÊòüÊúü‰∫î | 
|     101 | 不一样的星期五                              | 
+---------+----------------------------------------------------+
2 rows in set (0.00 sec)
 
I've enclosed both the profiling output and the mysql query log, which report that the jdbc driver 
is apparently sending the incorrect bytes back to the server.  Witness (from the profiling output)

Sat May 17 14:00:05 PDT 2008 INFO: Profiler Event: [QUERY] 	at org.apache.commons.dbcp.DelegatingPreparedStatement.executeUpdate(DelegatingPreparedStatement.java:102) duration: 1 ms, connection-id: 689, statement-id: 8, resultset-id: 0, message: insert into test_model (test_id, name) values (100, '‰∏ç‰∏ÄÊ†∑ÁöÑÊòüÊúü‰∫î')

Also from the query log:

		  689 Query	insert into test_model (test_id, name) values (100, '‰∏ç‰∏ÄÊ†∑ÁöÑÊòüÊúü‰∫î')
		  
Here's what the query log looks like from my correct insert above via the command line client:

080517 14:01:29  677 Query      insert into test_model (test_id, name) values (101, '不一样的星期五')

test / profiler output

Attachment: unicode-test.log (application/octet-stream, text), 16.52 KiB.

server query log output

Attachment: unicode-db.log (application/octet-stream, text), 2.86 KiB.

java test file. You'll probably have to mod this to get your connection

Attachment: UnicodeTest.java (text/plain), 2.86 KiB.

also tried with connector version 5.1.6, same beahvior.

The JDBC driver uses the UTF-8 encoding built in to Java. Do you happen to know if any of the characters in that string use more than 3 bytes in UTF-8? If so, MySQL can't handle them correctly.

wow huh I hadn't thought about that.  But it doesn't look like the case.  I'll enclose a little beanshell script that shows the bytes that java generates for each character in the string UTF-8, at least.  That's the best way I know how to check for something like that.  

Here's the output of the script:
不一样的星期五: length: 21
[0] char: ? length: 3: bytes: [-30, -128, -80]
[1] char: ? length: 3: bytes: [-30, -120, -113]
[2] char: ? length: 2: bytes: [-61, -89]
[3] char: ? length: 3: bytes: [-30, -128, -80]
[4] char: ? length: 3: bytes: [-30, -120, -113]
[5] char: ? length: 2: bytes: [-61, -124]
[6] char: ? length: 2: bytes: [-61, -118]
[7] char: ? length: 3: bytes: [-30, -128, -96]
[8] char: ? length: 3: bytes: [-30, -120, -111]
[9] char: ? length: 2: bytes: [-61, -127]
[10] char: ? length: 2: bytes: [-61, -74]
[11] char: ? length: 2: bytes: [-61, -111]
[12] char: ? length: 2: bytes: [-61, -118]
[13] char: ? length: 2: bytes: [-61, -78]
[14] char: ? length: 2: bytes: [-61, -68]
[15] char: ? length: 2: bytes: [-61, -118]
[16] char: ? length: 2: bytes: [-61, -70]
[17] char: ? length: 2: bytes: [-61, -68]
[18] char: ? length: 3: bytes: [-30, -128, -80]
[19] char: ? length: 3: bytes: [-30, -120, -85]
[20] char: ? length: 2: bytes: [-61, -82]

Also I did some more testing and wrote a perl script which seems to handle these characters correctly.  

So the full test plan would be:

1. run the java test I enclosed earlier, which inserts row with test_id 100
2. log into mysql command line and do: 
   set names utf8;
   insert into test_model (test_id, name) values (101, '不一样的星期五');
3. run the perl script, which inserts row with test_id 102, then prints out all the rows in the table.  100 and 102 look fine from perl.
4. log back into the mysql command line, and do:
   select * from test_model where test_id in (100, 101, 102);

and you get

+---------+----------------------------------------------------+------+
| test_id | name                                               | len  |
+---------+----------------------------------------------------+------+
|     100 | ‰∏ç‰∏ÄÊ†∑ÁöÑÊòüÊúü‰∫î |   50 | 
|     101 | 不一样的星期五                              |   21 | 
|     102 | 不一样的星期五                              |   21 | 
+---------+----------------------------------------------------+------+

beanshell script to show character bytes

Attachment: unicode_test.bsh (application/octet-stream, text), 401 bytes.

perl script which inserts characters correctly

Attachment: test.pl (text/plain), 647 bytes.

oops, that very last sql statement should have been:

select test_id, name, length(name) as len from test_model where test_id in (100, 101, 102);

Just a comment:

"The result I get from the mysql command line client is '‰∏ç‰∏ÄÊ†∑ÁöÑÊòüÊúü‰∫î'."

It is rather important on what platform you have comman line client running. Do not expect that to print as CJK characters in Windows.

try Query Browser or (better!) SQLyog to have a client interface that supports the full unicode range client side if client machine is Windows!

I'm testing on a mac using mysql command line client with the terminal app, which supports UTF-8 without trouble.  I also tried the mysql query browser- it produced the same results as the command line client- i.e. java seems wrong.  Your result is the same wrong result as I get when I insert from java.  When I insert from perl or the mysql command line client I get the correct result, as shown in the above logs and examples.

Hi Chris and thanks for your report.

I am not a Java nor charset guru but my job is to be first one checking on connectors bug reports. Now it is my opinion that there is no bug here and here's why. In on your how-to I see this:
 - Server variables:
character_set_results:              <<
character_set_server: latin1    <<
Then you set session cset in cl client:
2. log into mysql command line and do: 
   set names utf8;
which is something c/J doesn't do (and setting session encoding this way is no. 1 charset mistake)! Let's look at general query log c/J produces:
080512 10:08:32	      7 Connect     root@localhost on test
		      7 Query       /* @MYSQL_CJ_FULL_PROD_NAME@ ( Revision: @MYSQL_CJ_REVISION@ ) */SHOW VARIABLES WHERE Variable_name ='language' OR Variable_name = 'net_write_timeout' OR Variable_name = 'interactive_timeout' OR Variable_name = 'wait_timeout' OR Variable_name = 'character_set_client' OR Variable_name = 'character_set_connection' OR Variable_name = 'character_set' OR Variable_name = 'character_set_server' OR Variable_name = 'tx_isolation' OR Variable_name = 'transaction_isolation' OR Variable_name = 'character_set_results' OR Variable_name = 'timezone' OR Variable_name = 'time_zone' OR Variable_name = 'system_time_zone' OR Variable_name = 'lower_case_table_names' OR Variable_name = 'max_allowed_packet' OR Variable_name = 'net_buffer_length' OR Variable_name = 'sql_mode' OR Variable_name = 'query_cache_type' OR Variable_name = 'query_cache_size' OR Variable_name = 'init_connect'
		      7 Query       SHOW COLLATION
		      7 Query       SET character_set_results = NULL
		      7 Query       SET autocommit=1
		      7 Query       SELECT VERSION()
There is absolutely no "SET NAMES..." there. All of the connectors try to 

Now what I believe is happening is that you send UTF8 bytes which are seen as latin1 characters and when field encoding and session encoding mismatch, a conversion is done. In short, it appears to me you're treating data as utf8 without marking it as such.

What I would like you to do is to examine value of "characterEncoding", which is "autodetect" by default, and see if it's correct. Or even force it in connection string with "characterEncodingUTF-8" and retest.

Please ignore "All of the connectors try to "... I meant to put in a word or two on this but decided not to over-complicate.

Umm, I see you do use "characterEncoding=UTF-8" but that does not change my opinion, double encoding happens here...

ok, so what would you like me to test then?  I've tried the following above where I do the insert- neither makes a difference:

stmt.executeUpdate("set names utf8");
stmt.executeUpdate("set character_set_results=utf8");

Both will turn off character set results autonegotiation, but my insert still ends up with the incorrect string, i.e. '‰∏ç‰∏ÄÊ†∑ÁöÑÊòüÊúü‰∫î'

Chris, we are currently analyzing possibility that there is a bug here like described.

oh ok sorry, I saw the needs feedback and thought you wanted something else from me.  Thanks Tonci...

Chris, yes there was a request, thanks for being attentive. But for now, we would first like to check this problem from three different angles that could have caused it.

Many thanks for writing a bug report.

Unfortunately, this is not a bug. At the moment MySQL server don't support 4-byte UTF8. This is planned for MySQL 6.x version.

Additionally, Windows has a problem with displaying UTF8. Seems you run into this too.