Bug #43819 JDBC - Unsupported unicode character(s)
Submitted: 24 Mar 2009 5:25 Modified: 21 Sep 2009 22:56
Reporter: Phil Cross Email Updates:
Status: No Feedback Impact on me:
None 
Category:MySQL Server: Charsets Severity:S1 (Critical)
Version:5.1.7 OS:Linux (Ubuntu - Hardy Heron)
Assigned to: Mark Matthews CPU Architecture:Any
Tags: jdbc, question mark, Unicode

[24 Mar 2009 5:25] Phil Cross
Description:
When I try to insert some unicode characters (like \u2192 - a right arrow), I get question marks upon retrieval.  When I use the command line client, I can make it work but this isn't a workaround since the system I am using can only be accessed through jdbc.  If I try to force the encoding to utf-8 by changing the URL from:

jdbc:mysql://localhost/test

to

jdbc:mysql://localhost/test?useUnicode=true&characterEncoding=UTF-8

I get an exception:

run:
inserting: →
java.sql.SQLException: Incorrect string value: '\xE2\x86\x92' for column 'str' at row 1
        at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1055)
        at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:956)
        at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3515)
        at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3447)
        at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:1951)
        at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2101)
        at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2548)
        at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2477)
        at com.mysql.jdbc.StatementImpl.execute(StatementImpl.java:741)
        at com.mysql.jdbc.StatementImpl.execute(StatementImpl.java:587)
        at mysqltest.Main.main(Main.java:17)

When I use the plain URL, I get question marks instead of the actual character.

How to repeat:
NOTE: I have 2 Connection lines (one with the useUnicode parameters and one without.  Adjust the comment as needed for testing).

package mysqltest;

import java.sql.*;

public class Main {
    public static void main(String[] args) {
        try{
            //Make our statement
            //Connection connection = DriverManager.getConnection("jdbc:mysql://localhost/test", "test", "test");
            Connection connection = DriverManager.getConnection("jdbc:mysql://localhost/test?useUnicode=true&characterEncoding=UTF-8", "test", "test");
            Statement statement = connection.createStatement();
            String testString = "\u2192";

            statement.execute("drop table if exists test");
            statement.execute("create table test (id int, str varchar(50))");
            System.out.println("inserting: " + testString);
            statement.execute("insert into test (id, str) values (1, '" + testString + "')");
            ResultSet rs = statement.executeQuery("select str from test where id = 1");
            if(rs.next())
                System.out.println("selected: " + rs.getString("str"));
        }
        catch(Exception e){
            e.printStackTrace();
        }
    }

}
[24 Mar 2009 5:28] Phil Cross
BTW, I am using JDBC connector 5.1.7 and mysql version 5.0.51a-3ubuntu5.4 although I suspect that there are similar results with other versions of mysql because I have a BSD server with the same problem.  I am pretty sure this is a JDBC driver issue (if not user error or lack of documentation).
[24 Mar 2009 8:47] Tonci Grgin
Hi Phil and thanks for your report.

According to http://www.fileformat.info/info/unicode/char/2192/index.htm, E28692 is UTF-8 and 2192 is Unicode for right arrow.

Can you please try putting charset introducer in front of 2192 like this:
  SELECT (INSERT ...) _ucs2 0x2192
or use UTF8 representation E28692 and inform me of result.

mysql> show variables like "%character%";
+--------------------------+----------------------------------+
| Variable_name            | Value                            |
+--------------------------+----------------------------------+
| character_set_client     | utf8                             | 
| character_set_connection | utf8                             | 
| character_set_database   | utf8                             | 
| character_set_filesystem | binary                           | 
| character_set_results    | utf8                             | 
| character_set_server     | utf8                             | 
| character_set_system     | utf8                             | 
| character_sets_dir       | /opt/mysql/mysql/share/charsets/ | 
+--------------------------+----------------------------------+
8 rows in set (0.00 sec)

mysql> SELECT _utf8 0xE28692;
+----------------+
| _utf8 0xE28692 |
+----------------+
| →            | 
+----------------+
1 row in set (0.03 sec)

mysql> SELECT _ucs2 0x2192;
+--------------+
| _ucs2 0x2192 |
+--------------+
| →          | 
+--------------+
1 row in set (0.00 sec)

Output of show variables like "%character%"; and show create table table_with_problem; might prove useful too.

Please also check general server log at the time your c/J test starts.
[24 Mar 2009 15:11] Phil Cross
I understand that the unicode character 2192 is 0xE28692 in utf-8.  The exception in my original post shows that mysql (or the jdbc driver) is reporting that \xE2\x86\x92 is an invalid sequence.  When I work with the command line as you describe in your message, everything works fine.  It only fails when going through the JDBC driver.  The problem is that the system I work with can only interact through jdbc, so I can't use a solution where I operate directly in the mysql client via command line.  I have to have a programming solution in java or a new connector JAR file (which may or may not happen).  I may try some previous jdbc drivers and see if the problem has always existed or if it is specific to this 5.1.7 release.
[24 Mar 2009 15:19] Tonci Grgin
Phil, I do not understand your last post... Did you tried slapping charset introducer in front of your query (statement.execute("insert into test (id, str) values (1, _ucs2 '"+ testString + "')");)? Did you checked logs? Did you provided variables and structures as asked?

I still do not see a bug here, just illegal mixing of charsets/collations resulting in proper error with expected UTF8 code for symbol.
[24 Mar 2009 15:40] Phil Cross
Sorry, my log is empty.  I changed the code in my example from:

statement.execute("insert into test (id, str) values (1, '" + testString + "')");

to

statement.execute("insert into test (id, str) values (1, _ucs2 '" + testString + "')");

and I still get output of:

run:
inserting: →
selected: ?

I also reverted the connection lines to:

Connection connection = DriverManager.getConnection("jdbc:mysql://localhost/test", "test", "test");
//Connection connection = DriverManager.getConnection("jdbc:mysql://localhost/test?useUnicode=true&characterEncoding=UTF-8", "test", "test");

When I use the "useUnicode" parameter:

//Connection connection = DriverManager.getConnection("jdbc:mysql://localhost/test", "test", "test");
Connection connection = DriverManager.getConnection("jdbc:mysql://localhost/test?useUnicode=true&characterEncoding=UTF-8", "test", "test");

I get a stranger exception (not sure how it came up with x86, x92):

run:
inserting: →
java.sql.SQLException: Incorrect string value: '\x86\x92' for column 'str' at row 1
        at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1055)
        at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:956)
        at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3515)
        at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3447)
        at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:1951)
        at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2101)
        at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2548)
        at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2477)
        at com.mysql.jdbc.StatementImpl.execute(StatementImpl.java:741)
        at com.mysql.jdbc.StatementImpl.execute(StatementImpl.java:587)
        at mysqltest.Main.main(Main.java:17)

Finally, you say that "I still do not see a bug here, just illegal mixing of
charsets/collations resulting in proper error with expected UTF8 code
for symbol."

I disagree (and maybe I am misunderstanting the error message).  It seems to me that if I have a string "\u2192" and I pass it to mysql and it says:

java.sql.SQLException: Incorrect string value: '\xE2\x86\x92' for
column 'str' at row 1

That sounds to me like a proper unicode to utf8 conversion occurred and mysql is rejecting it.  I never told it about E28692 but it came up with it on its own (as it should have).  So, why is it telling me that it's an "incorrect string value"?
[24 Mar 2009 16:09] Tonci Grgin
A simple test shows this character is represented with more than 3 bytes in UTF-8 thus I can only suggest you to try MySQL server 6 as older versions have incomplete implementation for such long UTF8 streams...

	System.out.println("→".getBytes("utf-8").length);	

Can you test with MySQL server 6.0 and see if it works?
[24 Mar 2009 16:20] Phil Cross
Actually, this character is exactly 3 bytes (not more than 3) does that lead to the same answer?  If so, I can try mysql 6 but that won't be a quick test, so I want to verify with you.

Code to verify the length (your code had an extra character in the previous post):
System.out.println("\u2192".getBytes("utf-8").length);

prints "3"
[24 Mar 2009 16:24] Tonci Grgin
Phil, true, just saw it in e-mail.

Verified as described.
[20 Aug 2009 19:53] Mark Matthews
Looking at the bytes sent by the driver, the UTF-8 encoding is correct. It's the server that is complaining. I'm assigning this to the server, however it's easy to test locally by setting the sql_mode to 'strict_trans_tables' before you issue the insert in the CLI. I get the same error as the JDBC driver does when do that (with a relatively recent 5.0, server, but not 5.1).

Without 'strict_trans_tables' enabled the server will still store the bytes, but not issue an error. The JDBC driver sets 'strict_trans_tables' because the behavior given in that mode is required by the JDBC API specification (exceptions on truncation).
[21 Aug 2009 22:07] Peter Gulutzan
What's the result of

select character_set_name
from information_schema.columns
where table_name='test'and column_name='str';

?
[21 Sep 2009 23:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".