Bug #25528 | Connector/J Not Handling Unicode (UTF8) Properly | ||
---|---|---|---|
Submitted: | 10 Jan 2007 18:02 | Modified: | 26 Jul 2007 17:11 |
Reporter: | Paul Dlug | Email Updates: | |
Status: | Can't repeat | Impact on me: | |
Category: | Connector / J | Severity: | S1 (Critical) |
Version: | 5.0.4 | OS: | Any (Any) |
Assigned to: | Alexander Soklakov | CPU Architecture: | Any |
[10 Jan 2007 18:02]
Paul Dlug
[19 Jan 2007 10:15]
Tonci Grgin
Hi Paul and thanks for your report. I was unable to verify described behavior. I took "reverse" test since I don't use pearl/ruby. Environment: - MySQL 5.0.34BK on WinXP Pro SP2 localhost - c/J 5.0 latest sources - JDK 1.5.0_10 Test: create database testutf DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci; Make Java test case using Eclipse, default VM arguments: -Xmx256M -Dcom.mysql.jdbc.testsuite.url.default=jdbc:mysql://localhost:3306/testutf?user=root&password=&useServerPrepStmts=false&jdbcCompliantTruncation=true public void testBug25528() throws Exception { try { Properties props = new Properties(); props.put("useUnicode", "true"); props.put("characterEncoding", "utf8"); Connection conn2 = getConnectionWithProps(props); Statement stmt2 = conn2.createStatement(); stmt2.executeUpdate("DROP TABLE IF EXISTS testBug25528"); stmt2.executeUpdate("CREATE TABLE testBug25528(" + "Id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY," + " translationValue char(255)" + "\n)ENGINE=InnoDB DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;"); stmt2.executeUpdate("insert into testBug25528 values(NULL,'Mössbauer Effect')"); this.rs = stmt2.executeQuery("SELECT * FROM testBug25528"); assertEquals(true, this.rs.next()); System.out.print(this.rs.getObject(1)); System.out.print(this.rs.getObject(2)); System.out.print('\t'); System.out.println(); } finally { closeMemberJDBCResources(); } } Run Java test case: .Loading JDBC driver 'com.mysql.jdbc.Driver' Done. Done. Connected to 5.0.34-log 1Mössbauer Effect Time: 0,484 OK (1 test) -------------------------------------------------------------- Run select in mysql cl client: C:\mysql507\bin>mysql -uroot testutf Welcome to the MySQL monitor. Commands end with ; or \g. Your MySQL connection id is 3 Server version: 5.0.34-log Source distribution Type 'help;' or '\h' for help. Type '\c' to clear the buffer. mysql> select * from testbug25528; +----+------------------+ | Id | translationValue | +----+------------------+ | 1 | Mössbauer Effect | +----+------------------+ 1 row in set (0.00 sec) mysql> show variables like "%char%"; +--------------------------+-----------------------------+ | Variable_name | Value | +--------------------------+-----------------------------+ | character_set_client | latin1 | | character_set_connection | latin1 | | character_set_database | utf8 | | character_set_filesystem | binary | | character_set_results | latin1 | | character_set_server | utf8 | | character_set_system | utf8 | | character_sets_dir | C:\mysql507\share\charsets\ | +--------------------------+-----------------------------+ 8 rows in set (0.00 sec) -------------------------------------------------------------- Run "independent" tool to check what's written: See attached image from MySQL QB If you have any more info to add to this report, please reopen it.
[19 Jan 2007 10:16]
Tonci Grgin
Same select in QB
Attachment: 25528.jpg (image/pjpeg, text), 60.43 KiB.
[29 Jun 2007 4:19]
Rebecca Berrigan
Greetings I have reproduced this bug using MySQL 4.1.18/20 and Connector/J 5.0/3.1 on Linux (CentOS) and MacOSX(10.4). The bug arises *specifically* when UTF-8 characters (I was testing using Japanese words) are inserted into utf-8 tables (collation utf8_general_ci) using Perl OR the command line client (cut/paste UTF-8), and is then retrieved using Connector/J-Java. I have walked through the Connector/J source to follow the select at low levels - I am satisfied that Connector/J is not reformatting the bytes to produce garbage - the garbage seems to be emerging from the db, in what I am realising is perhaps a discrepancy in the insert model. Strangely enough, keeping tables encoded as latin1, and asking Java to do the classic 'new String(rs.getBytes("column_name"), "UTF-8"));' ploy works well. This bug has *not* been reproduced properly if you are using only Java - we know that if Java-Connector/J inserts UTF-8, then Java-Connector/J retrieves it well. You must reproduce this bug inserting using Perl or Ruby - if you do not code in Perl or Ruby please can we find someone who can
[26 Jul 2007 15:03]
Paul Dlug
I think you are correct in your reproduction. I am still having problems with this despite trying numerous variations on the new String(rs.getBytes(""), "UTF-8"). Does anyone have any other ideas?
[26 Jul 2007 17:03]
Paul Dlug
Scripts to replicate the problem
Attachment: MySQLUnicode.tar.gz (application/x-gzip, text), 498.59 KiB.
[26 Jul 2007 17:11]
Paul Dlug
I just attached a bundle of code to replicate the problem. Run it like this: Create the database and tables: mysql -uroot < setup.sql Use perl to insert/query back the table and test that the UTF-8 is well formed: perl run-perl-test.pl Use ruby to insert/query back the table and test that the UTF-8 is well formed: ruby run-ruby-test.rb Use java to insert/query back the table: ./run-java-test.sh You can keep running the scripts in any order they each insert a distinct UTF-8 string so you can see their result: Perl: Mössbauer Ruby: Müller Java: Hjörvarsson You'll find in running the tests that ruby, perl, and the command line mysql client are all able to see the correct UTF-8 characters but see the rows inserted by java as mangled but valid UTF-8. Java gets mangled results for all but the rows it inserts. Requirements: - Perl: DBI/DBD::MySQL - Ruby: ruby mysql and active_support (part of rails) for the nice unicode functions - Java: any java runtime >= 1.5
[13 Aug 2007 17:15]
Luca Rondanini
I'm experiencing the same problem with PHP: (mysql 5.0.18, java 1.5, Connectors 5.0->5.1) ----------------------------------------------------------------------------------------- table: ----------------------------------------------------------------------------------------- CREATE TABLE `TMS_DEV` ( `segment` varchar(20) collate utf8_general_ci NOT NULL ) ENGINE=MyISAM DEFAULT CHARSET=utf8_general_ci COLLATE=utf8_general_ci; ----------------------------------------------------------------------------------------- in PHP: ----------------------------------------------------------------------------------------- $mysql_hostname = ""; // Database Server machine $mysql_database = ""; // Database Name $mysql_username = ""; // Database User $mysql_password = ""; // Database Password $mysql_link = mysql_connect($mysql_hostname, $mysql_username, $mysql_password); mysql_select_db($mysql_database, $mysql_link); $q = utf8_encode("città"); $query = "INSERt INTO TMS_DEV(segment) value('".$q."')"; mysql_query($query,$mysql_link); echo mysql_error(); ----------------------------------------------------------------------------------------- in JAVA: ----------------------------------------------------------------------------------------- //I've used also: //?useUnicode=true&characterEncoding=UTF-8&characterSetResults=UTF-8 String jdbc = "jdbc:mysql://server_ip/test"; String usernm= ""; String pwd = ""; ResultSet rs; PreparedStatement stmt; Class.forName("com.mysql.jdbc.Driver"); con = DriverManager.getConnection(jdbc,usernm,pwd); String q = "select segment from TMS_DEV "; stmt = con.prepareStatement(q); rs = stmt.executeQuery(); while(rs.next()) { String t = new String(rs.getString("segment").getBytes("ISO-8859-1")); System.out.println("ISO: "+t); //correct output but completly wrong approach!! System.out.println("ERROR: "+rs.getString("segment")); }
[13 Aug 2007 19:13]
Tonci Grgin
Hello all. This is a complex issue so I'll try to explain: 1) I have walked through the Connector/J source to follow the select at low levels - I am satisfied that Connector/J is not reformatting the bytes to produce garbage - the garbage seems to be emerging from the db, in what I am realising is perhaps a discrepancy in the insert model. Half truth... Everything except "discrepancy in the insert model" is true. 2) This bug has *not* been reproduced properly if you are using only Java - we know that if Java-Connector/J inserts UTF-8, then Java-Connector/J retrieves it well. You must reproduce this bug inserting using Perl or Ruby - if you do not code in Perl or Ruby please can we find someone who can So this is a bug report on the way some PERL / RUBY code works? Joking, but just want to make the point that error can come from anywhere, no matter how it looks to you. 3) Strangely enough, keeping tables encoded as latin1, and asking Java to do the classic 'new String(rs.getBytes("column_name"), "UTF-8"));' ploy works well. I think this is to the point. What I believe is happening is this: - You have MySQL server set with defaults. What concerns us is charset variables and they are Latin1 by default. So, during insert, you most likely have Latin1 converted to UTF8. Then when Java connects, mysqld tells the JDBC driver that the characters are in UTF-8, but they aren't. I see you're using all defaults for the character set in posted PHP example... This situation can be fixed by issuing "SET NAMES utf8" as first query in PHP code (before any insert takes place). -------------------------------- To all, please do the following: -------------------------------- 1) Make sure MySQL server is configured properly for tasks you want: C:\mysql507\bin>mysql -uroot -p test Welcome to the MySQL monitor. Commands end with ; or \g. Your MySQL connection id is 1 Server version: 5.0.48-pb1030-log MySQL Pushbuild Edition, build 1030 Type 'help;' or '\h' for help. Type '\c' to clear the buffer. mysql> show variables like "%char%"; +--------------------------+-----------------------------+ | Variable_name | Value | +--------------------------+-----------------------------+ | character_set_client | latin1 | | character_set_connection | latin1 | | character_set_database | utf8 | | character_set_filesystem | binary | | character_set_results | latin1 | | character_set_server | utf8 | | character_set_system | utf8 | | character_sets_dir | C:\mysql507\share\charsets\ | +--------------------------+-----------------------------+ 8 rows in set (0.03 sec) 2) Check "real" characters returned with a new String(rs.getBytes(), "ISO8859_1") to confirm my suspicions as with rs.getString() the JDBC driver is going to use whatever character set it is told to by MySQL. 3) I've been informed that PHP5 is non-unicode Thank you all for your interest in MySQL.
[14 Aug 2007 11:13]
Luca Rondanini
Hi all, I've found a stupid workaround: String a222 = ""; InputStream stream=rs.getBinaryStream("field"); int read; while((read = stream.read()) != -1) { a222 +=(char)read; } Mysql variables: +--------------------------+----------------------------+ | Variable_name | Value | +--------------------------+----------------------------+ | character_set_client | latin1 | | character_set_connection | latin1 | | character_set_database | latin1 | | character_set_results | latin1 | | character_set_server | latin1 | | character_set_system | utf8 | | character_sets_dir | /usr/share/mysql/charsets/ | +--------------------------+----------------------------+ The jdbc connection string use: ?characterEncoding=ISO8859_1&characterSetResults=ISO8859_1 NB: If I try to change all the variables to UTF-8(mysql+java conn string)..as should be done....even the bytes from the getBinaryStream function are corrupted!!
[9 Jan 2008 14:26]
Benjamin Horsleben
I have the same problem; I tried to insert some rows through a java-app and they are inserted into the database in iso-8859-1 rather than UTF-8. Charset of database, table and column are all utf-8. Server variables are as follows: mysql> show variables like "%char%"; +--------------------------+------------------------------------------------------------+ | Variable_name | Value | +--------------------------+------------------------------------------------------------+ | character_set_client | latin1 | | character_set_connection | latin1 | | character_set_database | utf8 | | character_set_filesystem | binary | | character_set_results | latin1 | | character_set_server | utf8 | | character_set_system | utf8 | | character_sets_dir | /usr/local/mysql-5.0.45-osx10.4-i686/share/mysql/charsets/ | +--------------------------+------------------------------------------------------------+ (As a side note, how can I force the last latin1 occurrences to be utf8?) A select-query run from the mysql command line client shows '?' in place of several characters, as does a query run from PHP code. A debug of the java-program prior to insertion has confirmed that the characters was correct before insertion. The java code is basically as follows, unnecessary lines have been removed for clarity: <code> String url = "jdbc:mysql://"+host+":"+port+"/"+database; Connection connection = DriverManager.getConnection(url, username, password); Statement st = connection.createStatement(ResultSet.TYPE_SCROLL_SENSITIVE, ResultSet.CONCUR_UPDATABLE); ResultSet rs = st.executeQuery("SELECT * FROM bank_transfers WHERE receipt="+r.getId()); rs.moveToInsertRow(); rs.updateInt("receipt", receiptId); rs.updateString("text", t.getText()); rs.insertRow(); </code>
[14 Feb 2008 17:03]
Patrick Hayden
I also have this problem in the following environment: +-------------------------+-----------------------------------------------+ | Variable_name | Value | +-------------------------+-----------------------------------------------+ | protocol_version | 10 | | version | 5.0.37 | | version_comment | Mandriva Linux - MySQL Standard Edition (GPL) | | version_compile_machine | i586 | | version_compile_os | mandriva-linux-gnu | +-------------------------+-----------------------------------------------+ using MySQL connector JDBC driver version 5.0.5. My character set settings match those of the developer: +--------------------------+----------------------------+ | Variable_name | Value | +--------------------------+----------------------------+ | character_set_client | latin1 | | character_set_connection | latin1 | | character_set_database | utf8 | | character_set_filesystem | binary | | character_set_results | latin1 | | character_set_server | utf8 | | character_set_system | utf8 | | character_sets_dir | /usr/share/mysql/charsets/ | +--------------------------+----------------------------+ What I see is that UTF-8 characters that are stored using the JDBC driver are displayed as ??? in the database and can only be retrieved using a SELECT from the JDBC driver and not using a SELECT in the MySQL client or in a Perl script. UTF-8 characters that are stored using an INSERT statement in the MySQL client can be retrieved using a SELECT statement in the MySQL client or from a Perl script. They CANNOT be retrieved using SELECT from the JDBC driver. Conclusion - the data is being mangled/de-mangled by the JDBC driver or by whatever part of the database communicates with the JDBC driver. Has it got something to do with the value of character_set_connection? Why are character_set_client, character_set_collection and character_set_results set to latin1? Shouldn't everything be utf-8?
[18 Feb 2008 16:17]
Sébastien Le Ray
Same problem here, data inserted with JDBC are correctly fetched through JDBC but are unreadable using command line client or PHP. Here are relevant information: Java side ========= cnx = DriverManager.getConnection("jdbc:mysql://localhost/db?useUnicode=true&characterEncoding=UTF-8"...); PreparedStatement pstmt = cnx.prepareStatement("UPDATE tbl SET title=? WHERE id = 6520"); pstmt.setBytes(1, "teéééééééééééést".getBytes("UTF-8")); pstmt.execute(); ResultSet rs = plop.executeQuery("SELECT title FROM tbl WHERE id = 1234"); while(rs.next()) { System.out.println(rs.getString(1) + " => " + rs.getString(2)); } => correctly displays teéééééééééééést character_set_client => utf8 character_set_connection => utf8 character_set_database => utf8 character_set_filesystem => binary character_set_results => character_set_server => utf8 character_set_system => utf8 Other side ============ mysql> show variables like '%char%'; +--------------------------+--------------------------------------------------+ | Variable_name | Value | +--------------------------+--------------------------------------------------+ | character_set_client | utf8 | | character_set_connection | utf8 | | character_set_database | utf8 | | character_set_filesystem | binary | | character_set_results | utf8 | | character_set_server | utf8 | | character_set_system | utf8 | | character_sets_dir | /usr/local/Zend/Core/mysql/share/mysql/charsets/ | +--------------------------+--------------------------------------------------+ mysql> select title from tbl where id = 6520; +------------------+ | title | +------------------+ | te������������st | +------------------+ N.B. I get the same junk using PHP, I tried almost every possible parameters for JDBC URI and for pstmt.setXXX... MySQL 5.0.51 Connector/J 5.1.5