Bug #25528 Connector/J Not Handling Unicode (UTF8) Properly
Submitted: 10 Jan 2007 18:02 Modified: 26 Jul 2007 17:11
Reporter: Paul Dlug Email Updates:
Status: Can't repeat Impact on me:
None 
Category:Connector / J Severity:S1 (Critical)
Version:5.0.4 OS:Any (Any)
Assigned to: Alexander Soklakov CPU Architecture:Any

[10 Jan 2007 18:02] Paul Dlug
Description:
Originally posted to the forum, but decided that this probably deserves attention as a bug since it is fairly serious and can be reproduced: http://forums.mysql.com/read.php?39,134161,134161#msg-134161

I seem to be having some issues with the MySQL JDBC driver when dealing with records written/read by other languages, in a nutshell the problem is: 

1) Write a utf8 string from perl via DBD::mysql to a table, reading back the string it is well formed utf8 (after calling decode_utf8 on it to mark it in perl) 
2) Read back the string with java -- string is mangled 

The problem occurs in reverse as well. Java does read back the strings as valid UTF8 if it writes them itself but they are clearly mangled in mysql command line client or another language such as perl or ruby. 

For example, the first record below was written with perl, the second with java: 

mysql> select * from articles; 
+-------------------+ 
| title | 
+-------------------+ 
| Mössbauer Effect | 
| M??ssbauer Effect | 
+-------------------+ 

Clearly the first record is correct, validly formed utf8 data. I have ruled this out as a problem with java utf8 string handling itself because if I read in a file containing utf8 data and write it out to another I am able to diff it with no changes and read it with perl and ruby as perfectly well formed utf8 data. I have also tested this using utf8 data over a servlet with the appropriate encodings, transferring utf8 data this way between languages works just fine. The problem appears to be isolated to writing to a MySQL DB via JDBC. 

All of my tables are InnoDB with character set utf8, collation utf8_bin. Defaults on the server are the same. I have tested with server versions 5.0.24 and 5.1.14 both with MySQL Connector/J version 5.0.4. The connect string I am using with JDBC has the options: useUnicode=true&characterEncoding=UTF-8&characterSetResults=UTF-8 

What am I doing wrong here? Is there a bug with Connector/J? 

I can provide code samples and other information as needed.

How to repeat:
Default and table charset set to utf8, collation utf8_bin

1. Write a utf8 string to a table using DBD::mysql (perl) or ruby mysql

2. Read back the string using perl or ruby, well formed utf8 string

3. Read back the string using Connector/J, mangled utf8 string results.

Or in reverse:

1. Write a utf8 string to a table using Connector/J

2. Read back the string using Connector/J, well formed utf8 string as far as java is concerned.

3. Read back the string using perl, ruby, or mysql command line client, mangled utf8 string results
[19 Jan 2007 10:15] Tonci Grgin
Hi Paul and thanks for your report. I was unable to verify described behavior.
I took "reverse" test since I don't use pearl/ruby.

Environment:
  - MySQL 5.0.34BK on WinXP Pro SP2 localhost
  - c/J 5.0 latest sources
  - JDK 1.5.0_10

Test:
  create database testutf DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;
  Make Java test case using Eclipse, default VM arguments: -Xmx256M -Dcom.mysql.jdbc.testsuite.url.default=jdbc:mysql://localhost:3306/testutf?user=root&password=&useServerPrepStmts=false&jdbcCompliantTruncation=true

	public void testBug25528() throws Exception {
		try {
			Properties props = new Properties();
			
			props.put("useUnicode", "true");
			props.put("characterEncoding", "utf8");
			Connection conn2 = getConnectionWithProps(props);
			Statement stmt2 = conn2.createStatement();

			stmt2.executeUpdate("DROP TABLE IF EXISTS testBug25528");
			stmt2.executeUpdate("CREATE TABLE testBug25528("
					+ "Id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,"
					+ " translationValue char(255)"
					+ "\n)ENGINE=InnoDB DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;");
			stmt2.executeUpdate("insert into testBug25528 values(NULL,'Mössbauer Effect')");
			this.rs = stmt2.executeQuery("SELECT * FROM testBug25528");
            		assertEquals(true, this.rs.next());

          		System.out.print(this.rs.getObject(1));
          		System.out.print(this.rs.getObject(2));
            		System.out.print('\t');
            		System.out.println();
			
		} finally {
			closeMemberJDBCResources();
		}
	}

  Run Java test case:

.Loading JDBC driver 'com.mysql.jdbc.Driver'
Done.

Done.

Connected to 5.0.34-log
1Mössbauer Effect	

Time: 0,484

OK (1 test)
--------------------------------------------------------------

  Run select in mysql cl client:
C:\mysql507\bin>mysql -uroot testutf
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 3
Server version: 5.0.34-log Source distribution

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

mysql> select * from testbug25528;
+----+------------------+
| Id | translationValue |
+----+------------------+
|  1 | Mössbauer Effect |
+----+------------------+
1 row in set (0.00 sec)

mysql> show variables like "%char%";
+--------------------------+-----------------------------+
| Variable_name            | Value                       |
+--------------------------+-----------------------------+
| character_set_client     | latin1                      |
| character_set_connection | latin1                      |
| character_set_database   | utf8                        |
| character_set_filesystem | binary                      |
| character_set_results    | latin1                      |
| character_set_server     | utf8                        |
| character_set_system     | utf8                        |
| character_sets_dir       | C:\mysql507\share\charsets\ |
+--------------------------+-----------------------------+
8 rows in set (0.00 sec)
--------------------------------------------------------------

  Run "independent" tool to check what's written: See attached image from MySQL QB

If you have any more info to add to this report, please reopen it.
[19 Jan 2007 10:16] Tonci Grgin
Same select in QB

Attachment: 25528.jpg (image/pjpeg, text), 60.43 KiB.

[29 Jun 2007 4:19] Rebecca Berrigan
Greetings

I have reproduced this bug using MySQL 4.1.18/20 and Connector/J 5.0/3.1 on Linux (CentOS) and MacOSX(10.4). The bug arises *specifically* when UTF-8 characters (I was testing using Japanese words) are inserted into utf-8 tables (collation utf8_general_ci) using Perl OR the command line client (cut/paste UTF-8), and is then retrieved using Connector/J-Java.

I have walked through the Connector/J source to follow the select at low levels - I am satisfied that Connector/J is not reformatting the bytes to produce garbage - the garbage seems to be emerging from the db, in what I am realising is perhaps a discrepancy in the insert model.

Strangely enough, keeping tables encoded as latin1, and asking Java to do the classic 'new String(rs.getBytes("column_name"), "UTF-8"));' ploy works well.

This bug has *not* been reproduced properly if you are using only Java - we know that if Java-Connector/J inserts UTF-8, then Java-Connector/J retrieves it well. You must reproduce this bug inserting using Perl or Ruby - if you do not code in Perl or Ruby please can we find someone who can
[26 Jul 2007 15:03] Paul Dlug
I think you are correct in your reproduction. I am still having problems with this despite trying numerous variations on the new String(rs.getBytes(""), "UTF-8"). Does anyone have any other ideas?
[26 Jul 2007 17:03] Paul Dlug
Scripts to replicate the problem

Attachment: MySQLUnicode.tar.gz (application/x-gzip, text), 498.59 KiB.

[26 Jul 2007 17:11] Paul Dlug
I just attached a bundle of code to replicate the problem. Run it like this:

Create the database and tables:

mysql -uroot < setup.sql

Use perl to insert/query back the table and test that the UTF-8 is well formed:

perl run-perl-test.pl

Use ruby to insert/query back the table and test that the UTF-8 is well formed:

ruby run-ruby-test.rb

Use java to insert/query back the table:

./run-java-test.sh

You can keep running the scripts in any order they each insert a distinct UTF-8 string so you can see their result:

Perl: Mössbauer
Ruby: Müller
Java:  Hjörvarsson

You'll find in running the tests that ruby, perl, and the command line mysql client are all able to see the correct UTF-8 characters but see the rows inserted by java as mangled but valid UTF-8. Java gets mangled results for all but the rows it inserts.

Requirements:

- Perl: DBI/DBD::MySQL

- Ruby: ruby mysql and active_support (part of rails) for the nice unicode functions

- Java: any java runtime >= 1.5
[13 Aug 2007 17:15] Luca Rondanini
I'm experiencing the same problem with PHP:
(mysql 5.0.18, java 1.5, Connectors 5.0->5.1)

-----------------------------------------------------------------------------------------
table:
-----------------------------------------------------------------------------------------

CREATE TABLE `TMS_DEV` (
  `segment` varchar(20) collate utf8_general_ci NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8_general_ci COLLATE=utf8_general_ci;

-----------------------------------------------------------------------------------------
in PHP:
-----------------------------------------------------------------------------------------

$mysql_hostname  = "";	// Database Server machine
$mysql_database  = "";  // Database Name
$mysql_username  = "";	// Database User
$mysql_password  = "";	// Database Password

$mysql_link = mysql_connect($mysql_hostname, $mysql_username, $mysql_password);
mysql_select_db($mysql_database, $mysql_link);
$q = utf8_encode("città");
$query = "INSERt INTO TMS_DEV(segment) value('".$q."')";
mysql_query($query,$mysql_link);
echo mysql_error();

-----------------------------------------------------------------------------------------
in JAVA:
-----------------------------------------------------------------------------------------

//I've used also:
//?useUnicode=true&characterEncoding=UTF-8&characterSetResults=UTF-8

String jdbc = "jdbc:mysql://server_ip/test";
String usernm= "";
String pwd = "";
ResultSet rs;
PreparedStatement stmt;

Class.forName("com.mysql.jdbc.Driver");
con = DriverManager.getConnection(jdbc,usernm,pwd);

String q = "select segment from TMS_DEV ";

stmt = con.prepareStatement(q);
rs = stmt.executeQuery();
while(rs.next()) {
	String t = new String(rs.getString("segment").getBytes("ISO-8859-1"));
	System.out.println("ISO: "+t);  //correct output but completly wrong approach!!
	System.out.println("ERROR: "+rs.getString("segment"));
}
[13 Aug 2007 19:13] Tonci Grgin
Hello all. This is a complex issue so I'll try to explain:

1) I have walked through the Connector/J source to follow the select at low levels - I am satisfied that Connector/J is not reformatting the bytes to produce garbage - the garbage seems to be emerging from the db, in what I am realising is perhaps a discrepancy in the insert model.

Half truth... Everything except "discrepancy in the insert model" is true.

2) This bug has *not* been reproduced properly if you are using only Java - we know that if Java-Connector/J inserts UTF-8, then Java-Connector/J retrieves it well. You must reproduce this bug inserting using Perl or Ruby - if you do not code in Perl or Ruby please can we find someone who can

So this is a bug report on the way some PERL / RUBY code works? Joking, but just want to make the point that error can come from anywhere, no matter how it looks to you.

3) Strangely enough, keeping tables encoded as latin1, and asking Java to do the classic 'new String(rs.getBytes("column_name"), "UTF-8"));' ploy works well.

I think this is to the point. What I believe is happening is this:
 - You have MySQL server set with defaults. What concerns us is charset variables and they are Latin1 by default. So, during insert, you most likely have Latin1 converted to UTF8. Then when Java connects, mysqld tells the JDBC driver that the characters are in UTF-8, but they aren't. I see you're using all defaults for the character set in posted PHP example...
This situation can be fixed by issuing "SET NAMES utf8" as first query in PHP code (before any insert takes place).

--------------------------------
To all, please do the following:   
--------------------------------
 1) Make sure MySQL server is configured properly for tasks you want:
C:\mysql507\bin>mysql -uroot -p test
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 1
Server version: 5.0.48-pb1030-log MySQL Pushbuild Edition, build 1030

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

mysql> show variables like "%char%";
+--------------------------+-----------------------------+
| Variable_name            | Value                       |
+--------------------------+-----------------------------+
| character_set_client     | latin1                      |
| character_set_connection | latin1                      |
| character_set_database   | utf8                        |
| character_set_filesystem | binary                      |
| character_set_results    | latin1                      |
| character_set_server     | utf8                        |
| character_set_system     | utf8                        |
| character_sets_dir       | C:\mysql507\share\charsets\ |
+--------------------------+-----------------------------+
8 rows in set (0.03 sec)

 2) Check "real" characters returned with a new String(rs.getBytes(), "ISO8859_1") to confirm my suspicions as with rs.getString() the JDBC driver is going to use whatever character set it is told to by MySQL.

 3) I've been informed that PHP5 is non-unicode

Thank you all for your interest in MySQL.
[14 Aug 2007 11:13] Luca Rondanini
Hi all,

I've found a stupid workaround:

String a222 = "";
InputStream stream=rs.getBinaryStream("field");
int read;
while((read = stream.read()) != -1) {
   a222 +=(char)read;
}

Mysql variables:
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | latin1                     |
| character_set_connection | latin1                     |
| character_set_database   | latin1                     |
| character_set_results    | latin1                     |
| character_set_server     | latin1                     |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+

The jdbc connection string use:
?characterEncoding=ISO8859_1&characterSetResults=ISO8859_1

NB:
If I try to change all the variables to UTF-8(mysql+java conn string)..as should be done....even the bytes from the getBinaryStream function are corrupted!!
[9 Jan 2008 14:26] Benjamin Horsleben
I have the same problem; I tried to insert some rows through a java-app and they are inserted into the database in iso-8859-1 rather than UTF-8.

Charset of database, table and column are all utf-8. Server variables are as follows:
mysql> show variables like "%char%";
+--------------------------+------------------------------------------------------------+
| Variable_name            | Value                                                      |
+--------------------------+------------------------------------------------------------+
| character_set_client     | latin1                                                     | 
| character_set_connection | latin1                                                     | 
| character_set_database   | utf8                                                       | 
| character_set_filesystem | binary                                                     | 
| character_set_results    | latin1                                                     | 
| character_set_server     | utf8                                                       | 
| character_set_system     | utf8                                                       | 
| character_sets_dir       | /usr/local/mysql-5.0.45-osx10.4-i686/share/mysql/charsets/ | 
+--------------------------+------------------------------------------------------------+
(As a side note, how can I force the last latin1 occurrences to be utf8?)

A select-query run from the mysql command line client shows '?' in place of several characters, as does a query run from PHP code. A debug of the java-program prior to insertion has confirmed that the characters was correct before insertion.

The java code is basically as follows, unnecessary lines have been removed for clarity:
<code>
String url = "jdbc:mysql://"+host+":"+port+"/"+database;
Connection connection = DriverManager.getConnection(url, username, password);
Statement st = connection.createStatement(ResultSet.TYPE_SCROLL_SENSITIVE, ResultSet.CONCUR_UPDATABLE);
ResultSet rs = st.executeQuery("SELECT * FROM bank_transfers WHERE receipt="+r.getId());
rs.moveToInsertRow();
rs.updateInt("receipt", receiptId);
rs.updateString("text", t.getText());
rs.insertRow();
</code>
[14 Feb 2008 17:03] Patrick Hayden
I also have this problem in the following environment:

+-------------------------+-----------------------------------------------+
| Variable_name           | Value                                         |
+-------------------------+-----------------------------------------------+
| protocol_version        | 10                                            |
| version                 | 5.0.37                                        |
| version_comment         | Mandriva Linux - MySQL Standard Edition (GPL) |
| version_compile_machine | i586                                          |
| version_compile_os      | mandriva-linux-gnu                            |
+-------------------------+-----------------------------------------------+

using MySQL connector JDBC driver version 5.0.5.

My character set settings match those of the developer:

+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | latin1                     |
| character_set_connection | latin1                     |
| character_set_database   | utf8                       |
| character_set_filesystem | binary                     |
| character_set_results    | latin1                     |
| character_set_server     | utf8                       |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+

What I see is that UTF-8 characters that are stored using the JDBC driver are displayed as ??? in the database and can only be retrieved using a SELECT from the JDBC driver and not using a SELECT in the MySQL client or in a Perl script.

UTF-8 characters that are stored using an INSERT statement in the MySQL client can be retrieved using a SELECT statement in the MySQL client or from a Perl script. They CANNOT be retrieved using SELECT from the JDBC driver.

Conclusion - the data is being mangled/de-mangled by the JDBC driver or by whatever part of the database communicates with the JDBC driver.

Has it got something to do with the value of character_set_connection? Why are character_set_client, character_set_collection and character_set_results set to latin1? Shouldn't everything be utf-8?
[18 Feb 2008 16:17] Sébastien Le Ray
Same problem here, data inserted with JDBC are correctly fetched through JDBC but are unreadable using command line client or PHP.

Here are relevant information:

Java side
=========

cnx = DriverManager.getConnection("jdbc:mysql://localhost/db?useUnicode=true&characterEncoding=UTF-8"...);

PreparedStatement pstmt = cnx.prepareStatement("UPDATE tbl SET title=? WHERE id = 6520");

pstmt.setBytes(1, "teéééééééééééést".getBytes("UTF-8"));
pstmt.execute();

ResultSet rs = plop.executeQuery("SELECT title FROM tbl WHERE id = 1234");
while(rs.next()) {
      System.out.println(rs.getString(1) + " => " + rs.getString(2));
}

=> correctly displays teéééééééééééést

character_set_client => utf8
character_set_connection => utf8
character_set_database => utf8
character_set_filesystem => binary
character_set_results => 
character_set_server => utf8
character_set_system => utf8

Other side
============

mysql> show variables like '%char%';
+--------------------------+--------------------------------------------------+
| Variable_name            | Value                                            |
+--------------------------+--------------------------------------------------+
| character_set_client     | utf8                                             |
| character_set_connection | utf8                                             |
| character_set_database   | utf8                                             |
| character_set_filesystem | binary                                           |
| character_set_results    | utf8                                             |
| character_set_server     | utf8                                             |
| character_set_system     | utf8                                             |
| character_sets_dir       | /usr/local/Zend/Core/mysql/share/mysql/charsets/ |
+--------------------------+--------------------------------------------------+

mysql> select title from tbl where id = 6520;
+------------------+
| title            |
+------------------+
| te������������st |
+------------------+

N.B. I get the same junk using PHP, I tried almost every possible parameters for JDBC URI and for pstmt.setXXX...

MySQL 5.0.51
Connector/J 5.1.5