Bug #53356 Character set bug at server with utf8 column and latin1 connection
Submitted: 2 May 2010 11:13 Modified: 1 Jul 2010 17:22
Reporter: Eduardo Pérez Ureta Email Updates:
Status: Not a Bug Impact on me:
None 
Category:MySQL Server: Charsets Severity:S3 (Non-critical)
Version:5.1.45-3 (Debian) OS:Linux
Assigned to: CPU Architecture:Any
Tags: charset, java

[2 May 2010 11:13] Eduardo Pérez Ureta
Description:
// This bug is reproduced using a MySQL Linux default installation where "character_set_server" is "latin1"
// and the Java connector sends "SET NAMES latin1"
// The first INSERT and the SELECT work OK
// but the UPDATE fails with "Incorrect string value: '\xED' for column 'id' at row 1" (and it should not)

How to repeat:
public class MysqlCharsetBug {

	public static void main(String[] args) throws Exception {

		String jdbcClass = "com.mysql.jdbc.Driver";
		String jdbcURL = "jdbc:mysql://localhost/bugtest";
		String username = "root";
		String password = "";

		java.sql.Connection conn = null;
		try {
			Class.forName(jdbcClass);
			conn = java.sql.DriverManager.getConnection (jdbcURL, username, password);

			java.sql.PreparedStatement createtablepstmt = conn.prepareStatement(
			        "CREATE TABLE mysqlcharsettest (id VARCHAR(100) NOT NULL, PRIMARY KEY(id)) ENGINE = InnoDB DEFAULT CHARACTER SET utf8");
			createtablepstmt.executeUpdate();
			createtablepstmt.close();

			final String stringtoinsert = "\u00ED"; // "\u0155"; // if "SET NAMES latin1" then non latin1 characters are silently converted to '?'
			System.out.println(stringtoinsert);

			java.sql.PreparedStatement insertstringpstmt = conn.prepareStatement(
			        "INSERT INTO mysqlcharsettest (id) VALUES (?)");
			insertstringpstmt.setString(1,stringtoinsert);
			insertstringpstmt.executeUpdate();
			insertstringpstmt.close();

			String string = null;
			java.sql.PreparedStatement pstmt = conn.prepareStatement(
					"SELECT id FROM mysqlcharsettest");
			java.sql.ResultSet rs = pstmt.executeQuery();
			while(rs.next()) {
				string = rs.getString("id");
				System.out.println(string);
			}
			rs.close();
			pstmt.close();

			java.sql.PreparedStatement updatestringpstmt = conn.prepareStatement(
					"UPDATE mysqlcharsettest SET id = CASE 1 WHEN 0 THEN id ELSE ? END");
			updatestringpstmt.setString(1,string);
			updatestringpstmt.executeUpdate();
			updatestringpstmt.close();

		} finally {
			try {
			if (conn != null) {
				conn.close();
			}
			} catch (java.sql.SQLException e) {}
		}

	}

}

Suggested fix:
The workaround is using:
characterEncoding=utf8
as JDBC URL parameter

But, this should work as the character being used is latin1
[2 May 2010 19:57] Sveta Smirnova
Thank you for the report.

But what is the bug here? If you insert character into UTF8 column using latin1 character set it is expected data will be corrupted.
[3 May 2010 5:39] Eduardo Pérez Ureta
If you do not understand the bug report just try the code since it triggers the bug.

Did you try the code?

What I mean is that the first INSERT works correctly because the char belongs to the "Latin-1 Supplement" unicode group and the char is also at "latin1" charset
but the UPDATE fails being the same character!

There is a bug in MySQL server when handling the character set in the UPDATE statement!
[3 May 2010 5:59] Tonci Grgin
Eduardo, code is clear. What is not clear to you is that *any* connector will set character_set_result to NULL but, since you have character_set_server latin1, latin1 will be used character_set_result initially. Further more, how can you not understand that UTF8 *stream* can never be the same as latin1 *byte*??? Please do read at least some of charset documentation and, preferably, http://forge.mysql.com/wiki/How_to_Use_Charsets_and_Collations_Properly.
[3 May 2010 8:27] Eduardo Pérez Ureta
I have seen the presentation and understand all of the concepts, but it misleads because the mysql-connector-java sends "SET NAMES latin1" when "character_set_server" is "latin1" and the presentation says it always uses UTF8

Back to the bug in question.
Could you give an explanation why:
"INSERT INTO mysqlcharsettest (id) VALUES (?)" works
"UPDATE mysqlcharsettest SET id = CASE 1 WHEN 0 THEN id ELSE ? END" does not work
"UPDATE mysqlcharsettest SET id = ?" works
?
[5 May 2010 8:47] Susanne Ebrecht
Where did you see that the connector is sending "SET NAMES latin1"?

Please provide general log of these action.
[5 May 2010 8:51] Tonci Grgin
Eduardo, see my answers in between lines:

> I have seen the presentation and understand all of the concepts, but it misleads because the mysql-connector-java sends "SET NAMES latin1" when "character_set_server" is "latin1" and the presentation says it always uses UTF8

All of the connectors set names to UTF8 and charset_results to null. Setting later to null means that connectors auto-discover charset data is in. Right now, I can not think why would c/J send SET NAMES latin1, as Susanne already said.

> Could you give an explanation why:
"INSERT INTO mysqlcharsettest (id) VALUES (?)" works
"UPDATE mysqlcharsettest SET id = CASE 1 WHEN 0 THEN id ELSE ? END" does not work "UPDATE mysqlcharsettest SET id = ?" works

No as there seems to be quite a mess in your settings. Maybe a BINARY flag is set for the result (you can check that in cl. client started with --column-type-info) or something. So please provide, besides what Susanne asked, the output of:
  SHOW VARIABLES LIKE "%char%";
  SHOW CREATE DATABASE your_db;
  SHOW CREATE TABLE the_table_in_question;
[5 May 2010 11:31] Eduardo Pérez Ureta
Could you guys reproduce the bug using my code?

> Where did you see that the connector is sending "SET NAMES latin1"?
I fired up wireshark and saw it in the TCP communication.

Here are the commands sent:
/* mysql-connector-java-5.1.12 ( Revision: ${bzr.revision-id} ) */SHOW VARIABLES WHERE Variable_name ='language' OR Variable_name = 'net_write_timeout' OR Variable_name = 'interactive_timeout' OR Variable_name = 'wait_timeout' OR Variable_name = 'character_set_client' OR Variable_name = 'character_set_connection' OR Variable_name = 'character_set' OR Variable_name = 'character_set_server' OR Variable_name = 'tx_isolation' OR Variable_name = 'transaction_isolation' OR Variable_name = 'character_set_results' OR Variable_name = 'timezone' OR Variable_name = 'time_zone' OR Variable_name = 'system_time_zone' OR Variable_name = 'lower_case_table_names' OR Variable_name = 'max_allowed_packet' OR Variable_name = 'net_buffer_length' OR Variable_name = 'sql_mode' OR Variable_name = 'query_cache_type' OR Variable_name = 'query_cache_size' OR Variable_name = 'init_connect'
/* mysql-connector-java-5.1.12 ( Revision: ${bzr.revision-id} ) */SELECT @@session.auto_increment_increment
SHOW COLLATION
SET NAMES latin1
SET character_set_results = NULL
SET autocommit=1
SET sql_mode='STRICT_TRANS_TABLES'
CREATE TABLE mysqlcharsettest (id VARCHAR(100) NOT NULL, PRIMARY KEY(id)) ENGINE = InnoDB DEFAULT CHARACTER SET utf8
INSERT INTO mysqlcharsettest (id) VALUES ('í')
SELECT id FROM mysqlcharsettest
UPDATE mysqlcharsettest SET id = CASE 1 WHEN 0 THEN id ELSE 'í' END

If you cannot reproduce the bug using the provided code, I may send you the connection log.
What format would you like to get the log?

> All of the connectors set names to UTF8 and charset_results to null. Setting later to
null means that connectors auto-discover charset data is in. Right now, I can not think
why would c/J send SET NAMES latin1, as Susanne already said.

I am using mysql-connector-java-5.1.12 and it is sending "SET NAMES latin1"

All of these queries are done using the mysql-connector-java-5.1.12

SHOW VARIABLES LIKE "%char%";
character_set_client     latin1
character_set_connection latin1
character_set_database   latin1
character_set_filesystem binary
character_set_results    NULL
character_set_server     latin1
character_set_system     utf8
character_sets_dir       /usr/share/mysql/charsets/

SHOW CREATE DATABASE bugtest;
bugtest CREATE DATABASE `bugtest` /*!40100 DEFAULT CHARACTER SET latin1 */

SHOW CREATE TABLE mysqlcharsettest;
mysqlcharsettest
CREATE TABLE `mysqlcharsettest` (
  `id` varchar(100) NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
[6 May 2010 5:16] Susanne Ebrecht
Hello Eduardo,

you are totally right my presentation has a bug.

The JDBC driver uses character_set_server and does a SET NAMES <whatever_character_set_server_is_set_to>.

Unless the user overrides it with characterEncoding=<whatever> in the URL.

But where do you have a bug?
[6 May 2010 5:47] Eduardo Pérez Ureta
Thanks for recognizing the bug in your presentation.
I would really like mysql-connector-java to use utf8 by default, will you fix it?

Did you try the code I provided?
Did it fail in the UPDATE?
Do you think it should fail?
[6 May 2010 12:13] Tonci Grgin
Eduardo:

> Thanks for recognizing the bug in your presentation.

I would not say a bug, mistake maybe. Susanne used to work a lot on ODBC and ODBC standard defines so called "W" functions for handling MB c-sets. Thus we developed fully MB c/ODBC version, 5.1 (as opposed to SB 3.51). Thus her mistake. Me? I have no excuse. I'm just so used to work with UTF-8 that I fail to recognize other options :)

> I would really like mysql-connector-java to use utf8 by default, will you fix it?

No as there is nothing to "fix". Answer is in the presentation. You are always free to use UTF8 when you want/need to.

So, MyODBC has separate SB and MB charsets versions (3.51 and 5.1 respectively).

c/J utilizes character_set_server because if client does not define it otherwise, this is what will be used as a default for the client anyways. So, SET NAMES latin1 sets character_set_client, character_set_result and character_set_connection to latin1. Instantly after that, (ALL) connectors set character_set_result to NULL (as can be seen in logs) to allow for auto-discovery of actual charset data is in. This is only because *YOU* defined your server and client as such.

Regarding charactet_set_results, c/NET uses similar approach. It does 1 of 3 things:
  (1) set charactet_set_results=null
  (2) set character_set_results=<charset given by user on connection string>
  (3) set character_set_results=<charset given by server upon connection or default>
which is basically the same as c/J does in your case. And it also issues SET NAMES latin1 if conditions met.

Basically, it all comes down to this. ALL three variables Susanne mentioned should be in sync otherwise you will get either garbled output or input. If you need to cover wide range of characters, go for UTF8. If speed/space is of essence, use SB charset.

> Did you try the code I provided?
Yes.

> Did it fail in the UPDATE?
No

Do you think it should fail?
No.
[6 May 2010 13:19] Eduardo Pérez Ureta
I have tested MySQL 5.1.45-3 (Debian) (x86 and x86-64) and MySQL 5.4.2 (SUSE) (x86)
What version and operating system did you use?
Did you check the charsets where configured and I sent?
[14 May 2010 11:10] Susanne Ebrecht
Eudardo,

I am missing here what do you think is the bug here?
[14 May 2010 23:39] Eduardo Pérez Ureta
The bug is the update:
UPDATE mysqlcharsettest SET id = CASE 1 WHEN 0 THEN id ELSE ? END
that fails with:
Incorrect string value: '\xED' for column 'id' at row 1

If you are not able to reproduce the bug I would like to know what database version and OS are you using and what are the charsets configured on the server.
The bug is triggered at MySQL 5.1.45-3 (Debian) (x86 and x86-64) and MySQL 5.4.2 (SUSE) (x86)
Can you confirm or deny the bug using a similar MySQL version and OS and let me know what they are?
[17 May 2010 11:44] Susanne Ebrecht
Did you set character_set_server to utf8?
[17 May 2010 21:59] Eduardo Pérez Ureta
No, as you shaw in one of my former posts:
character_set_server     latin1

Were you able to reproduce the bug?
[18 May 2010 6:59] Susanne Ebrecht
Many thanks for all your informations.

This is not a bug.

When you work with JDBC and characters that are out of range from latin1 then you need to set character_set_server to utf8 or whatever character set supports the letters you want to use.

'í' is not part of latin1. It is part of utf8.
[18 May 2010 8:25] Eduardo Pérez Ureta
I am sorry to inform you, but 'í' is part of latin1:
http://en.wikipedia.org/wiki/ISO/IEC_8859-1

There is a bug in MySQL Server.
[19 May 2010 5:04] Susanne Ebrecht
But 'í' that you used is utf8 encoded and you told the server that incoming data are latin1 encoded.

This is not a bug. Either use latin1 in your environment for filling data or when you want to use utf8 on your OS/editor then change variable character_set_server to utf8.
[19 May 2010 5:48] Eduardo Pérez Ureta
I am sorry to let you know once again that mysql-connector-java is correctly encoding 'í' as latin1 on the wire when:
character_set_server     latin1
[21 May 2010 9:07] Susanne Ebrecht
What do you think is the bug then?

I am not able to see a bug here. All works as expected.
[21 May 2010 15:09] Eduardo Pérez Ureta
Could you give an explanation why:
"INSERT INTO mysqlcharsettest (id) VALUES (?)" works
"UPDATE mysqlcharsettest SET id = CASE 1 WHEN 0 THEN id ELSE ? END" does not work
"UPDATE mysqlcharsettest SET id = ?" works
?
[25 May 2010 7:16] Susanne Ebrecht
Did you look into hex values of your table?
Are values stored wrong?
Is hex value from input utf8 instead of iso-8859-1?

We have no bug here. It looks like you have a charset mix somewhere in your system.
[1 Jul 2010 16:49] Sergei Golubchik
The values must be right. Did you see the general query log posted above ?
The string value in the query is *correct*.

The important piece is:

"UPDATE mysqlcharsettest SET id = CASE 1 WHEN 0 THEN id ELSE ? END" does not work
"UPDATE mysqlcharsettest SET id = ?" works

which was repeated quite a few times.

The reason must be the CASE operator. It aggregates the types of all possible results. This one can return either a string or a number. This gets aggregated, the charset of a result is binary. So, the 'í' string isn't converted from latin1 to utf8 because it's no longer latin1 - it's binary.

It's well known very annoying behavior. But as it's documented it's not aa bug.
Fixed in WL#2649
[1 Jul 2010 17:22] Eduardo Pérez Ureta
Do you mean that it is fixed in a new major revision?
Would you fix it in a new major revision?