MySQL Bugs: #53356: Character set bug at server with utf8 column and latin1 connection

Bug #53356	Character set bug at server with utf8 column and latin1 connection
Submitted:	2 May 2010 11:13	Modified:	1 Jul 2010 17:22
Reporter:	Eduardo Pérez Ureta	Email Updates:
Status:	Not a Bug	Impact on me:	None
Category:	MySQL Server: Charsets	Severity:	S3 (Non-critical)
Version:	5.1.45-3 (Debian)	OS:	Linux
Assigned to:		CPU Architecture:	Any
Tags:	charset, java

Description:
// This bug is reproduced using a MySQL Linux default installation where "character_set_server" is "latin1"
// and the Java connector sends "SET NAMES latin1"
// The first INSERT and the SELECT work OK
// but the UPDATE fails with "Incorrect string value: '\xED' for column 'id' at row 1" (and it should not)

How to repeat:
public class MysqlCharsetBug {

	public static void main(String[] args) throws Exception {

		String jdbcClass = "com.mysql.jdbc.Driver";
		String jdbcURL = "jdbc:mysql://localhost/bugtest";
		String username = "root";
		String password = "";

		java.sql.Connection conn = null;
		try {
			Class.forName(jdbcClass);
			conn = java.sql.DriverManager.getConnection (jdbcURL, username, password);

			java.sql.PreparedStatement createtablepstmt = conn.prepareStatement(
			        "CREATE TABLE mysqlcharsettest (id VARCHAR(100) NOT NULL, PRIMARY KEY(id)) ENGINE = InnoDB DEFAULT CHARACTER SET utf8");
			createtablepstmt.executeUpdate();
			createtablepstmt.close();

			final String stringtoinsert = "\u00ED"; // "\u0155"; // if "SET NAMES latin1" then non latin1 characters are silently converted to '?'
			System.out.println(stringtoinsert);

			java.sql.PreparedStatement insertstringpstmt = conn.prepareStatement(
			        "INSERT INTO mysqlcharsettest (id) VALUES (?)");
			insertstringpstmt.setString(1,stringtoinsert);
			insertstringpstmt.executeUpdate();
			insertstringpstmt.close();

			String string = null;
			java.sql.PreparedStatement pstmt = conn.prepareStatement(
					"SELECT id FROM mysqlcharsettest");
			java.sql.ResultSet rs = pstmt.executeQuery();
			while(rs.next()) {
				string = rs.getString("id");
				System.out.println(string);
			}
			rs.close();
			pstmt.close();

			java.sql.PreparedStatement updatestringpstmt = conn.prepareStatement(
					"UPDATE mysqlcharsettest SET id = CASE 1 WHEN 0 THEN id ELSE ? END");
			updatestringpstmt.setString(1,string);
			updatestringpstmt.executeUpdate();
			updatestringpstmt.close();

		} finally {
			try {
			if (conn != null) {
				conn.close();
			}
			} catch (java.sql.SQLException e) {}
		}

	}

}

Suggested fix:
The workaround is using:
characterEncoding=utf8
as JDBC URL parameter

But, this should work as the character being used is latin1

Thank you for the report.

But what is the bug here? If you insert character into UTF8 column using latin1 character set it is expected data will be corrupted.

If you do not understand the bug report just try the code since it triggers the bug.

Did you try the code?

What I mean is that the first INSERT works correctly because the char belongs to the "Latin-1 Supplement" unicode group and the char is also at "latin1" charset
but the UPDATE fails being the same character!

There is a bug in MySQL server when handling the character set in the UPDATE statement!

Eduardo, code is clear. What is not clear to you is that *any* connector will set character_set_result to NULL but, since you have character_set_server latin1, latin1 will be used character_set_result initially. Further more, how can you not understand that UTF8 *stream* can never be the same as latin1 *byte*??? Please do read at least some of charset documentation and, preferably, http://forge.mysql.com/wiki/How_to_Use_Charsets_and_Collations_Properly.

I have seen the presentation and understand all of the concepts, but it misleads because the mysql-connector-java sends "SET NAMES latin1" when "character_set_server" is "latin1" and the presentation says it always uses UTF8

Back to the bug in question.
Could you give an explanation why:
"INSERT INTO mysqlcharsettest (id) VALUES (?)" works
"UPDATE mysqlcharsettest SET id = CASE 1 WHEN 0 THEN id ELSE ? END" does not work
"UPDATE mysqlcharsettest SET id = ?" works
?

Where did you see that the connector is sending "SET NAMES latin1"?

Please provide general log of these action.

Eduardo, see my answers in between lines:

> I have seen the presentation and understand all of the concepts, but it misleads because the mysql-connector-java sends "SET NAMES latin1" when "character_set_server" is "latin1" and the presentation says it always uses UTF8

All of the connectors set names to UTF8 and charset_results to null. Setting later to null means that connectors auto-discover charset data is in. Right now, I can not think why would c/J send SET NAMES latin1, as Susanne already said.

> Could you give an explanation why:
"INSERT INTO mysqlcharsettest (id) VALUES (?)" works
"UPDATE mysqlcharsettest SET id = CASE 1 WHEN 0 THEN id ELSE ? END" does not work "UPDATE mysqlcharsettest SET id = ?" works

No as there seems to be quite a mess in your settings. Maybe a BINARY flag is set for the result (you can check that in cl. client started with --column-type-info) or something. So please provide, besides what Susanne asked, the output of:
  SHOW VARIABLES LIKE "%char%";
  SHOW CREATE DATABASE your_db;
  SHOW CREATE TABLE the_table_in_question;

Could you guys reproduce the bug using my code?

> Where did you see that the connector is sending "SET NAMES latin1"?
I fired up wireshark and saw it in the TCP communication.

Here are the commands sent:
/* mysql-connector-java-5.1.12 ( Revision: ${bzr.revision-id} ) */SHOW VARIABLES WHERE Variable_name ='language' OR Variable_name = 'net_write_timeout' OR Variable_name = 'interactive_timeout' OR Variable_name = 'wait_timeout' OR Variable_name = 'character_set_client' OR Variable_name = 'character_set_connection' OR Variable_name = 'character_set' OR Variable_name = 'character_set_server' OR Variable_name = 'tx_isolation' OR Variable_name = 'transaction_isolation' OR Variable_name = 'character_set_results' OR Variable_name = 'timezone' OR Variable_name = 'time_zone' OR Variable_name = 'system_time_zone' OR Variable_name = 'lower_case_table_names' OR Variable_name = 'max_allowed_packet' OR Variable_name = 'net_buffer_length' OR Variable_name = 'sql_mode' OR Variable_name = 'query_cache_type' OR Variable_name = 'query_cache_size' OR Variable_name = 'init_connect'
/* mysql-connector-java-5.1.12 ( Revision: ${bzr.revision-id} ) */SELECT @@session.auto_increment_increment
SHOW COLLATION
SET NAMES latin1
SET character_set_results = NULL
SET autocommit=1
SET sql_mode='STRICT_TRANS_TABLES'
CREATE TABLE mysqlcharsettest (id VARCHAR(100) NOT NULL, PRIMARY KEY(id)) ENGINE = InnoDB DEFAULT CHARACTER SET utf8
INSERT INTO mysqlcharsettest (id) VALUES ('í')
SELECT id FROM mysqlcharsettest
UPDATE mysqlcharsettest SET id = CASE 1 WHEN 0 THEN id ELSE 'í' END

If you cannot reproduce the bug using the provided code, I may send you the connection log.
What format would you like to get the log?

> All of the connectors set names to UTF8 and charset_results to null. Setting later to
null means that connectors auto-discover charset data is in. Right now, I can not think
why would c/J send SET NAMES latin1, as Susanne already said.

I am using mysql-connector-java-5.1.12 and it is sending "SET NAMES latin1"

All of these queries are done using the mysql-connector-java-5.1.12

SHOW VARIABLES LIKE "%char%";
character_set_client     latin1
character_set_connection latin1
character_set_database   latin1
character_set_filesystem binary
character_set_results    NULL
character_set_server     latin1
character_set_system     utf8
character_sets_dir       /usr/share/mysql/charsets/

SHOW CREATE DATABASE bugtest;
bugtest CREATE DATABASE `bugtest` /*!40100 DEFAULT CHARACTER SET latin1 */

SHOW CREATE TABLE mysqlcharsettest;
mysqlcharsettest
CREATE TABLE `mysqlcharsettest` (
  `id` varchar(100) NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8

Hello Eduardo,

you are totally right my presentation has a bug.

The JDBC driver uses character_set_server and does a SET NAMES <whatever_character_set_server_is_set_to>.

Unless the user overrides it with characterEncoding=<whatever> in the URL.

But where do you have a bug?

Thanks for recognizing the bug in your presentation.
I would really like mysql-connector-java to use utf8 by default, will you fix it?

Did you try the code I provided?
Did it fail in the UPDATE?
Do you think it should fail?

Eduardo:

> Thanks for recognizing the bug in your presentation.

I would not say a bug, mistake maybe. Susanne used to work a lot on ODBC and ODBC standard defines so called "W" functions for handling MB c-sets. Thus we developed fully MB c/ODBC version, 5.1 (as opposed to SB 3.51). Thus her mistake. Me? I have no excuse. I'm just so used to work with UTF-8 that I fail to recognize other options :)

> I would really like mysql-connector-java to use utf8 by default, will you fix it?

No as there is nothing to "fix". Answer is in the presentation. You are always free to use UTF8 when you want/need to.

So, MyODBC has separate SB and MB charsets versions (3.51 and 5.1 respectively).

c/J utilizes character_set_server because if client does not define it otherwise, this is what will be used as a default for the client anyways. So, SET NAMES latin1 sets character_set_client, character_set_result and character_set_connection to latin1. Instantly after that, (ALL) connectors set character_set_result to NULL (as can be seen in logs) to allow for auto-discovery of actual charset data is in. This is only because *YOU* defined your server and client as such.

Regarding charactet_set_results, c/NET uses similar approach. It does 1 of 3 things:
  (1) set charactet_set_results=null
  (2) set character_set_results=<charset given by user on connection string>
  (3) set character_set_results=<charset given by server upon connection or default>
which is basically the same as c/J does in your case. And it also issues SET NAMES latin1 if conditions met.

Basically, it all comes down to this. ALL three variables Susanne mentioned should be in sync otherwise you will get either garbled output or input. If you need to cover wide range of characters, go for UTF8. If speed/space is of essence, use SB charset.

> Did you try the code I provided?
Yes.

> Did it fail in the UPDATE?
No

Do you think it should fail?
No.

I have tested MySQL 5.1.45-3 (Debian) (x86 and x86-64) and MySQL 5.4.2 (SUSE) (x86)
What version and operating system did you use?
Did you check the charsets where configured and I sent?

Eudardo,

I am missing here what do you think is the bug here?

The bug is the update:
UPDATE mysqlcharsettest SET id = CASE 1 WHEN 0 THEN id ELSE ? END
that fails with:
Incorrect string value: '\xED' for column 'id' at row 1

If you are not able to reproduce the bug I would like to know what database version and OS are you using and what are the charsets configured on the server.
The bug is triggered at MySQL 5.1.45-3 (Debian) (x86 and x86-64) and MySQL 5.4.2 (SUSE) (x86)
Can you confirm or deny the bug using a similar MySQL version and OS and let me know what they are?

Did you set character_set_server to utf8?

No, as you shaw in one of my former posts:
character_set_server     latin1

Were you able to reproduce the bug?

Many thanks for all your informations.

This is not a bug.

When you work with JDBC and characters that are out of range from latin1 then you need to set character_set_server to utf8 or whatever character set supports the letters you want to use.

'í' is not part of latin1. It is part of utf8.

I am sorry to inform you, but 'í' is part of latin1:
http://en.wikipedia.org/wiki/ISO/IEC_8859-1

There is a bug in MySQL Server.

But 'í' that you used is utf8 encoded and you told the server that incoming data are latin1 encoded.

This is not a bug. Either use latin1 in your environment for filling data or when you want to use utf8 on your OS/editor then change variable character_set_server to utf8.

I am sorry to let you know once again that mysql-connector-java is correctly encoding 'í' as latin1 on the wire when:
character_set_server     latin1

What do you think is the bug then?

I am not able to see a bug here. All works as expected.

Could you give an explanation why:
"INSERT INTO mysqlcharsettest (id) VALUES (?)" works
"UPDATE mysqlcharsettest SET id = CASE 1 WHEN 0 THEN id ELSE ? END" does not work
"UPDATE mysqlcharsettest SET id = ?" works
?

Did you look into hex values of your table?
Are values stored wrong?
Is hex value from input utf8 instead of iso-8859-1?

We have no bug here. It looks like you have a charset mix somewhere in your system.

The values must be right. Did you see the general query log posted above ?
The string value in the query is *correct*.

The important piece is:

"UPDATE mysqlcharsettest SET id = CASE 1 WHEN 0 THEN id ELSE ? END" does not work
"UPDATE mysqlcharsettest SET id = ?" works

which was repeated quite a few times.

The reason must be the CASE operator. It aggregates the types of all possible results. This one can return either a string or a number. This gets aggregated, the charset of a result is binary. So, the 'í' string isn't converted from latin1 to utf8 because it's no longer latin1 - it's binary.

It's well known very annoying behavior. But as it's documented it's not aa bug.
Fixed in WL#2649

Do you mean that it is fixed in a new major revision?
Would you fix it in a new major revision?