Bug #46688 error message contains incorrect unicode key data
Submitted: 13 Aug 2009 5:54 Modified: 14 Aug 2009 5:32
Reporter: Neil Bacon
Status: Not a Bug
Category: Connector / J Severity: S2 (Serious)
Version: 5.0.5, 5.1.6 OS: Any
Assigned to:
CPU Architecture: Any
Tags: error message unicode utf8

[13 Aug 2009 5:54] Neil Bacon
Description:
On attempting to insert a row with a duplicate primary key, the error message contains incorrect data for the key. This occurs when saving a large object graph using Hibernate, so the correct data is essential in sorting out the problem.

mysql client/server: 5.1.31-1ubuntu2 (Ubuntu)

How to repeat:
Using mysql command line client:

create database charSetTest DEFAULT CHARACTER SET utf8 COLLATE utf8_bin;
use charSetTest;
create table t (v varchar(20), t text, primary key ( v ));

set character_set_client=utf8;
set character_set_connection=utf8;
set character_set_results=utf8;
set character_set_server=utf8;
show variables like 'char%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8                       | 
| character_set_connection | utf8                       | 
| character_set_database   | utf8                       | 
| character_set_filesystem | binary                     | 
| character_set_results    | utf8                       | 
| character_set_server     | utf8                       | 
| character_set_system     | utf8                       | 
| character_sets_dir       | /usr/share/mysql/charsets/ | 
+--------------------------+----------------------------+

insert into t (v, t) values ('中文', '中文');
select * from t;
+--------+--------+
| v      | t      |
+--------+--------+
| 中文 | 中文 | 
+--------+--------+
1 row in set (0.00 sec)

insert into t (v, t) values ('中文', '中文');
ERROR 1062 (23000): Duplicate entry '中文' for key 'PRIMARY'

OK so far. Now the same steps over JDBC show the problem:

jdbc url: jdbc:mysql://localhost/charSetTest?characterEncoding=UTF-8&characterSetResults=UTF-8

set character_set_server=utf8;
show variables like 'char%';

Variable_name	Value
character_set_client	utf8
character_set_connection	utf8
character_set_database	utf8
character_set_filesystem	binary
character_set_results	utf8
character_set_server	utf8
character_set_system	utf8
character_sets_dir	/usr/share/mysql/charsets/

select * from t;

v	t
中文	中文

insert into t (v, t) values ('中文', '中文');

Error: Duplicate entry '中文' for key 'PRIMARY'
SQLState:  23000
ErrorCode: 1062

Here the data within the quotes should have been '中文', as it is with the mysql command line client.
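
The JDBC steps above are not given as code in this report. As a rough sketch only, the three fields shown (message, SQLState, ErrorCode) could be printed as below. The class name DuplicateKeyRepro and the helper describe() are made up for illustration; the URL is copied from above, and running main() assumes a reachable server, the Connector/J driver on the classpath, and whatever credentials your setup needs (none are given in the report).

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

public class DuplicateKeyRepro {
    // URL copied from the report; credentials are not given there, so none are passed here.
    static final String URL =
        "jdbc:mysql://localhost/charSetTest?characterEncoding=UTF-8&characterSetResults=UTF-8";

    // Format an SQLException the same way as the output quoted in the report.
    static String describe(SQLException e) {
        return "Error: " + e.getMessage() + "\n"
             + "SQLState:  " + e.getSQLState() + "\n"
             + "ErrorCode: " + e.getErrorCode();
    }

    public static void main(String[] args) {
        // The key is written with Unicode escapes so the source file encoding does not matter.
        String sql = "insert into t (v, t) values ('\u4e2d\u6587', '\u4e2d\u6587')";
        try (Connection conn = DriverManager.getConnection(URL);
             Statement st = conn.createStatement()) {
            st.executeUpdate(sql); // fails with a duplicate-key error if the row already exists
        } catch (SQLException e) {
            System.out.println(describe(e));
        }
    }
}
```

Without a driver or server, main() simply prints whatever connection error DriverManager raises, in the same three-line format.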
[13 Aug 2009 6:39] Mark Matthews
The issue is that MySQL doesn't actually support Unicode in its error messages. This works with the mysql client because it is charset-unaware: your terminal is interpreting the "generic" bytes that mysqld sends back as UTF-8.

The JDBC driver isn't so lucky. It has to pick a character set to go from byte[] to char[] so that it can make java.lang.Strings. The only way for it to do this for error messages is to look at the server variable "language" to determine what character encoding is in use. Unfortunately, none of the encodings those languages map to is UTF-8.
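
To illustrate the byte[]-to-char[] step, here is a minimal Java sketch (not driver code). It assumes the server sends the key as raw UTF-8 bytes and compares decoding them as UTF-8 with decoding them as windows-1252 (the JDK's canonical name for Cp1252). The class name ErrorMessageDecoding and the method decode() are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ErrorMessageDecoding {
    // Turn raw bytes from the wire into a String using the given charset.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // The key from the report, written with Unicode escapes (U+4E2D, U+6587).
        String key = "\u4e2d\u6587";
        byte[] raw = key.getBytes(StandardCharsets.UTF_8); // 6 bytes: E4 B8 AD E6 96 87

        String asUtf8 = decode(raw, StandardCharsets.UTF_8);
        String asCp1252 = decode(raw, Charset.forName("windows-1252"));

        // Decoding with the right charset round-trips; the wrong one yields one char per byte.
        System.out.println("UTF-8 decode matches key:  " + asUtf8.equals(key));
        System.out.println("Cp1252 decode matches key: " + asCp1252.equals(key));
        System.out.println("Cp1252 decode length:      " + asCp1252.length());
    }
}
```

The same bytes give two different strings, which is why the mysql client (which passes the bytes straight to a UTF-8 terminal) and the driver (which must decode them itself) can disagree.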

There are plans to have the server send all error messages in UTF-8. When that happens, the driver will support it.

There might be a chance for someone to hack in support to always treat error messages as utf-8, even though they aren't, but at the moment it is not on any roadmap.
[14 Aug 2009 5:16] Neil Bacon
patch to use characterEncoding url param for error messages

Attachment: com.mysql.jdbc.CharsetMapping.java.patch (text/x-patch), 663 bytes.

[14 Aug 2009 5:32] Neil Bacon
The driver uses the following heuristic to guess the encoding for error messages:
get "language" property from server
map to character set: "english" -> "latin1"
map to java encoding: "latin1" -> "Cp1252"
(using hard coded mappings).

The patch I've just attached will use the url parameter "characterEncoding", if specified, in preference to the above. A new parameter could be added for this purpose - you've got to let the user override this heuristic somehow.
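
For readers without the attachment, the idea can be sketched roughly as follows. This is not the patch itself and not the driver's real code: the class ErrorEncodingChooser, its two maps, and the fallback to "latin1"/"Cp1252" for unknown languages are assumptions made for illustration. Only the "english" -> "latin1" -> "Cp1252" chain and the preference for "characterEncoding" come from the description above.

```java
import java.util.Map;

public class ErrorEncodingChooser {
    // Hypothetical, abbreviated stand-ins for the driver's hard-coded tables.
    static final Map<String, String> LANGUAGE_TO_CHARSET = Map.of("english", "latin1");
    static final Map<String, String> CHARSET_TO_JAVA = Map.of("latin1", "Cp1252");

    // The heuristic: server "language" -> MySQL charset -> Java encoding.
    // Falling back to latin1/Cp1252 for unknown values is an assumption of this sketch.
    static String heuristicEncoding(String serverLanguage) {
        String mysqlCharset = LANGUAGE_TO_CHARSET.getOrDefault(serverLanguage, "latin1");
        return CHARSET_TO_JAVA.getOrDefault(mysqlCharset, "Cp1252");
    }

    // The override: prefer the "characterEncoding" URL parameter when it is set.
    static String errorMessageEncoding(String characterEncodingParam, String serverLanguage) {
        if (characterEncodingParam != null && !characterEncodingParam.isEmpty()) {
            return characterEncodingParam;
        }
        return heuristicEncoding(serverLanguage);
    }

    public static void main(String[] args) {
        System.out.println(errorMessageEncoding(null, "english"));
        System.out.println(errorMessageEncoding("UTF-8", "english"));
    }
}
```

A dedicated parameter, as suggested above, would only change which value is passed as the first argument.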

Now the error message is useful. Isn't open source sweet?
Cheers,
    Neil.