Bug #7558 UTF8 and unique/primary keys
Submitted: 28 Dec 2004 16:10 Modified: 30 Dec 2004 14:10
Reporter: Thilo Raufeisen Email Updates:
Status: Not a Bug Impact on me:
None 
Category:MySQL Server Severity:S2 (Serious)
Version:4.1.8/4.1.9 OS:Linux (Linux Gentoo)
Assigned to: CPU Architecture:Any

[28 Dec 2004 16:10] Thilo Raufeisen
Description:
Hi,
charactersets and collations are set to utf8

character_set_client  	utf8
character_set_connection 	utf8
character_set_database 	utf8
character_set_results 	utf8
character_set_server 	utf8
character_set_system 	utf8
character_sets_dir 	/usr/share/mysql/charsets/
collation_connection  	utf8_general_ci
collation_database 	utf8_general_ci
collation_server 	utf8_general_ci

The seconds insert-query fails (duplicate entry) but the values are different.

How to repeat:
CREATE TABLE `testtab` (
  `eintrag` varchar(255) NOT NULL default '',
  PRIMARY KEY  (`eintrag`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

INSERT INTO `testtab` VALUES ('tésteintrag');
INSERT INTO `testtab` VALUES ('testeintrag');
[28 Dec 2004 18:36] MySQL Verification Team
I tested on Slackware with latest BK 4.1 source tree and got a different
behavior because the first insert was truncated:

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 2 to server version: 4.1.9-debug-log

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

mysql> show variables like "%character%";
+--------------------------+--------------------------------------------+
| Variable_name            | Value                                      |
+--------------------------+--------------------------------------------+
| character_set_client     | utf8                                       |
| character_set_connection | utf8                                       |
| character_set_database   | utf8                                       |
| character_set_results    | utf8                                       |
| character_set_server     | utf8                                       |
| character_set_system     | utf8                                       |
| character_sets_dir       | /home/miguel/dbs/4.1/share/mysql/charsets/ |
+--------------------------+--------------------------------------------+
7 rows in set (0.00 sec)

mysql> show variables like "%collation%";
+----------------------+-----------------+
| Variable_name        | Value           |
+----------------------+-----------------+
| collation_connection | utf8_general_ci |
| collation_database   | utf8_general_ci |
| collation_server     | utf8_general_ci |
+----------------------+-----------------+
3 rows in set (0.01 sec)

mysql> CREATE TABLE `testtab` (
    ->   `eintrag` varchar(255) NOT NULL default '',
    ->   PRIMARY KEY  (`eintrag`)
    -> ) ENGINE=MyISAM DEFAULT CHARSET=utf8;
Query OK, 0 rows affected (0.00 sec)

mysql> 
mysql> INSERT INTO `testtab` VALUES ('tésteintrag');
Query OK, 1 row affected, 1 warning (0.00 sec)

mysql> INSERT INTO `testtab` VALUES ('testeintrag'); 
Query OK, 1 row affected (0.00 sec)

mysql> select * from testtab;
+-------------+
| eintrag     |
+-------------+
| t           |
| testeintrag |
+-------------+
2 rows in set (0.00 sec)

mysql>
[29 Dec 2004 7:36] Heikki Tuuri
Hi!

I guess that e and accent-e SHOULD be identified in the general UTF-8 collation. Thus, the original problem is not a bug.

But I do not see why Miguel got the value truncated. What kind of character sets are you using in the computer, Miguel?

In hundin, I was able to repeat the original behavior.

Regards,

Heikki

heikki@hundin:~/mysql-4.1/client> ./mysql test
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 1 to server version: 4.1.9-debug-log

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

mysql>
mysql>
mysql>
mysql> CREATE TABLE `testtab` (
    ->   `eintrag` varchar(255) NOT NULL default '',
    ->   PRIMARY KEY  (`eintrag`)
    -> ) ENGINE=MyISAM DEFAULT CHARSET=utf8;
Query OK, 0 rows affected (0.13 sec)

mysql> INSERT INTO `testtab` VALUES ('tésteintrag');
Query OK, 1 row affected (0.00 sec)

mysql> INSERT INTO `testtab` VALUES ('testeintrag');
ERROR 1062 (23000): Duplicate entry 'testeintrag' for key 1
mysql>
[29 Dec 2004 8:19] Sergei Golubchik
Thank you for taking the time to write to us, but this is not
a bug. Please double-check the documentation available at
http://www.mysql.com/documentation/ and the instructions on
how to report a bug at http://bugs.mysql.com/how-to-report.php

Additional info:

This is not a bug. How characters are compared are defined by the collation. In utf8_general_ci "é" = "e". Use a proper collation to get the results you need.
http://dev.mysql.com/doc/mysql/en/Charset-Unicode-sets.html

(As for Miguel's results it's apparently a problem of the terminal settings. In any case it's a different issue)
[29 Dec 2004 9:54] MySQL Verification Team
Heikki,

I though at first glance was my terminal but then I tested with default
character set and got the original behavior reported like you got.
So it is a terminal issue when using the same terminal and changing
the character-set the behavior are different ?

miguel@hegel:~/dbs/4.1$ bin/mysql -uroot test
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 2 to server version: 4.1.9-debug-log

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

mysql> show variables like "%character%";
+--------------------------+--------------------------------------------+
| Variable_name            | Value                                      |
+--------------------------+--------------------------------------------+
| character_set_client     | latin1                                     |
| character_set_connection | latin1                                     |
| character_set_database   | latin1                                     |
| character_set_results    | latin1                                     |
| character_set_server     | latin1                                     |
| character_set_system     | utf8                                       |
| character_sets_dir       | /home/miguel/dbs/4.1/share/mysql/charsets/ |
+--------------------------+--------------------------------------------+
7 rows in set (0.00 sec)

mysql> show variables like "%collation%";
+----------------------+-------------------+
| Variable_name        | Value             |
+----------------------+-------------------+
| collation_connection | latin1_swedish_ci |
| collation_database   | latin1_swedish_ci |
| collation_server     | latin1_swedish_ci |
+----------------------+-------------------+
3 rows in set (0.00 sec)

mysql> CREATE TABLE `testtab` (
    -> `eintrag` varchar(255) NOT NULL default '',
    -> PRIMARY KEY  (`eintrag`)
    -> ) ENGINE=MyISAM DEFAULT CHARSET=utf8;
Query OK, 0 rows affected (0.06 sec)

mysql> INSERT INTO `testtab` VALUES ('tésteintrag');
Query OK, 1 row affected (0.02 sec)

mysql> INSERT INTO `testtab` VALUES ('testeintrag');
ERROR 1062 (23000): Duplicate entry 'testeintrag' for key 1
mysql> exit
Bye
[30 Dec 2004 14:10] Sergei Golubchik
Ah, I see. Then it's "not a bug".

You know, when you set character_set_client to utf8, the server expects you to send correctly formed utf8 strings. But you terminal is not utf8 and the string 'tésteintrag' is sent in latin1 charset. As it's not valid utf8 string, it is truncated on the first invalid character.