Bug #28162 Problems with german special characters as öäü etc.
Submitted: 30 Apr 2007 10:43 Modified: 16 May 2007 14:31
Reporter: frank bretschneider Email Updates:
Status: Not a Bug Impact on me:
None 
Category:MySQL Server Severity:S1 (Critical)
Version:mysql-server-5.0.33 Multithreaded SQL da OS:Any
Assigned to: CPU Architecture:Any

[30 Apr 2007 10:43] frank bretschneider
Description:
Mysql does not handle special characters such as 'öäü' correctly. When inserting such values into the database they are converted into 'aou' etc. Other UTF8 characters ( e.g. japanese characters ) are stored correctly.

How to repeat:
I created an utf8 database.

-------------------------

CREATE TABLE `bb1_wordlist` (
  `word` varchar(50) collate latin1_german1_ci NOT NULL default '',
  UNIQUE KEY `word` (`word`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 COLLATE=latin1_german1_ci ;

INSERT INTO `bb1_wordlist` VALUES ('wärs');
INSERT INTO `bb1_wordlist` VALUES ('wars');

ERROR 1062 : Duplicate entry 'wars' for key 1

-----------------------------

CREATE TABLE `bb1_wordlist2` (
  `word` varchar(50) NOT NULL default '',
  UNIQUE KEY `word` (`word`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_general_ci ;

INSERT INTO `bb1_wordlist2` VALUES ('wärs');
INSERT INTO `bb1_wordlist2` VALUES ('wars');

ERROR 1062 : Duplicate entry 'wars' for key 1

-----------------------------

I also tried:

ALTER TABLE bb1_wordlist CHARACTER SET `binary` COLLATE `binary`;

INSERT INTO `bb1_wordlist` VALUES ('wärs');
INSERT INTO `bb1_wordlist` VALUES ('wars');

ERROR 1062 : Duplicate entry 'wars' for key 1

-----------------------------

I found this bug reported by another user. But the status had not been changed after my posting. The bug should have been fixed on a previous version but is not.

Frank

Suggested fix:
treat a different than ä. same for all other umlaute.
[30 Apr 2007 11:51] MySQL Verification Team
Thank you for the bug report. Accordin our Manual:
http://dev.mysql.com/doc/refman/5.0/en/charset-we-sets.html

The latin1_german1_ci and latin1_german2_ci collations are based on the DIN-1 and DIN-2 standards, where DIN stands for Deutsches Institut für Normung (the German equivalent of ANSI). DIN-1 is called the “dictionary collation” and DIN-2 is called the “phone book collation.” 

latin1_german1_ci (dictionary) rules: 
Ä = A
Ö = O
Ü = U
ß = s

What is the bug report are you referring in yoour comment:

"I found this bug reported by another user. But the status had not been changed
after my posting. The bug should have been fixed on a previous version but is
not."

Thanks in advance.
[30 Apr 2007 11:58] frank bretschneider
I am refering to the following bug:

http://bugs.mysql.com/bug.php?id=17985

And the bug is not about the order of a select statement but about the value that you insert. öäü are all UTF-8 characters. So when I do an insert statement with 'öäü' characters I must get these characters from a select statement as well.

Frank
[30 Apr 2007 12:27] MySQL Verification Team
Thank you for the feedback. I am still not able to repeat with current source
server. I am using the KDE Konsole with default encoding:

[miguel@light 5.0]$ bin/mysql -uroot test
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 1
Server version: 5.0.42-debug Source distribution

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

mysql> CREATE TABLE `bb1_wordlist` (
    ->   `word` varchar(50) collate latin1_german1_ci NOT NULL default '',
    ->   UNIQUE KEY `word` (`word`)
    -> ) ENGINE=MyISAM DEFAULT CHARSET=latin1 COLLATE=latin1_german1_ci ;
Query OK, 0 rows affected (0.06 sec)

mysql>
mysql> INSERT INTO `bb1_wordlist` VALUES ('wärs');
Query OK, 1 row affected (0.00 sec)

mysql> INSERT INTO `bb1_wordlist` VALUES ('wars');
Query OK, 1 row affected (0.00 sec)

mysql>
mysql> select * from bb1_wordlist;
+-------+
| word  |
+-------+
| wars  |
| wärs |
+-------+
2 rows in set (0.00 sec)

Which terminal are you using?
[30 Apr 2007 15:16] frank bretschneider
I am using to clients

 1. navicat windows client
 2. mysql-client-5.0.27 Multithreaded SQL database (client)

What I did now is:

 created a iso-8859-1 database
 created a utf8 database

I executed the following sql on both databases:

CREATE TABLE `bb1_wordlist` (
  `word` varchar(50) collate latin1_german1_ci NOT NULL default '',
   UNIQUE KEY `word` (`word`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 COLLATE=latin1_german1_ci ;

truncate table bb1_wordlist;
INSERT INTO `bb1_wordlist` VALUES ('wärs');
INSERT INTO `bb1_wordlist` VALUES ('wars');

I also create a test table with the default characterset utf8. Every time I execute the SQL I get the same error.

ERROR 1062 : Duplicate entry 'wars' for key 1
[4 May 2007 7:46] Hartmut Holzgraefe
i can't reproduce this either, this is probably due to client and connection charset settings and not due to the database and table charset.

what are your results for 

  SHOW VARIABLES LIKE 'char%';

and 

 SHOW VARIABLES LIKE 'coll%';

?

Mine are the server defaults:

mysql> show variables like 'char%';
+--------------------------+-----------------------------------------------+
| Variable_name            | Value                                         |
+--------------------------+-----------------------------------------------+
| character_set_client     | latin1                                        | 
| character_set_connection | latin1                                        | 
| character_set_database   | latin1                                        | 
| character_set_filesystem | binary                                        | 
| character_set_results    | latin1                                        | 
| character_set_server     | latin1                                        | 
| character_set_system     | utf8                                          | 
| character_sets_dir       | /usr/local/mysql-5.0.37/share/mysql/charsets/ | 
+--------------------------+-----------------------------------------------+
8 rows in set (0.38 sec)

mysql> show variables like 'coll%';
+----------------------+-------------------+
| Variable_name        | Value             |
+----------------------+-------------------+
| collation_connection | latin1_swedish_ci | 
| collation_database   | latin1_swedish_ci | 
| collation_server     | latin1_swedish_ci | 
+----------------------+-------------------+
3 rows in set (0.00 sec)

and your example works fine with these settings ...
[4 May 2007 10:21] frank bretschneider
My results are:

mysql> SHOW VARIABLES LIKE 'char%';

+--------------------------+----------------------------------+
| Variable_name            | Value                            |
+--------------------------+----------------------------------+
| character_set_client     | utf8                             |
| character_set_connection | utf8                             |
| character_set_database   | utf8                             |
| character_set_filesystem | binary                           |
| character_set_results    | utf8                             |
| character_set_server     | latin1                           |
| character_set_system     | utf8                             |
| character_sets_dir       | /usr/local/share/mysql/charsets/ |
+--------------------------+----------------------------------+
8 rows in set

mysql> SHOW VARIABLES LIKE 'coll%';
+----------------------+-------------------+
| Variable_name        | Value             |
+----------------------+-------------------+
| collation_connection | utf8_general_ci   |
| collation_database   | utf8_general_ci   |
| collation_server     | latin1_swedish_ci |
+----------------------+-------------------+
3 rows in set

So it is all utf8. 

Even if the characterset would be wrong within the connection the primary key should still not say that I try to insert a double key because ä and a sould be a different character in any characterset that contains ä and a.

When I generate an none utf-8 database it works. So the problem is the UTF-8 database. With these parameters it works fine.

mysql> SHOW VARIABLES LIKE 'char%';
+--------------------------+----------------------------------+
| Variable_name            | Value                            |
+--------------------------+----------------------------------+
| character_set_client     | utf8                             |
| character_set_connection | utf8                             |
| character_set_database   | latin1                           |
| character_set_filesystem | binary                           |
| character_set_results    | utf8                             |
| character_set_server     | latin1                           |
| character_set_system     | utf8                             |
| character_sets_dir       | /usr/local/share/mysql/charsets/ |
+--------------------------+----------------------------------+
8 rows in set

mysql> SHOW VARIABLES LIKE 'coll%';
+----------------------+-------------------+
| Variable_name        | Value             |
+----------------------+-------------------+
| collation_connection | utf8_general_ci   |
| collation_database   | latin1_swedish_ci |
| collation_server     | latin1_swedish_ci |
+----------------------+-------------------+
3 rows in set
[8 May 2007 17:38] MySQL Verification Team
Thank you for the feedback. This is expected behavior how explained in the
Manual: http://dev.mysql.com/doc/refman/5.0/en/charset-we-sets.html
and when the server not rejects the duplicate key you can see the actual
character using the hex() function.
[9 May 2007 12:22] frank bretschneider
If this is not a bug, why does the statement work on a latin1 database but not on an utf8-database. The table is latin1 and uses latin1 collation.

CREATE TABLE `bb1_wordlist` (
  `word` varchar(50) collate latin1_german1_ci NOT NULL default '',
  UNIQUE KEY `word` (`word`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 COLLATE=latin1_german1_ci ;

INSERT INTO `bb1_wordlist` VALUES ('wärs');
INSERT INTO `bb1_wordlist` VALUES ('wars');

Besides ...

If it is not a bug, I do not understand how I can create an utf-8 table with a unique or primary index on a column where I must be able store UTF-8 characters ( including latin1 äüö etc ). How do I solve the problem. Any ideas appreciated.

frank
I would like to know how to use an utf8 database with german special characters
[10 May 2007 19:25] Paul DuBois
From the 4 May 12:21 comment:

>Even if the characterset would be wrong within the
>connection the primary key should still not say that
>I try to insert a double key because ä and a sould be
>a different character in any characterset that contains
>ä and a.

That is not true. A primary key cannot contain two values that compare equal to each other, but the question of "equal" for a character column is a function of the collation.  For example, in a case-insensitive collation, 'a' and 'A' are equal to each other so you could not store both 'a' and 'A' in a primary key column. The same applies to accented and non-accented characters. They may or may not be equal to each other, depending on the collation. In some collations, 'a' compares equal to 'ä'. For such collations, you will see the duplicate-key error when you try to store both values.

The various errors or non-errors demonstrated in this bug report are probably partly due to whether acccented characters are being transmitted from the client to the server correctly. Correct transmission depends on a number of factors such as the encoding for your input/display device (e.g., your terrminal) and the encoding used for the connection between the MySQL client and server.

Example 1: Terminal set to use ISO-8859-1, shell locale set to ISO-8859-1 so that I can type in accented characters in that character set, and the mysql client set to latin1 character set. Note that  'a' compares equal to 'ä' for one collation but not the other:

mysql> set names latin1 collate latin1_swedish_ci;
Query OK, 0 rows affected (0.00 sec)

mysql> select @@character_set_connection 'charset',
    -> @@collation_connection AS 'collation';
+---------+-------------------+
| charset | collation         |
+---------+-------------------+
| latin1  | latin1_swedish_ci | 
+---------+-------------------+
1 row in set (0.00 sec)

mysql> select 'a' = 'ä';
+-----------+
| 'a' = 'ä' |
+-----------+
|         0 | 
+-----------+
1 row in set (0.00 sec)

mysql> set names latin1 collate latin1_german1_ci;
Query OK, 0 rows affected (0.01 sec)

mysql> select @@character_set_connection 'charset',
    -> @@collation_connection AS 'collation';
+---------+-------------------+
| charset | collation         |
+---------+-------------------+
| latin1  | latin1_german1_ci | 
+---------+-------------------+
1 row in set (0.00 sec)

mysql> select 'a' = 'ä';
+-----------+
| 'a' = 'ä' |
+-----------+
|         1 | 
+-----------+
1 row in set (0.00 sec)

For the latter collation, if you attempt to insert both values into a primary key, a duplicate-key error will occur.

Example 2: Terminal set to use UTF-8, shell locale set to UTF-9 so that I can type in accented characters in that character set, and the mysql client set to utf8 character set. Note that  'a' compares equal to 'ä' for the utf8_general_ci collation:

mysql> set names utf8 collate utf8_general_ci;
Query OK, 0 rows affected (0.00 sec)

mysql> select @@character_set_connection 'charset',
    -> @@collation_connection AS 'collation';
+---------+-----------------+
| charset | collation       |
+---------+-----------------+
| utf8    | utf8_general_ci | 
+---------+-----------------+
1 row in set (0.00 sec)

mysql> select 'a' = 'ä';
+------------+
| 'a' = 'ä' |
+------------+
|          1 | 
+------------+
1 row in set (0.00 sec)

Here too, if you attempt to insert both values into a primary key, a duplicate-key error will occur.

If you want to store both the accented and non-accented characters and have them be considered distinct, you must choose a collation that considers them distinct and not to be the same for comparison purposes.

There is no bug here. The behavior that will occur regarding duplicates is entirely a function of the comparison properties of the collation that is in use.
[15 May 2007 14:55] frank bretschneider
Thank you Paul DuBois,

Thank you for the detailed explaination. Still I am looking for the following problem:

I have to store int. characters in a database. In a table "footest" I need a column where 'ä' and 'a' is not equal.
Can you please advise me how to setup a table for that.

The following SQL must work:

insert into footest (id) values ('Müller');
insert into footest (id) values ('Muller');
insert into footest (id) values ('Mueller');
insert into footest (id) values ('用户名');

column id is the primary key.

Thank you 

Frank
[16 May 2007 14:31] Sergei Golubchik
We're sorry, but the bug system is not the appropriate forum for asking help on using MySQL products. Your problem is not the result of a bug.

Support on using our products is available both free in our forums at http://forums.mysql.com/ and for a reasonable fee direct from our skilled support engineers at http://www.mysql.com/support/

Thank you for your interest in MySQL.