Bug #57860 Unique/Primary Index constraint fails with umlauts
Submitted: 30 Oct 2010 8:43 Modified: 4 Nov 2010 8:17
Reporter: Timo A. Hummel Email Updates:
Status: Duplicate Impact on me:
None 
Category:MySQL Server: Charsets Severity:S3 (Non-critical)
Version:5.1.51 OS:Linux
Assigned to: CPU Architecture:Any

[30 Oct 2010 8:43] Timo A. Hummel
Description:
Given the following example on a system completely setup for UTF-8:

CREATE TABLE  test (
test VARCHAR( 255 ) NOT NULL ,
PRIMARY KEY ( test)
) ENGINE = INNODB;

MySQL reports a constraint violation if you execute the code below. I've read about quite some bugs where Foeoe and Föö are the same - which is okay (but not too useful with indexes). However, Foo and Föö are two different words.

It seems to be a similar issue as in Bug #27600

If this is a bug, it should be fixed. If it is not a bug, this behaviour should be documented.

How to repeat:
INSERT INTO 'test' (test) VALUES ('Foo'),('Föö');
[30 Oct 2010 8:52] Timo A. Hummel
Okay, I think I know what the problem is. According to the collation manual, Ä and A are treated equal (same goes for other umlauts). Is there *any* chance to get a UTF-8 collation which actually treats Ä and A differently?
[30 Oct 2010 9:00] Peter Laursen
Works fine for me:
SET NAMES UTF8;

CREATE TABLE  test (
test VARCHAR( 255 ) NOT NULL ,
PRIMARY KEY ( test)
) ENGINE = INNODB;

INSERT INTO test (test) VALUES ('Foo'),('Föö'); 

but
SET NAMES latin1;
INSERT INTO test (test) VALUES ('Foo'),('Föö');
-- Error Code : 1062
-- Duplicate entry 'Foo' for key 'PRIMARY'

I do not think your system is 'completely set up for utf8'.  What does "SHOW VARIABLES LIKE 'character%';" return?

Peter
(not a MySQL person)
[30 Oct 2010 9:02] Timo A. Hummel
character_set_client	utf8
character_set_connection	utf8
character_set_database	utf8
character_set_filesystem	binary
character_set_results	utf8
character_set_server	utf8
character_set_system	utf8
character_sets_dir	/usr/share/mysql/charsets/
[30 Oct 2010 9:05] Peter Laursen
So it will make no difference if you "SET NAMES UTF8;"?
This looks to me like an issue with the client environment then.
[30 Oct 2010 9:06] Peter Laursen
Tried a Linux server too (5.1.36 64 bit on RHEL5).  Still works fine for me (using SQLyog as client).
[30 Oct 2010 9:07] Timo A. Hummel
No, this is most probably not an environment problem. As outlined in http://dev.mysql.com/doc/refman/5.1/en/charset-unicode-sets.html it seems to be the correct behaviour.

Did you reproduce that issue with a newer or older version or exactly the version I'm using?
[30 Oct 2010 9:09] Peter Laursen
I used first 5.1.51 64 bit on Windows and next 5.1.36 64 bit on Linux.
[30 Oct 2010 9:10] Peter Laursen
I believe you need to configure your console/client to use utf8.  But I have no clue about this. Just saw some similar replies here.
[30 Oct 2010 9:11] Timo A. Hummel
That's odd. Any other variables I should check? Maybe you could check the same, especially the collations on the table, database and fields?

Mine is all set to utf8_general_ci, same behavior if I change everything to utf8_unicode_ci.
[30 Oct 2010 9:17] Peter Laursen
Are you using a BASH-shell for the client? I think shell configuration is where you should look - but I have no option to work with the client environment on Linux and do not understand it very well. But my charset setting are the same as yours.

Let MySQL support people take over.
[30 Oct 2010 9:20] Timo A. Hummel
Well, it happens from the command line, from phpMyAdmin and even from my source code. I believe you have something else as "utf8_general_ci" or "utf8_unicode_ci" specified as your connection collation, so that might be the reason it works for you?

The key question I'd need to have answered is: Should MySQL convert anything when comparing index constraints or not?
[30 Oct 2010 10:09] Timo A. Hummel
More strict tests which explicitly adds the collation to the table, rather than relying on the defaults:

== Collation utf8_general_ci ==

CREATE TABLE  test2 (
test2 VARCHAR( 255 ) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL ,
PRIMARY KEY (  test2 )
) ENGINE = INNODB CHARACTER SET utf8 COLLATE utf8_general_ci;

INSERT INTO test2 ( test2 ) VALUES ('Foo'), ('Föö')

== Collation utf8_unicode_ci ==

CREATE TABLE  test3 (
test3 VARCHAR( 255 ) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL ,
PRIMARY KEY (  test3 )
) ENGINE = INNODB CHARACTER SET utf8 COLLATE utf8_unicode_ci;

INSERT INTO test3 ( test3 ) VALUES ('Foo'), ('Föö');
[31 Oct 2010 10:05] Peter Laursen
You are right about 'collation equality':

SET character_set_connection = utf8;

SELECT STRCMP((SELECT ('Foo') COLLATE utf8_general_ci),(SELECT ('Föö') COLLATE utf8_general_ci));
-- returns 0 -- they are identical

SELECT STRCMP((SELECT ('Foo') COLLATE utf8_unicode_ci),(SELECT ('Föö') COLLATE utf8_unicode_ci));
-- returns 0 -- they are identical

.. so in my understanding the order where 'Föö' and 'Foo' are returned is compltely random (depends on how they are stored in the tablespace, what is cached etc.)

There sees to be no Gernan utf8 collations.  With German latin1 collations

SET character_set_connection = latin1;
SELECT STRCMP((SELECT ('Foo') COLLATE latin1_german1_ci),(SELECT ('Föö') COLLATE latin1_german1_ci));
-- returns 0 -- they are identical
SELECT STRCMP((SELECT ('Foo') COLLATE latin1_german2_ci),(SELECT ('Föö') COLLATE latin1_german2_ci));
-- returns 1 -- they are different

So this looks like a request for a utf8_german2_ci collation.
[31 Oct 2010 10:27] Peter Laursen
Danish:

SET character_set_connection = utf8;
SELECT STRCMP((SELECT 'ü' COLLATE utf8_danish_ci), (SELECT 'u' COLLATE utf8_danish_ci)); -- (1)
SELECT STRCMP((SELECT 'y' COLLATE utf8_danish_ci), (SELECT 'ü' COLLATE utf8_danish_ci)); -- (0)

.. what is basically OK (but practical language always has exceptions though -  mostly with foreign names)
[4 Nov 2010 8:08] Susanne Ebrecht
This is a duplicate of bug #38758
[4 Nov 2010 8:15] Susanne Ebrecht
Timo,

just to explain it to you.

For foo not equal to föö you will need a collation that won't say that ö is treated as o.

Per default utf8_general_ci is taken which always just compare a single sign and in which ö is equal to o.

Of course you can take utf8_bin here already.
But utf8_bin has the disadvantage that it always take the binary which means also 'A' and 'a' are different.

For latin1 we already have latin1_german1_ci and latin1_german2_ci. Given by German standards (DIN 5007)

German1 means ö = o
German2 means ö = oe

Unfortunately, we don't have German2 collation for utf8 in MySQL 5.0 and 5.1.
But as you can read in the other bug report we implemented utf8_german2 collation into MySQL 5.5.
[4 Nov 2010 8:17] Timo A. Hummel
Hi Susanne,

thank you for your reply. The collation as in MySQL 5.5 is probably what I'm looking for, as I need proper ordering as well as "similar" names in unique-indexed columns. I will test it lateron and see if the implementation is sufficient.

cheers,
Timo