Bug #81299 Unique index does not handle German umlauts properly
Submitted: 3 May 2016 21:08 Modified: 6 May 2016 10:17
Reporter: Martin Kirchner Email Updates:
Status: Duplicate Impact on me:
None 
Category:MySQL Server: DML Severity:S2 (Serious)
Version:5.6.30, 5.7.12 OS:Ubuntu
Assigned to: CPU Architecture:Any

[3 May 2016 21:08] Martin Kirchner
Description:
The two values 'Rösner' and 'Roßner' are handled as the same value and the unique index does not allow an insert of both values.

How to repeat:
root@localhost [testtags]> create table tags2 (id int(10) not null auto_increment, v varchar(20) not null, unique key idx_v (v), primary key (id)) ENGINE=InnoDB DEFAULT CHARSET=utf8 COMMENT='Table containing all tags';
Query OK, 0 rows affected (0.04 sec)

root@localhost [testtags]> insert into tags2 (v) values ('Roßner'), ('Rösner');
ERROR 1062 (23000): Duplicate entry 'Rösner' for key 'idx_v'

root@localhost [testtags]> show global variables like '%collat%';
+----------------------+-----------------+
| Variable_name        | Value           |
+----------------------+-----------------+
| collation_connection | utf8_general_ci |
| collation_database   | utf8_general_ci |
| collation_server     | utf8_general_ci |
+----------------------+-----------------+
3 rows in set (0.01 sec)
[3 May 2016 23:27] MySQL Verification Team
Thank you for the bug report.

C:\dbs>c:\dbs\5.7\bin\mysql -uroot -p --port=3570 --prompt="mysql 5.7 > "
Enter password: ******
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 3
Server version: 5.7.13 Source distribution PULL: 2016-APR-07

Copyright (c) 2000, 2016, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql 5.7 > use test
Database changed
mysql 5.7 > drop table tags2;
Query OK, 0 rows affected (0.26 sec)

mysql 5.7 > create table tags2 (id int(10) not null auto_increment, v varchar(20) not null, unique key idx_v (v), primary key (id)) ENGINE=I
le containing all tags';
Query OK, 0 rows affected (0.45 sec)

mysql 5.7 > insert into tags2 (v) values ('Roßner');
Query OK, 1 row affected (0.06 sec)

mysql 5.7 > insert into tags2 (v) values ('Rösner');
ERROR 1062 (23000): Duplicate entry 'Rösner' for key 'idx_v'
mysql 5.7 > show global variables like '%collat%';
+----------------------+-----------------+
| Variable_name        | Value           |
+----------------------+-----------------+
| collation_connection | utf8_unicode_ci |
| collation_database   | utf8_unicode_ci |
| collation_server     | utf8_unicode_ci |
+----------------------+-----------------+
3 rows in set (0.00 sec)
[4 May 2016 9:21] Bernt Marius Johnsen
This is not a bug. 

For the collation utf8_general_ci 'ö' == 'o' and 's' == 'ß' and this is correct according to the Unicode UCA algorithm.

You need to use the utf8_german2_ci collation to get German-specific handling of 'ö', 'ß' etc.:

mysql> create table tags2 (id int(10) not null auto_increment, v varchar(20) not null collate utf8_german2_ci, unique key idx_v (v), primary key (id)) ENGINE=InnoDB DEFAULT CHARSET=utf8 COMMENT='Table containing all tags';
Query OK, 0 rows affected (0.01 sec)

mysql> insert into tags2 (v) values ('Roßner')('Rösner');
Query OK, 2 rows affected (0.00 sec)
Records: 2  Duplicates: 0  Warnings: 0

This collation will also handle 'ss' vs 'ß' correctly. e.g:

mysql> insert into tags2 (v) values ('Straße'), ('STRASSE');
ERROR 1062 (23000): Duplicate entry 'STRASSE' for key 'idx_v'
mysql>
[4 May 2016 9:38] Martin Kirchner
Well, then it is at least a changed behavior. Under MySQL 5.0 it worked.
I created a dump under MySQL 5.0.91 (Percona) and tried to import it in 5.6 and 5.7. So the source table contains several values that differ only in the umlauts.

And the collation is the same:

Server version: 5.0.91-enterprise-log MySQL Enterprise Server (GPL)

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

root@localhost [a21e9f241b2f844ec9cd912d7be2ab560]>show global variables like '%coll%';
+----------------------+-----------------+
| Variable_name        | Value           |
+----------------------+-----------------+
| collation_connection | utf8_general_ci |
| collation_database   | utf8_general_ci |
| collation_server     | utf8_general_ci |
+----------------------+-----------------+
3 rows in set (0.00 sec)

@Bernt: What would be your proposed migration strategy?
[4 May 2016 11:34] Bernt Marius Johnsen
@Martin: show global variables like '%collat%'; 
Does not give a complete picture of the collation/character set of your table.
show create table 
and
show character set 
for the character sets mentioned in the table definition is needed.
[4 May 2016 13:14] Martin Kirchner
Hi Bernt,

as I said, I want to restore a dump created on 5.0 using mysqldump. So I can't influence the SQL that much.
It contains this table definition:

CREATE TABLE `tags` (
  `GGUID` binary(16) NOT NULL,
  `TAGGROUPGUID` binary(16) NOT NULL,
  `SYSTEMDEFINED` tinyint(1) NOT NULL default '0',
  `VALUEDE` varchar(255) default NULL,
  `VALUEEN` varchar(255) default NULL,
  `VALUEFR` varchar(255) default NULL,
  `VALUEIT` varchar(255) default NULL,
  `VALUEHU` varchar(255) default NULL,
  `VALUERO` varchar(255) default NULL,
  `VALUECS` varchar(255) default NULL,
  `USEREDITABLE` tinyint(1) NOT NULL default '0',
  `SORTINDEX` int(11) NOT NULL default '-1',
  `DELETABLE` tinyint(1) NOT NULL default '1',
  PRIMARY KEY  (`GGUID`),
  UNIQUE KEY `IDX_Tags_ValueDE_TagGroupGUID` (`VALUEDE`,`TAGGROUPGUID`),
  UNIQUE KEY `IDX_Tags_ValueEN_TagGroupGUID` (`VALUEEN`,`TAGGROUPGUID`),
  UNIQUE KEY `IDX_Tags_ValueFR_TagGroupGUID` (`VALUEFR`,`TAGGROUPGUID`),
  UNIQUE KEY `IDX_Tags_ValueIT_TagGroupGUID` (`VALUEIT`,`TAGGROUPGUID`),
  UNIQUE KEY `IDX_Tags_ValueHU_TagGroupGUID` (`VALUEHU`,`TAGGROUPGUID`),
  UNIQUE KEY `IDX_Tags_ValueRO_TagGroupGUID` (`VALUERO`,`TAGGROUPGUID`),
  UNIQUE KEY `IDX_Tags_ValueCS_TagGroupGUID` (`VALUECS`,`TAGGROUPGUID`),
  KEY `IDX_TagGroupGUID` (`TAGGROUPGUID`),
  KEY `IDX_UserEditable` (`USEREDITABLE`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COMMENT='Table containing all tags';

And also the insert in one of the unique indexes.
BTW: "SHOW CREATE TABLE tags" returns the same definition.

SHOW CHARACTER SET returns under 5.0:

root@localhost [a21e9f241b2f844ec9cd912d7be2ab560]>show character set like 'utf8';
+---------+---------------+-------------------+--------+
| Charset | Description   | Default collation | Maxlen |
+---------+---------------+-------------------+--------+
| utf8    | UTF-8 Unicode | utf8_general_ci   |      3 |
+---------+---------------+-------------------+--------+
1 row in set (0.00 sec)

Under 5.7.12:
root@localhost [none]> show character set like 'utf8';
+---------+---------------+-------------------+--------+
| Charset | Description   | Default collation | Maxlen |
+---------+---------------+-------------------+--------+
| utf8    | UTF-8 Unicode | utf8_general_ci   |      3 |
+---------+---------------+-------------------+--------+
1 row in set (0.01 sec)

My expectation actually is that I can restore the MySQL 5.0 dump under 5.6 or 5.7.
[4 May 2016 18:18] Martin Kirchner
Same error occurs if I use the German collation:

+----------------------+-----------------+
| Variable_name        | Value           |
+----------------------+-----------------+
| collation_connection | utf8_german2_ci |
| collation_database   | utf8_german2_ci |
| collation_server     | utf8_german2_ci |
+----------------------+-----------------+
[6 May 2016 10:17] Bernt Marius Johnsen
This is an duplicate of bug#43593.

A workaround is described there.