Bug #10684 accented characters are not unique
Submitted: 17 May 2005 11:40 Modified: 28 Jun 2005 13:58
Reporter: Leos Literak Email Updates:
Status: Not a Bug Impact on me:
None 
Category:MySQL Server Severity:S2 (Serious)
Version:4.1.11 OS:Linux (Debian Linux)
Assigned to: CPU Architecture:Any

[17 May 2005 11:40] Leos Literak
Description:
When I set default encoding to latin2, mysql starts to think, that accented characters are equal to nonaccented.

I was trying to install mysqldump from one installation to another one (with latin2 as default encoding) and it failed, that uniqueness constraint is broken. After digging in 100 MB file I realized that mysql thinks, that 'zdenek' and 'zdeněk' are same word! E with caron is not the same character as E without accent.

See http://lists.mysql.com/dotnet/493 for similar issue

How to repeat:
create table A (klic VARACHAR(50) UNIQUE);
insert into A values('zdenek' );
insert into A values('zdeněk' );

Suggested fix:
Do not treat accented characters as identical to nonaccented characters. It is nonsense.
[2 Jun 2005 15:30] Jorge del Conde
I was able to reproduce this using 5.0.4.

Thanks for your bug report
[3 Jun 2005 15:18] Jorge del Conde
Hi!

Can you please provide us with the output of the following:

mysql> show variables like 'coll%';

The reason for this is your bug report might be related to an incorrect collation used for the latin2 characterset, thus its something we need to check first.

To read more about collations, please refere to the following section in our manual: 10.3 :  http://dev.mysql.com/doc/mysql/en/charset-defaults.html

Thanks a lot!
[9 Jun 2005 4:57] Leos Literak
My admin set up new database and it works on it. I do not know, what is its settings. So the output of this command may be different than in original scenerio, that I reported. I do not have admin privileges there :-(

mysql> show variables like 'coll%';
| collation_connection | latin2_general_ci |
| collation_database   | latin2_czech_cs   |
| collation_server     | latin2_general_ci |
[28 Jun 2005 13:58] Aleksey Kishkin
Hi! if you set right collation, zdeněk and zdenek are different word:

mysql> create table A (klic VARCHAR(50) UNIQUE) charset=latin2 collate=latin2_czech_cs;
Query OK, 0 rows affected (0.02 sec)

mysql> insert into A values('zdenek' );
Query OK, 1 row affected (0.00 sec)

mysql> insert into A values('zdeněk' );
Query OK, 1 row affected (0.00 sec)

mysql> select * from A;
+--------+
| klic   |
+--------+
| zdenek |
| zdeněk |
+--------+
2 rows in set (0.00 sec)

the only way to reproduce  is -  to use latin2_general_ci collation (default for latin2). So,

mysql> drop table A;
Query OK, 0 rows affected (0.00 sec)

mysql> create table A (klic VARCHAR(50) UNIQUE) charset=latin2;
Query OK, 0 rows affected (0.01 sec)

mysql> insert into A values('zdenek' );
Query OK, 1 row affected (0.00 sec)

mysql> insert into A values('zdeněk' );
ERROR 1062 (23000): Duplicate entry 'zdeněk' for key 1

But it's not a bug, because you choose wrong collation.