MySQL Bugs: #10684: accented characters are not unique

Bug #10684	accented characters are not unique
Submitted:	17 May 2005 11:40	Modified:	28 Jun 2005 13:58
Reporter:	Leos Literak	Email Updates:
Status:	Not a Bug	Impact on me:	None
Category:	MySQL Server	Severity:	S2 (Serious)
Version:	4.1.11	OS:	Linux (Debian Linux)
Assigned to:		CPU Architecture:	Any

Description:
When I set default encoding to latin2, mysql starts to think, that accented characters are equal to nonaccented.

I was trying to install mysqldump from one installation to another one (with latin2 as default encoding) and it failed, that uniqueness constraint is broken. After digging in 100 MB file I realized that mysql thinks, that 'zdenek' and 'zdeněk' are same word! E with caron is not the same character as E without accent.

See http://lists.mysql.com/dotnet/493 for similar issue

How to repeat:
create table A (klic VARACHAR(50) UNIQUE);
insert into A values('zdenek' );
insert into A values('zdeněk' );

Suggested fix:
Do not treat accented characters as identical to nonaccented characters. It is nonsense.

I was able to reproduce this using 5.0.4.

Thanks for your bug report

Hi!

Can you please provide us with the output of the following:

mysql> show variables like 'coll%';

The reason for this is your bug report might be related to an incorrect collation used for the latin2 characterset, thus its something we need to check first.

To read more about collations, please refere to the following section in our manual: 10.3 :  http://dev.mysql.com/doc/mysql/en/charset-defaults.html

Thanks a lot!

My admin set up new database and it works on it. I do not know, what is its settings. So the output of this command may be different than in original scenerio, that I reported. I do not have admin privileges there :-(

mysql> show variables like 'coll%';
| collation_connection | latin2_general_ci |
| collation_database   | latin2_czech_cs   |
| collation_server     | latin2_general_ci |

Hi! if you set right collation, zdeněk and zdenek are different word:

mysql> create table A (klic VARCHAR(50) UNIQUE) charset=latin2 collate=latin2_czech_cs;
Query OK, 0 rows affected (0.02 sec)

mysql> insert into A values('zdenek' );
Query OK, 1 row affected (0.00 sec)

mysql> insert into A values('zdeněk' );
Query OK, 1 row affected (0.00 sec)

mysql> select * from A;
+--------+
| klic   |
+--------+
| zdenek |
| zdeněk |
+--------+
2 rows in set (0.00 sec)

the only way to reproduce  is -  to use latin2_general_ci collation (default for latin2). So,

mysql> drop table A;
Query OK, 0 rows affected (0.00 sec)

mysql> create table A (klic VARCHAR(50) UNIQUE) charset=latin2;
Query OK, 0 rows affected (0.01 sec)

mysql> insert into A values('zdenek' );
Query OK, 1 row affected (0.00 sec)

mysql> insert into A values('zdeněk' );
ERROR 1062 (23000): Duplicate entry 'zdeněk' for key 1

But it's not a bug, because you choose wrong collation.