Bug #19567 Case insensitive multilingual unicode collations
Submitted: 5 May 2006 17:19 Modified: 27 Feb 2009 7:48
Reporter: Lutz Schwarz Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: Charsets Severity:S4 (Feature request)
Version:5.0.22, 4.1.15 OS:Microsoft Windows (windows)
Assigned to: Assigned Account CPU Architecture:Any
Triage: Needs Triage: D5 (Feature request)

[5 May 2006 17:19] Lutz Schwarz
Description:
All currently available case insensitive multilangual collations (e.g. utf8_general_ci and ucs2_general_ci) treat different characters the same even if they differ in more than just upper/lower case.

This problem does not happen with case insensitive multilingual 8-bit collations (e.g. latin1_general_ci).

This bug prevents that data can be converted from a case insensitive multilingual 8-bit character set to a case insensitive multilingual unicode character set.

The problem applies to utf8_unicode_ci and ucs2_unicode_ci as well, but for this collations this could be the desired behaviour.

It seems that the general unicode collations behave like german collations should behave. This could be the reason that there is no unicode collation for german.

How to repeat:
mysql> select convert('a' using latin1) = convert('ä' using latin1) collate latin1_general_ci;
+---------------------------------------------------------------------------------+
| convert('a' using latin1) = convert('ä' using latin1) collate latin1_general_ci |
+---------------------------------------------------------------------------------+
|                                                                               0 |
+---------------------------------------------------------------------------------+
1 row in set (0.00 sec)

# The above was an expected result

mysql> select convert('a' using utf8) = convert('ä' using utf8) collate utf8_general_ci;
+---------------------------------------------------------------------------+
| convert('a' using utf8) = convert('ä' using utf8) collate utf8_general_ci |
+---------------------------------------------------------------------------+
|                                                                         1 |
+---------------------------------------------------------------------------+

# The above was an unexpected result. I could image a workaround like the following:

mysql> select upper(convert('a' using utf8)) = upper(convert('ä' using utf8)) collate utf8_bin;
+----------------------------------------------------------------------------------+
| upper(convert('a' using utf8)) = upper(convert('ä' using utf8)) collate utf8_bin |
+----------------------------------------------------------------------------------+
|                                                                                0 |
+----------------------------------------------------------------------------------+

# The above was an expected result. But there is no workaround for the following:

mysql> show create table person\G
*************************** 1. row ***************************
       Table: person
Create Table: CREATE TABLE `person` (
  `name` varchar(255) collate latin1_general_ci NOT NULL default '',
  PRIMARY KEY  (`name`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 COLLATE=latin1_general_ci
1 row in set (0.02 sec)

mysql> select * from person;
+------+
| name |
+------+
| A    |
| ä    |
+------+
2 rows in set (0.00 sec)

mysql> alter table person convert to CHARSET utf8 COLLATE utf8_general_ci;
ERROR 1062 (23000): Doppelter Eintrag 'ä' für Schlüssel 1

Suggested fix:
It seems that the uni_plane[] data in ctype-utf8 is wrong for (at least) the sort member of MY_UNICASE_INFO; it contains values that suit german collations.

If the uni_plane[] tables got fixed, a german unicode collation should be supplied and implemented via UCA like utf8_turkish_ci etc.

A different way to fix the problem could be to implement new unicode collations like "utf8_bin_but_ci" and "ucs2_bin_but_ci" (binary but case insensitive, something like that...). This is a dirty but downward compatible fix I could live with.
[5 May 2006 17:48] Lutz Schwarz
files for ucs2_bin_but_ci solution

Attachment: m_ctype.h (application/octet-stream, text), 18.37 KiB.

[5 May 2006 17:48] Lutz Schwarz
files for ucs2_bin_but_ci solution

Attachment: charset-def.c (application/octet-stream, text), 6.97 KiB.

[5 May 2006 17:48] Lutz Schwarz
files for ucs2_bin_but_ci solution

Attachment: ctype-ucs2.c (application/octet-stream, text), 41.56 KiB.

[5 May 2006 17:49] Lutz Schwarz
files for ucs2_bin_but_ci solution

Attachment: charset-def.c (application/octet-stream, text), 6.97 KiB.

[5 May 2006 17:49] Lutz Schwarz
files for ucs2_bin_but_ci solution

Attachment: ctype-ucs2.c (application/octet-stream, text), 41.50 KiB.

[20 Jun 2006 12:46] Valeriy Kravchuk
Thank you for a problem report and patches. Verified just as described, also with 5.0.22 on Windows:

mysql> select version();
+---------------------+
| version()           |
+---------------------+
| 5.0.22-community-nt |
+---------------------+
1 row in set (0.00 sec)

mysql> CREATE TABLE `person` (
    ->   `name` varchar(255) collate latin1_general_ci NOT NULL default ''
    ->   PRIMARY KEY  (`name`)
    -> ) ENGINE=MyISAM DEFAULT CHARSET=latin1 COLLATE=latin1_general_ci;
Query OK, 0 rows affected (0.05 sec)

mysql> INSERT INTO `person` VALUES ('A'), ('ä');
Query OK, 2 rows affected (0.03 sec)
Records: 2  Duplicates: 0  Warnings: 0

mysql> alter table person convert to CHARSET utf8 COLLATE utf8_general_ci;
ERROR 1062 (23000): Duplicate entry 'ä' for key 1
[21 Jun 2006 12:08] Valeriy Kravchuk
As you just want to add another, new collation, it is a feature request.
[22 Jun 2006 10:40] Lutz Schwarz
A feature? Sorry, but I do not agree with that. The canonical fix is to correct behaviour of case insensitive multilingual unicode collations (namely utf8_general_ci and ucs2_general_ci) in a way, that they do what their names propose.

My suggestion to introduce new collations that behave like utf8_general_ci/ucs2_general_ci should behave is a poor hack - not a feature.

Since workaround is not to use unicode at all, I set severity to S3 as a compromise. In fact it is a serious bug because it prevents upgrading databases to unicode for use in international environments and markets.
[23 Jun 2006 9:09] Valeriy Kravchuk
In fact, latin1_general_ci and utf8_general_ci are not similar at all. And this is know fact. So, utf8_general_ci is just inappropriate name for current collation as it is implemented. At least this is a bug. "Proper" implementation of "correct" utf8_general_ci is a reasonable feature request.
[28 Jun 2006 6:40] Alexander Barkov
After discussing with the team, we agreed that:

- this is not a bug. This is a feature request a new
accent sensitive collation for utf8 and ucs2.

- latin1_general_ci and utf8_general_ci behave differently,
and this is confusing. We're sorry about that.

- When we added latin1_general_ci, its name was chosen in a mistake.

- we won't fix latin1_general_ci to behave accent insensitivly,
like utf8_general_ci does. Changing order at this point would
be confusing too.
[26 Feb 2009 20:15] Marc Larue
I don't get what you guys are doing. This is what we need: A way to search for strings in both case in- and sensitive ways with complete UTF-8 characterset on innodb, preferably with full text index. Now how hard can that be? ;)

But seriously, now I have to collate the text to the language just because you don't know how to order the characters. Who orders by text anyways, everyone orders by date!
[27 Feb 2009 7:48] Valeriy Kravchuk
Still, this is a reasonable feature request at least, IMHO. Should be implemented in 6.x.
[3 Dec 2009 16:37] Michael Baldwin
Any news on when this might be implemented? This thread was started two and a half years ago... I too need accent sensitivity in UTF-8...
[23 Jan 2010 21:47] ete88 ete88
So, the real problem is that there are no accent sensitive unicode collations available for MySQL. There really should be.

To stop the confusion, I suggest information about the accent sensitivity should be added to the name of the collation, e.g. utf8_german_ci_ai (case insensitive / accent insensitive).
[20 Jul 2010 23:19] André Cruz
I too need an UTF8 collation that's case insensitive and accent sensitive.
[12 Jun 2012 10:47] Hartmut Holzgraefe
Workaround: adding a case insensitive, accent sensitive collation to the charsets/Index.xml file:

http://www.skysql.com/blogs/hartmut/adding-case-insensitive-distinct-unicode-collation
[26 Mar 2017 19:37] monty solomon
New features will be forthcoming

Sushi = Beer ?! An introduction of UTF8 support in MySQL 8.0
http://mysqlserverteam.com/sushi-beer-an-introduction-of-utf8-support-in-mysql-8-0/

New collations in MySQL 8.0.0
http://mysqlserverteam.com/new-collations-in-mysql-8-0-0/