MySQL Bugs: #19567: Case insensitive multilingual unicode collations

Bug #19567	Case insensitive multilingual unicode collations
Submitted:	5 May 2006 17:19	Modified:	27 Feb 2009 7:48
Reporter:	Lutz Schwarz	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: Charsets	Severity:	S4 (Feature request)
Version:	5.0.22, 4.1.15	OS:	Windows (windows)
Assigned to:	Assigned Account	CPU Architecture:	Any

Description:
All currently available case insensitive multilangual collations (e.g. utf8_general_ci and ucs2_general_ci) treat different characters the same even if they differ in more than just upper/lower case.

This problem does not happen with case insensitive multilingual 8-bit collations (e.g. latin1_general_ci).

This bug prevents that data can be converted from a case insensitive multilingual 8-bit character set to a case insensitive multilingual unicode character set.

The problem applies to utf8_unicode_ci and ucs2_unicode_ci as well, but for this collations this could be the desired behaviour.

It seems that the general unicode collations behave like german collations should behave. This could be the reason that there is no unicode collation for german.

How to repeat:
mysql> select convert('a' using latin1) = convert('ä' using latin1) collate latin1_general_ci;
+---------------------------------------------------------------------------------+
| convert('a' using latin1) = convert('ä' using latin1) collate latin1_general_ci |
+---------------------------------------------------------------------------------+
|                                                                               0 |
+---------------------------------------------------------------------------------+
1 row in set (0.00 sec)

# The above was an expected result

mysql> select convert('a' using utf8) = convert('ä' using utf8) collate utf8_general_ci;
+---------------------------------------------------------------------------+
| convert('a' using utf8) = convert('ä' using utf8) collate utf8_general_ci |
+---------------------------------------------------------------------------+
|                                                                         1 |
+---------------------------------------------------------------------------+

# The above was an unexpected result. I could image a workaround like the following:

mysql> select upper(convert('a' using utf8)) = upper(convert('ä' using utf8)) collate utf8_bin;
+----------------------------------------------------------------------------------+
| upper(convert('a' using utf8)) = upper(convert('ä' using utf8)) collate utf8_bin |
+----------------------------------------------------------------------------------+
|                                                                                0 |
+----------------------------------------------------------------------------------+

# The above was an expected result. But there is no workaround for the following:

mysql> show create table person\G
*************************** 1. row ***************************
       Table: person
Create Table: CREATE TABLE `person` (
  `name` varchar(255) collate latin1_general_ci NOT NULL default '',
  PRIMARY KEY  (`name`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 COLLATE=latin1_general_ci
1 row in set (0.02 sec)

mysql> select * from person;
+------+
| name |
+------+
| A    |
| ä    |
+------+
2 rows in set (0.00 sec)

mysql> alter table person convert to CHARSET utf8 COLLATE utf8_general_ci;
ERROR 1062 (23000): Doppelter Eintrag 'Ã¤' für Schlüssel 1

Suggested fix:
It seems that the uni_plane[] data in ctype-utf8 is wrong for (at least) the sort member of MY_UNICASE_INFO; it contains values that suit german collations.

If the uni_plane[] tables got fixed, a german unicode collation should be supplied and implemented via UCA like utf8_turkish_ci etc.

A different way to fix the problem could be to implement new unicode collations like "utf8_bin_but_ci" and "ucs2_bin_but_ci" (binary but case insensitive, something like that...). This is a dirty but downward compatible fix I could live with.

files for ucs2_bin_but_ci solution

Attachment: m_ctype.h (application/octet-stream, text), 18.37 KiB.

files for ucs2_bin_but_ci solution

Attachment: charset-def.c (application/octet-stream, text), 6.97 KiB.

files for ucs2_bin_but_ci solution

Attachment: ctype-ucs2.c (application/octet-stream, text), 41.56 KiB.

files for ucs2_bin_but_ci solution

Attachment: charset-def.c (application/octet-stream, text), 6.97 KiB.

files for ucs2_bin_but_ci solution

Attachment: ctype-ucs2.c (application/octet-stream, text), 41.50 KiB.

Thank you for a problem report and patches. Verified just as described, also with 5.0.22 on Windows:

mysql> select version();
+---------------------+
| version()           |
+---------------------+
| 5.0.22-community-nt |
+---------------------+
1 row in set (0.00 sec)

mysql> CREATE TABLE `person` (
    ->   `name` varchar(255) collate latin1_general_ci NOT NULL default ''
    ->   PRIMARY KEY  (`name`)
    -> ) ENGINE=MyISAM DEFAULT CHARSET=latin1 COLLATE=latin1_general_ci;
Query OK, 0 rows affected (0.05 sec)

mysql> INSERT INTO `person` VALUES ('A'), ('ä');
Query OK, 2 rows affected (0.03 sec)
Records: 2  Duplicates: 0  Warnings: 0

mysql> alter table person convert to CHARSET utf8 COLLATE utf8_general_ci;
ERROR 1062 (23000): Duplicate entry 'Ã¤' for key 1

As you just want to add another, new collation, it is a feature request.

A feature? Sorry, but I do not agree with that. The canonical fix is to correct behaviour of case insensitive multilingual unicode collations (namely utf8_general_ci and ucs2_general_ci) in a way, that they do what their names propose.

My suggestion to introduce new collations that behave like utf8_general_ci/ucs2_general_ci should behave is a poor hack - not a feature.

Since workaround is not to use unicode at all, I set severity to S3 as a compromise. In fact it is a serious bug because it prevents upgrading databases to unicode for use in international environments and markets.

In fact, latin1_general_ci and utf8_general_ci are not similar at all. And this is know fact. So, utf8_general_ci is just inappropriate name for current collation as it is implemented. At least this is a bug. "Proper" implementation of "correct" utf8_general_ci is a reasonable feature request.

After discussing with the team, we agreed that:

- this is not a bug. This is a feature request a new
accent sensitive collation for utf8 and ucs2.

- latin1_general_ci and utf8_general_ci behave differently,
and this is confusing. We're sorry about that.

- When we added latin1_general_ci, its name was chosen in a mistake.

- we won't fix latin1_general_ci to behave accent insensitivly,
like utf8_general_ci does. Changing order at this point would
be confusing too.

I don't get what you guys are doing. This is what we need: A way to search for strings in both case in- and sensitive ways with complete UTF-8 characterset on innodb, preferably with full text index. Now how hard can that be? ;)

But seriously, now I have to collate the text to the language just because you don't know how to order the characters. Who orders by text anyways, everyone orders by date!

Still, this is a reasonable feature request at least, IMHO. Should be implemented in 6.x.

Any news on when this might be implemented? This thread was started two and a half years ago... I too need accent sensitivity in UTF-8...

So, the real problem is that there are no accent sensitive unicode collations available for MySQL. There really should be.

To stop the confusion, I suggest information about the accent sensitivity should be added to the name of the collation, e.g. utf8_german_ci_ai (case insensitive / accent insensitive).

I too need an UTF8 collation that's case insensitive and accent sensitive.

Workaround: adding a case insensitive, accent sensitive collation to the charsets/Index.xml file:

http://www.skysql.com/blogs/hartmut/adding-case-insensitive-distinct-unicode-collation

New features will be forthcoming

Sushi = Beer ?! An introduction of UTF8 support in MySQL 8.0
http://mysqlserverteam.com/sushi-beer-an-introduction-of-utf8-support-in-mysql-8-0/

New collations in MySQL 8.0.0
http://mysqlserverteam.com/new-collations-in-mysql-8-0-0/

Thanks for the new collations in MySQL 8. But I'm still not seeing any unicode collations like as_ci (accent-sensitive but case-insensitive), which is what this was all about. What version can these be found in?