MySQL Bugs: #16526: utf8_unicode_ci can't distinguish some Japanese characters

Bug #16526	utf8_unicode_ci can't distinguish some Japanese characters
Submitted:	16 Jan 2006 6:22	Modified:	15 May 2006 4:07
Reporter:	Yukun Song	Email Updates:
Status:	Unsupported	Impact on me:	None
Category:	MySQL Server	Severity:	S4 (Feature request)
Version:	5.0.20-Debian_1-log	OS:	Linux (Debian Linux)
Assigned to:		CPU Architecture:	Any

Description:
'が' and 'か' are not determined by COLLATION utf8_unicode_ci, but by COLLATION utf8_general_ci.

How to repeat:
create a test table T with column c like the below:
create table T (
        id integer not null PRIMARY KEY AUTO_INCREMENT,
        word varchar(255) not nulL
) TYPE=MyISAM CHARACTER SET utf8 COLLATE utf8_unicode_ci;

insert two rows with Japanese characters
insert into T (word) values('が');
insert into T (word) values('か');

query, then see the problem
select * from T where word = 'か';
+----+------+
| id | word |
+----+------+
|  1 | が   |
|  2 | か   |
+----+------+

change connection to use unicode COLLATION
set collation_connection=utf8_unicode_ci;

then, try again, still get same problem
select * from T where word = 'か';
+----+------+
| id | word |
+----+------+
|  1 | が   |
|  2 | か   |
+----+------+

Thank you for a problem report. Please, inform about the exact version (5.0.x) of MySQL server used, and hexademical representation of the characters you have problems with (because they are all presented as '?' at this HTML page...)

No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".

The Japanese characters should be presented as '?' if you change your browser to use UTF-8.
Two characters in hexademical are '304b' and '304c' respectively.

The MySQL server I 'm using is 5.0.16-Debian_0.dotdeb.1-log on Debian.

sorry, I meant, the two characters should NOT be presented as '?' if you change your brower to use UTF-8.

Please, send the results of

echo $LANG

command from your shell. Anyway, when I use hexademicals with 5.0.19-BK, I get only 1 row for:

select * from T where word=x'304c'

and one row for x'304b'. So, they seems different in this version.

Of course they are different if you query with hexademicals.
My $LANG is en_AU.UTF-8 .

Please, try to repeat with a newer version, 5.0.19, and inform about the results. If the problem is still repeatable, please, attach a screenshots that demonstrates it.

No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".

Bad Japanese character determination

Attachment: JPchar-bad-dertermination.JPG (image/jpeg, text), 40.54 KiB.

The attached snapshot is under the following environment settings,

Server version          5.0.20-Debian_1-log
Protocol version        10
Connection              Localhost via UNIX socket
UNIX socket             /var/run/mysqld/mysqld.sock

my.cnf:
[client]
default-character-set = utf8
[mysqld]
default-character-set = utf8
[mysql]
default-character-set = utf8

collation utf8_general_ci works for the Japanese characters

Attachment: working_collation.JPG (image/jpeg, text), 39.66 KiB.

I've got a nice hint from one of our key developer, and would like to explain you this problem. You are talking about two characters:

HIRAGANA LETTER KA (Unicode U+304B)
HIRAGANA LETTER GA (Unicode U+304C)

These are the voiced/unvoiced components of the same letter pair, that's why they look almost the same.

When you says "'\が' and 'か' are not determined by COLLATION utf8_unicode_ci", the meaning is: 'が' and 'か' are not distinguished, that is, they appear to be equal, when I use COLLATION utf8_unicode_ci" --
similar to the fact that 'A' and 'a' are not distinguished
in a case-insensitive collation for latin1.

Looking at the "Unicode Collation Algorithm" table 4.0.0, http://www.unicode.org/Public/UCA/4.0.0/allkeys-4.0.0.txt we find that HIRAGANA LETTER KA and HIRAGANA LETTER GA both have the same primary weight: 1E57. Following that, we must say that the characters are equal for searches.

Please, read the manual (http://dev.mysql.com/doc/refman/5.0/en/charset-unicode-sets.html):

"MySQL implements the utf8_unicode_ci collation according to the Unicode Collation Algorithm (UCA) described at http://www.unicode.org/reports/tr10/. The collation uses the version-4.0.0 UCA weight keys:
http://www.unicode.org/Public/UCA/4.0.0/allkeys-4.0.0.txt."

You'll find the same thing for the next Hiragana letters in the chart, KI and GI (i.e. 304d = 304d), and so on.

We realize that people want different Japanese collation. They could get it with utf8_general_ci or with CONVERT(... to sjis), but admittedly those are usually bad solutions.

Plans exist for a new Japanese standard collation (work in progress).

So, it is not a bug, but a documented behaviour.

Well done Mr. Valeriy Kravchuk, but then I still have to go back to utf8_general_ci for now.