Bug #16526 utf8_unicode_ci can't distinguish some Japanese characters
Submitted: 16 Jan 2006 6:22 Modified: 15 May 2006 4:07
Reporter: Yukun Song Email Updates:
Status: Unsupported Impact on me:
None 
Category:MySQL Server Severity:S4 (Feature request)
Version:5.0.20-Debian_1-log OS:Linux (Debian Linux)
Assigned to: CPU Architecture:Any
Triage: D5 (Feature request)

[16 Jan 2006 6:22] Yukun Song
Description:
'が' and 'か' are not determined by COLLATION utf8_unicode_ci, but by COLLATION utf8_general_ci.

How to repeat:
create a test table T with column c like the below:
create table T (
        id integer not null PRIMARY KEY AUTO_INCREMENT,
        word varchar(255) not nulL
) TYPE=MyISAM CHARACTER SET utf8 COLLATE utf8_unicode_ci;

insert two rows with Japanese characters
insert into T (word) values('が');
insert into T (word) values('か');

query, then see the problem
select * from T where word = 'か';
+----+------+
| id | word |
+----+------+
|  1 | が   |
|  2 | か   |
+----+------+

change connection to use unicode COLLATION
set collation_connection=utf8_unicode_ci;

then, try again, still get same problem
select * from T where word = 'か';
+----+------+
| id | word |
+----+------+
|  1 | が   |
|  2 | か   |
+----+------+
[16 Jan 2006 11:23] Valeriy Kravchuk
Thank you for a problem report. Please, inform about the exact version (5.0.x) of MySQL server used, and hexademical representation of the characters you have problems with (because they are all presented as '?' at this HTML page...)
[17 Feb 2006 0:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".
[21 Feb 2006 10:52] Yukun Song
The Japanese characters should be presented as '?' if you change your browser to use UTF-8.
Two characters in hexademical are '304b' and '304c' respectively.

The MySQL server I 'm using is 5.0.16-Debian_0.dotdeb.1-log on Debian.
[21 Feb 2006 10:54] Yukun Song
sorry, I meant, the two characters should NOT be presented as '?' if you change your brower to use UTF-8.
[1 Mar 2006 13:36] Valeriy Kravchuk
Please, send the results of

echo $LANG

command from your shell. Anyway, when I use hexademicals with 5.0.19-BK, I get only 1 row for:

select * from T where word=x'304c'

and one row for x'304b'. So, they seems different in this version.
[2 Mar 2006 9:57] Yukun Song
Of course they are different if you query with hexademicals.
My $LANG is en_AU.UTF-8 .
[6 Apr 2006 16:14] Valeriy Kravchuk
Please, try to repeat with a newer version, 5.0.19, and inform about the results. If the problem is still repeatable, please, attach a screenshots that demonstrates it.
[6 May 2006 23:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".
[7 May 2006 12:20] Yukun Song
Bad Japanese character determination

Attachment: JPchar-bad-dertermination.JPG (image/jpeg, text), 40.54 KiB.

[7 May 2006 12:28] Yukun Song
The attached snapshot is under the following environment settings,

Server version          5.0.20-Debian_1-log
Protocol version        10
Connection              Localhost via UNIX socket
UNIX socket             /var/run/mysqld/mysqld.sock

my.cnf:
[client]
default-character-set = utf8
[mysqld]
default-character-set = utf8
[mysql]
default-character-set = utf8
[7 May 2006 12:34] Yukun Song
collation utf8_general_ci works for the Japanese characters

Attachment: working_collation.JPG (image/jpeg, text), 39.66 KiB.

[14 May 2006 14:49] Valeriy Kravchuk
I've got a nice hint from one of our key developer, and would like to explain you this problem. You are talking about two characters:

HIRAGANA LETTER KA (Unicode U+304B)
HIRAGANA LETTER GA (Unicode U+304C)

These are the voiced/unvoiced components of the same letter pair, that's why they look almost the same.

When you says "'\が' and 'か' are not determined by COLLATION utf8_unicode_ci", the meaning is: 'が' and 'か' are not distinguished, that is, they appear to be equal, when I use COLLATION utf8_unicode_ci" --
similar to the fact that 'A' and 'a' are not distinguished
in a case-insensitive collation for latin1.

Looking at the "Unicode Collation Algorithm" table 4.0.0, http://www.unicode.org/Public/UCA/4.0.0/allkeys-4.0.0.txt we find that HIRAGANA LETTER KA and HIRAGANA LETTER GA both have the same primary weight: 1E57. Following that, we must say that the characters are equal for searches.

Please, read the manual (http://dev.mysql.com/doc/refman/5.0/en/charset-unicode-sets.html):

"MySQL implements the utf8_unicode_ci collation according to the Unicode Collation Algorithm (UCA) described at http://www.unicode.org/reports/tr10/. The collation uses the version-4.0.0 UCA weight keys:
http://www.unicode.org/Public/UCA/4.0.0/allkeys-4.0.0.txt."

You'll find the same thing for the next Hiragana letters in the chart, KI and GI (i.e. 304d = 304d), and so on.

We realize that people want different Japanese collation. They could get it with utf8_general_ci or with CONVERT(... to sjis), but admittedly those are usually bad solutions.

Plans exist for a new Japanese standard collation (work in progress). 

So, it is not a bug, but a documented behaviour.
[15 May 2006 4:07] Yukun Song
Well done Mr. Valeriy Kravchuk, but then I still have to go back to utf8_general_ci for now.