Bug #16526 utf8_unicode_ci can't distinguish some Japanese characters
Submitted: 16 Jan 2006 7:22 Modified: 15 May 2006 6:07
Reporter: Yukun Song
Status: Unsupported
Category:Server Severity:S4 (Feature request)
Version:5.0.20-Debian_1-log OS:Linux (Debian Linux)
Assigned to: Target Version:
Triage: D5 (Feature request)

[16 Jan 2006 7:22] Yukun Song
Description:
'が' and 'か' are not determined by COLLATION utf8_unicode_ci, but by COLLATION
utf8_general_ci.

How to repeat:
create a test table T with column c like the below:
create table T (
        id integer not null PRIMARY KEY AUTO_INCREMENT,
        word varchar(255) not nulL
) TYPE=MyISAM CHARACTER SET utf8 COLLATE utf8_unicode_ci;

insert two rows with Japanese characters
insert into T (word) values('が');
insert into T (word) values('か');

query, then see the problem
select * from T where word = 'か';
+----+------+
| id | word |
+----+------+
|  1 | が   |
|  2 | か   |
+----+------+

change connection to use unicode COLLATION
set collation_connection=utf8_unicode_ci;

then, try again, still get same problem
select * from T where word = 'か';
+----+------+
| id | word |
+----+------+
|  1 | が   |
|  2 | か   |
+----+------+
[16 Jan 2006 12:23] Valeriy Kravchuk
Thank you for a problem report. Please, inform about the exact version (5.0.x) of MySQL
server used, and hexademical representation of the characters you have problems with
(because they are all presented as '?' at this HTML page...)
[17 Feb 2006 1:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".
[21 Feb 2006 11:52] Yukun Song
The Japanese characters should be presented as '?' if you change your browser to use
UTF-8.
Two characters in hexademical are '304b' and '304c' respectively.

The MySQL server I 'm using is 5.0.16-Debian_0.dotdeb.1-log on Debian.
[21 Feb 2006 11:54] Yukun Song
sorry, I meant, the two characters should NOT be presented as '?' if you change your
brower to use UTF-8.
[1 Mar 2006 14:36] Valeriy Kravchuk
Please, send the results of

echo $LANG

command from your shell. Anyway, when I use hexademicals with 5.0.19-BK, I get only 1 row
for:

select * from T where word=x'304c'

and one row for x'304b'. So, they seems different in this version.
[2 Mar 2006 10:57] Yukun Song
Of course they are different if you query with hexademicals.
My $LANG is en_AU.UTF-8 .
[6 Apr 2006 18:14] Valeriy Kravchuk
Please, try to repeat with a newer version, 5.0.19, and inform about the results. If the
problem is still repeatable, please, attach a screenshots that demonstrates it.
[7 May 2006 1:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".
[7 May 2006 14:20] Yukun Song
Bad Japanese character determination

Attachment: JPchar-bad-dertermination.JPG (image/jpeg, text), 40.54 KiB.

[7 May 2006 14:28] Yukun Song
The attached snapshot is under the following environment settings,

Server version          5.0.20-Debian_1-log
Protocol version        10
Connection              Localhost via UNIX socket
UNIX socket             /var/run/mysqld/mysqld.sock

my.cnf:
[client]
default-character-set = utf8
[mysqld]
default-character-set = utf8
[mysql]
default-character-set = utf8
[7 May 2006 14:34] Yukun Song
collation utf8_general_ci works for the Japanese characters

Attachment: working_collation.JPG (image/jpeg, text), 39.66 KiB.

[14 May 2006 16:49] Valeriy Kravchuk
I've got a nice hint from one of our key developer, and would like to explain you this
problem. You are talking about two characters:

HIRAGANA LETTER KA (Unicode U+304B)
HIRAGANA LETTER GA (Unicode U+304C)

These are the voiced/unvoiced components of the same letter pair, that's why they look
almost the same.

When you says "'\が' and 'か' are not determined by COLLATION utf8_unicode_ci", the
meaning is: 'が' and 'か' are not distinguished, that is, they appear to be equal, when
I use COLLATION utf8_unicode_ci" --
similar to the fact that 'A' and 'a' are not distinguished
in a case-insensitive collation for latin1.

Looking at the "Unicode Collation Algorithm" table 4.0.0,
http://www.unicode.org/Public/UCA/4.0.0/allkeys-4.0.0.txt we find that HIRAGANA LETTER KA
and HIRAGANA LETTER GA both have the same primary weight: 1E57. Following that, we must
say that the characters are equal for searches.

Please, read the manual
(http://dev.mysql.com/doc/refman/5.0/en/charset-unicode-sets.html):

"MySQL implements the utf8_unicode_ci collation according to the Unicode Collation
Algorithm (UCA) described at http://www.unicode.org/reports/tr10/. The collation uses the
version-4.0.0 UCA weight keys:
http://www.unicode.org/Public/UCA/4.0.0/allkeys-4.0.0.txt."

You'll find the same thing for the next Hiragana letters in the chart, KI and GI (i.e.
304d = 304d), and so on.

We realize that people want different Japanese collation. They could get it with
utf8_general_ci or with CONVERT(... to sjis), but admittedly those are usually bad
solutions.

Plans exist for a new Japanese standard collation (work in progress). 

So, it is not a bug, but a documented behaviour.
[15 May 2006 6:07] Yukun Song
Well done Mr. Valeriy Kravchuk, but then I still have to go back to utf8_general_ci for
now.