Bug #16526 | utf8_unicode_ci can't distinguish some Japanese characters | ||
---|---|---|---|
Submitted: | 16 Jan 2006 6:22 | Modified: | 15 May 2006 4:07 |
Reporter: | Yukun Song | Email Updates: | |
Status: | Unsupported | Impact on me: | |
Category: | MySQL Server | Severity: | S4 (Feature request) |
Version: | 5.0.20-Debian_1-log | OS: | Linux (Debian Linux) |
Assigned to: | | CPU Architecture: | Any |
[16 Jan 2006 6:22]
Yukun Song
[16 Jan 2006 11:23]
Valeriy Kravchuk
Thank you for the problem report. Please tell us the exact version (5.0.x) of the MySQL server you are using, and the hexadecimal representation of the characters you have problems with (they are all displayed as '?' on this HTML page).
[17 Feb 2006 0:00]
Bugs System
No feedback was provided for this bug for over a month, so it is being suspended automatically. If you are able to provide the information that was originally requested, please do so and change the status of the bug back to "Open".
[21 Feb 2006 10:52]
Yukun Song
The Japanese characters should be presented as '?' if you change your browser to use UTF-8. The two characters in hexadecimal are '304b' and '304c' respectively. The MySQL server I'm using is 5.0.16-Debian_0.dotdeb.1-log on Debian.
[21 Feb 2006 10:54]
Yukun Song
Sorry, I meant that the two characters should NOT be presented as '?' if you change your browser to use UTF-8.
[1 Mar 2006 13:36]
Valeriy Kravchuk
Please send the output of the echo $LANG command from your shell. Anyway, when I use hexadecimal literals with 5.0.19-BK, I get only one row for: select * from T where word=x'304c' and one row for x'304b'. So they seem to be different in this version.
[2 Mar 2006 9:57]
Yukun Song
Of course they are different if you query with hexadecimal literals. My $LANG is en_AU.UTF-8.
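As a side note on the hexadecimal values exchanged above: '304b' and '304c' read as Unicode code points (U+304B, U+304C). Assuming the column uses MySQL's utf8 character set, the stored bytes would be the three-byte UTF-8 encodings instead, so a hex literal meant to match the stored character would presumably need those bytes rather than the code point digits. The minimal Python sketch below is an illustration only and not part of the original report; the helper name `describe` is hypothetical.

```python
# Illustration only: the two characters discussed in this report.
KA = "\u304b"  # HIRAGANA LETTER KA
GA = "\u304c"  # HIRAGANA LETTER GA


def describe(ch):
    """Return (code point, UTF-8 bytes as hex) for a single character."""
    return f"U+{ord(ch):04X}", ch.encode("utf-8").hex()


ka_info = describe(KA)
ga_info = describe(GA)

print(ka_info)  # ('U+304B', 'e3818b')
print(ga_info)  # ('U+304C', 'e3818c')
```

The code points differ by one, and so do the final UTF-8 bytes, so the two characters are always distinct at the byte level. Whether they compare as equal depends only on the collation.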
[6 Apr 2006 16:14]
Valeriy Kravchuk
Please try to repeat this with a newer version, 5.0.19, and let us know the results. If the problem is still repeatable, please attach a screenshot that demonstrates it.
[6 May 2006 23:00]
Bugs System
No feedback was provided for this bug for over a month, so it is being suspended automatically. If you are able to provide the information that was originally requested, please do so and change the status of the bug back to "Open".
[7 May 2006 12:20]
Yukun Song
Bad Japanese character determination
Attachment: JPchar-bad-dertermination.JPG (image/jpeg, text), 40.54 KiB.
[7 May 2006 12:28]
Yukun Song
The attached snapshot is under the following environment settings:
Server version: 5.0.20-Debian_1-log
Protocol version: 10
Connection: Localhost via UNIX socket
UNIX socket: /var/run/mysqld/mysqld.sock
my.cnf:
[client]
default-character-set = utf8
[mysqld]
default-character-set = utf8
[mysql]
default-character-set = utf8
[7 May 2006 12:34]
Yukun Song
collation utf8_general_ci works for the Japanese characters
Attachment: working_collation.JPG (image/jpeg, text), 39.66 KiB.
[14 May 2006 14:49]
Valeriy Kravchuk
I've got a nice hint from one of our key developers, and would like to explain this problem to you.

You are talking about two characters:
HIRAGANA LETTER KA (Unicode U+304B)
HIRAGANA LETTER GA (Unicode U+304C)
These are the unvoiced and voiced members of the same letter pair, which is why they look almost the same.

When you say "'が' and 'か' are not determined by COLLATION utf8_unicode_ci", the meaning is: "'が' and 'か' are not distinguished, that is, they appear to be equal, when I use COLLATION utf8_unicode_ci". This is similar to the fact that 'A' and 'a' are not distinguished in a case-insensitive collation for latin1.

Looking at the "Unicode Collation Algorithm" table 4.0.0, http://www.unicode.org/Public/UCA/4.0.0/allkeys-4.0.0.txt, we find that HIRAGANA LETTER KA and HIRAGANA LETTER GA both have the same primary weight: 1E57. Following that, we must say that the characters are equal for searches.

Please read the manual (http://dev.mysql.com/doc/refman/5.0/en/charset-unicode-sets.html): "MySQL implements the utf8_unicode_ci collation according to the Unicode Collation Algorithm (UCA) described at http://www.unicode.org/reports/tr10/. The collation uses the version-4.0.0 UCA weight keys: http://www.unicode.org/Public/UCA/4.0.0/allkeys-4.0.0.txt."

You'll find the same thing for the next Hiragana letters in the chart, KI and GI (i.e. 304d = 304e), and so on.

We realize that people want a different Japanese collation. They could get it with utf8_general_ci or with CONVERT(... to sjis), but admittedly those are usually bad solutions. Plans exist for a new Japanese standard collation (work in progress).

So, it is not a bug, but documented behaviour.
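The relationship between the two letters can be made visible outside MySQL. The Python sketch below is an illustration only: it is neither the Unicode Collation Algorithm nor MySQL's implementation. It relies on the fact that HIRAGANA LETTER GA canonically decomposes into HIRAGANA LETTER KA followed by the combining voiced sound mark U+3099, and it uses a hypothetical helper, `primary_key`, that approximates a primary-strength comparison by dropping combining marks after decomposition.

```python
import unicodedata

KA = "\u304b"  # HIRAGANA LETTER KA
GA = "\u304c"  # HIRAGANA LETTER GA
KI = "\u304d"  # HIRAGANA LETTER KI
GI = "\u304e"  # HIRAGANA LETTER GI


def primary_key(text):
    """Hypothetical helper: crude stand-in for a primary-strength key.

    Decomposes the text (NFD) and drops combining marks. This is an
    assumption made for illustration, not the real UCA weight lookup.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


# GA decomposes into KA plus the combining voiced sound mark.
ga_parts = [f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", GA)]
print(ga_parts)  # ['U+304B', 'U+3099']

# Exact comparison distinguishes the letters; the approximate
# primary-strength comparison does not.
print(KA == GA)                            # False
print(primary_key(KA) == primary_key(GA))  # True
print(primary_key(KI) == primary_key(GI))  # True
```

In this simplified view the voicing mark plays a role comparable to letter case in a case-insensitive collation: it is a secondary difference that a comparison looking only at base letters ignores.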
[15 May 2006 4:07]
Yukun Song
Well done Mr. Valeriy Kravchuk, but then I still have to go back to utf8_general_ci for now.