Bug #25420 Tilde = 'Y' in Chinese
Submitted: 4 Jan 2007 18:08 Modified: 18 Jul 2008 15:59
Reporter: Peter Gulutzan Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Charsets Severity:S3 (Non-critical)
Version:5.1.15-beta-debug-log/5.0BK OS:Linux (SUSE 10.0 / 64-bit)
Assigned to: Alexander Barkov CPU Architecture:Any
Triage: D2 (Serious) / R5 (Severe)

[4 Jan 2007 18:08] Peter Gulutzan
Description:
If I search for 'Y', and the character set is Chinese,
I get '~' (i.e. tilde, 0x7e).

Incidentally, in strings/ctype-gb2312.c, I see:
"
static uchar NEAR sort_order_gb2312[]=
{
...
  'X',	 'Y',	'Z',   '{',   '|',   '}',   'Y',   '\177'
...
"
I see the same thing in tables for GB2312, BIG5, GBK, and TIS620 (Thai).

How to repeat:
mysql> create table tbig5 (s1 char(5) character set big5);
Query OK, 0 rows affected (0.11 sec)

mysql> insert into tbig5 values ('Y'),('~');
Query OK, 2 rows affected (0.08 sec)
Records: 2  Duplicates: 0  Warnings: 0

mysql> select * from tbig5 where s1 = 'Y';
+------+
| s1   |
+------+
| Y    |
| ~    |
+------+
2 rows in set (0.04 sec)
[5 Jan 2007 0:39] Miguel Solorzano
Thank you for the bug report. Verified as described on FC6 32-bit.
[4 Feb 2007 1:41] Peter Gulutzan
Here is another case which looks odd.

mysql> create table t17 (s1 varchar(5) character set big5);
Query OK, 0 rows affected (0.00 sec)

mysql> insert into t17 values (0x60);
Query OK, 1 row affected (0.00 sec)

mysql> select s1,hex(s1) from t17 where s1 = 'e';
+------+---------+
| s1   | hex(s1) |
+------+---------+
| `    | 60      |
+------+---------+
1 row in set (0.00 sec)
[27 Mar 2008 10:12] Alexander Barkov
Known problems:

big5_chinese_ci: E == e == `, Y == y == ~

cp866_general_ci: J > j

gb2312_chinese_ci: Y == y == ~

gbk_chinese_ci: Y == y == ~
[28 Apr 2008 23:52] Daniel Ahern
these are all the sets of chinese characters that are mistaken for one another in UTF-8

Attachment: testdictaccuracy-results (application/octet-stream, text), 7.60 KiB.

[28 Apr 2008 23:54] Daniel Ahern
This bug appears in UTF-8 also.  i've attached a file of all the sets of Chinese characters that are mistaken for each other.  Each set is listed as many times as there are characters.  For example, if a set has 4 characters that are each mistaken for each other, that set is listed four times.  The number in front of each set is the number of characters/words in the set.
[29 Apr 2008 4:36] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/46169

ChangeSet@1.2622, 2008-04-28 18:07:26+05:00, bar@mysql.com +14 -0
  Bug#25420 Tilde = 'Y' in Chinese
  Problems: wrong sort orders for the following characters:
  - TILDE and GRAVE ACCENT in big5_chinese_ci
  - LATIN SMALL LETTER J in cp866_general_ci
  - TILDE in gb2312_chinese_ci
  - TILDE in gbk_chinese_ci
  
  Fix: correct sort order for the above characters.
[30 Apr 2008 10:18] Alexander Barkov
Hi Daniel,

Please provide an SQL script demonstrating the problem.

We can't reproduce this problem in utf8:

mysql> show create table t1;
+-------+--------------------------------------------------------------------------------------------------------------+
| Table | Create Table                                                                                                 |
+-------+--------------------------------------------------------------------------------------------------------------+
| t1    | CREATE TABLE `t1` (
  `a` varchar(10) default NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8 |
+-------+--------------------------------------------------------------------------------------------------------------+
1 row in set (0.04 sec)

mysql> select hex(a), a from t1;
+--------+------+
| hex(a) | a    |
+--------+------+
| E4B881 | 丁   |
| E5B881 | 币   |
+--------+------+
2 rows in set (0.00 sec)

mysql> select * from t1 t11, t1 t12 where t11.a=t12.a;
+------+------+
| a    | a    |
+------+------+
| 丁   | 丁   |
| 币   | 币   |
+------+------+
2 rows in set (0.00 sec)

I.e. I joined the table to itself. 
As you can see, 丁 is not equal to 币.
[30 Apr 2008 14:41] Daniel Ahern
create table t1 (a varchar(10)) default charset=utf8;
insert into t1 value ('丁');
insert into t1 value ('币');
select * from t1 where a = '丁';
select * from t1 t11, t1 t12 where t11.a=t12.a;

mysql> create table t1 (a varchar(10)) default charset=utf8;
Query OK, 0 rows affected (0.02 sec)

mysql> insert into t1 value ('丁');
Query OK, 1 row affected (0.01 sec)

mysql> insert into t1 value ('币');
Query OK, 1 row affected (0.00 sec)

mysql> select * from t1 where a = '丁';
+------+
| a    |
+------+
| 丁  |
| 币  |
+------+
2 rows in set (0.00 sec)

mysql> select * from t1 t11, t1 t12 where t11.a=t12.a;
+------+------+
| a    | a    |
+------+------+
| 丁  | 丁  |
| 币  | 丁  |
| 丁  | 币  |
| 币  | 币  |
+------+------+
4 rows in set (0.00 sec)
[1 May 2008 5:51] Alexander Barkov
Daniel,

Please paste result of this query:

select hex(a), a from t1;
[1 May 2008 8:25] Daniel Ahern
select hex(a), a from t1;
+--------------+------+
| hex(a)       | a    |
+--------------+------+
| C3A4C2B8C281 | 丁  |
| C3A5C2B8C281 | 币  |
+--------------+------+
[1 May 2008 9:10] Alexander Barkov
Which console program do you use? xterm, gnome-terminal, konsole?
Which character set does it use?

What does "SHOW VARIABLES LIKE 'character_set%'" return?
[1 May 2008 13:09] Daniel Ahern
Konsole, utf8.  

SHOW VARIABLES LIKE 'character_set%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | latin1                     |
| character_set_connection | latin1                     |
| character_set_database   | latin1                     |
| character_set_filesystem | binary                     |
| character_set_results    | latin1                     |
| character_set_server     | latin1                     |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
[1 May 2008 13:18] Alexander Barkov
mysql thinks that the character set is latin1.
You need to run "mysql --default-character-set=utf8".
Another option is to add default-character-set=utf8 into
my.cnf.
The third option is to execute "set names utf8;" after 
starting mysql.
[5 May 2008 11:26] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/46349

ChangeSet@1.2642, 2008-05-05 16:22:31+05:00, bar@mysql.com +16 -0
  Bug#25420 Tilde = 'Y' in Chinese
  Problems: wrong sort orders for the following characters:
  - TILDE and GRAVE ACCENT in big5_chinese_ci
  - LATIN SMALL LETTER J in cp866_general_ci
  - TILDE in gb2312_chinese_ci
  - TILDE in gbk_chinese_ci
[5 May 2008 11:29] Alexander Barkov
Pushed into 6.0.6-engines.
[18 Jul 2008 8:29] Alexander Barkov
Merged into bzr-mysql-6.0.7
[18 Jul 2008 15:59] Paul Dubois
Noted in 6.0.6 changelog.

Certain characters were sorted incorrectly for the following
collations: TILDE and GRAVE ACCENT in big5_chinese_ci; LATIN SMALL
LETTER J in cp866_general_ci; TILDE in gb2312_chinese_ci; and TILDE
in gbk_chinese_ci. 

As a result of this fix, any indexes on columns that use these 
collations and contain the affected characters must be rebuilt when
upgrading to 6.0.6 or higher. To do this, use ALTER TABLE to drop and
re-add the indexes, or mysqldump to dump the affected tables and
mysql to reload the dump file.
[14 Sep 2008 0:51] Bugs System
Pushed into 6.0.7-alpha  (revid:sp1r-bar@mysql.com/bar.myoffice.izhnet.ru-20080505112231-54868) (version source revid:vvaintroub@mysql.com-20080804094710-jb2qpqxpf2ir2gf3) (pib:3)