MySQL Bugs: #25420: Tilde = 'Y' in Chinese

Bug #25420	Tilde = 'Y' in Chinese
Submitted:	4 Jan 2007 18:08	Modified:	18 Jul 2008 15:59
Reporter:	Peter Gulutzan
Status:	Closed
Category:	MySQL Server: Charsets	Severity:	S3 (Non-critical)
Version:	5.1.15-beta-debug-log/5.0BK	OS:	Linux (SUSE 10.0 / 64-bit)
Assigned to:	Alexander Barkov	CPU Architecture:	Any

Description:
If I search for 'Y' and the column's character set is Chinese,
the result also includes '~' (i.e. tilde, 0x7e).

Incidentally, in strings/ctype-gb2312.c, I see:
"
static uchar NEAR sort_order_gb2312[]=
{
...
  'X',	 'Y',	'Z',   '{',   '|',   '}',   'Y',   '\177'
...
"
I see the same thing in tables for GB2312, BIG5, GBK, and TIS620 (Thai).
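
To make the effect of that table entry concrete, here is a minimal Python sketch of a per-byte sort-order lookup. It is not the server's code: it assumes a comparison that replaces each byte by its table weight, covers ASCII only (multi-byte characters are ignored), and the names make_sort_order, BUGGY, FIXED and equal_ci are made up for illustration. The "fixed" weight for '~' is an assumption, not taken from the actual patch.

```python
# Sketch of a per-byte sort-order table like sort_order_gb2312[] (ASCII only).
# Assumption: strings compare equal when their weight sequences are equal.

def make_sort_order(tilde_weight):
    """Build a 256-entry weight table: identity, with a-z folded to A-Z."""
    table = list(range(256))
    for code in range(ord("a"), ord("z") + 1):
        table[code] = code - 32      # case-insensitive: 'y' weighs as 'Y'
    table[0x7E] = tilde_weight       # weight assigned to '~'
    return table

BUGGY = make_sort_order(ord("Y"))    # entry quoted above: '~' weighs as 'Y'
FIXED = make_sort_order(0x7E)        # hypothetical fix: '~' keeps its own weight

def weights(text, table):
    """Map an ASCII string to its list of weights."""
    return [table[b] for b in text.encode("ascii")]

def equal_ci(a, b, table):
    """Compare two ASCII strings by their weights."""
    return weights(a, table) == weights(b, table)

print(equal_ci("Y", "~", BUGGY))     # True: the reported behaviour
print(equal_ci("Y", "~", FIXED))     # False
print(equal_ci("Y", "y", FIXED))     # True: case folding is unaffected
```

With the BUGGY table, the weights for bytes 0x78..0x7F reproduce the quoted row ('X', 'Y', 'Z', '{', '|', '}', 'Y', '\177').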

How to repeat:
mysql> create table tbig5 (s1 char(5) character set big5);
Query OK, 0 rows affected (0.11 sec)

mysql> insert into tbig5 values ('Y'),('~');
Query OK, 2 rows affected (0.08 sec)
Records: 2  Duplicates: 0  Warnings: 0

mysql> select * from tbig5 where s1 = 'Y';
+------+
| s1   |
+------+
| Y    |
| ~    |
+------+
2 rows in set (0.04 sec)

Thank you for the bug report. Verified as described on FC6 32-bit.

Here is another case which looks odd.

mysql> create table t17 (s1 varchar(5) character set big5);
Query OK, 0 rows affected (0.00 sec)

mysql> insert into t17 values (0x60);
Query OK, 1 row affected (0.00 sec)

mysql> select s1,hex(s1) from t17 where s1 = 'e';
+------+---------+
| s1   | hex(s1) |
+------+---------+
| `    | 60      |
+------+---------+
1 row in set (0.00 sec)

Known problems:

big5_chinese_ci: E == e == `, Y == y == ~

cp866_general_ci: J > j

gb2312_chinese_ci: Y == y == ~

gbk_chinese_ci: Y == y == ~

These are all the sets of Chinese characters that are mistaken for one another in UTF-8.

Attachment: testdictaccuracy-results (application/octet-stream, text), 7.60 KiB.

This bug also appears in UTF-8. I've attached a file of all the sets of Chinese characters that are mistaken for each other. Each set is listed once per character it contains: for example, a set of 4 characters that are each mistaken for one another is listed four times. The number in front of each set is the number of characters/words in the set.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/46169

ChangeSet@1.2622, 2008-04-28 18:07:26+05:00, bar@mysql.com +14 -0
  Bug#25420 Tilde = 'Y' in Chinese
  Problems: wrong sort orders for the following characters:
  - TILDE and GRAVE ACCENT in big5_chinese_ci
  - LATIN SMALL LETTER J in cp866_general_ci
  - TILDE in gb2312_chinese_ci
  - TILDE in gbk_chinese_ci
  
  Fix: correct sort order for the above characters.

Hi Daniel,

Please provide an SQL script demonstrating the problem.

We can't reproduce this problem in utf8:

mysql> show create table t1;
+-------+--------------------------------------------------------------------------------------------------------------+
| Table | Create Table                                                                                                 |
+-------+--------------------------------------------------------------------------------------------------------------+
| t1    | CREATE TABLE `t1` (
  `a` varchar(10) default NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8 |
+-------+--------------------------------------------------------------------------------------------------------------+
1 row in set (0.04 sec)

mysql> select hex(a), a from t1;
+--------+------+
| hex(a) | a    |
+--------+------+
| E4B881 | 丁   |
| E5B881 | 币   |
+--------+------+
2 rows in set (0.00 sec)

mysql> select * from t1 t11, t1 t12 where t11.a=t12.a;
+------+------+
| a    | a    |
+------+------+
| 丁   | 丁   |
| 币   | 币   |
+------+------+
2 rows in set (0.00 sec)

I.e. I joined the table to itself. 
As you can see, 丁 is not equal to 币.

create table t1 (a varchar(10)) default charset=utf8;
insert into t1 value ('丁');
insert into t1 value ('币');
select * from t1 where a = '丁';
select * from t1 t11, t1 t12 where t11.a=t12.a;

mysql> create table t1 (a varchar(10)) default charset=utf8;
Query OK, 0 rows affected (0.02 sec)

mysql> insert into t1 value ('丁');
Query OK, 1 row affected (0.01 sec)

mysql> insert into t1 value ('币');
Query OK, 1 row affected (0.00 sec)

mysql> select * from t1 where a = '丁';
+------+
| a    |
+------+
| 丁  |
| 币  |
+------+
2 rows in set (0.00 sec)

mysql> select * from t1 t11, t1 t12 where t11.a=t12.a;
+------+------+
| a    | a    |
+------+------+
| 丁  | 丁  |
| 币  | 丁  |
| 丁  | 币  |
| 币  | 币  |
+------+------+
4 rows in set (0.00 sec)

Daniel,

Please paste the result of this query:

select hex(a), a from t1;

select hex(a), a from t1;
+--------------+------+
| hex(a)       | a    |
+--------------+------+
| C3A4C2B8C281 | 丁  |
| C3A5C2B8C281 | 币  |
+--------------+------+

Which console program do you use? xterm, gnome-terminal, konsole?
Which character set does it use?

What does "SHOW VARIABLES LIKE 'character_set%'" return?

Konsole, utf8.  

SHOW VARIABLES LIKE 'character_set%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | latin1                     |
| character_set_connection | latin1                     |
| character_set_database   | latin1                     |
| character_set_filesystem | binary                     |
| character_set_results    | latin1                     |
| character_set_server     | latin1                     |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+

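mysql thinks that the character set is latin1: character_set_client,
character_set_connection and character_set_results are all latin1,
so the UTF-8 bytes sent by your terminal are treated as latin1 text.
You need to run "mysql --default-character-set=utf8".
Another option is to add default-character-set=utf8 to
my.cnf.
A third option is to execute "set names utf8;" after
starting mysql.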
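
The hex values in the earlier output (C3A4C2B8C281, C3A5C2B8C281) are consistent with this: each UTF-8 byte was read as one latin1 character and then converted to UTF-8 again for storage. A small Python sketch of that round trip follows. Assumptions: Python's "latin-1" codec stands in for MySQL's latin1, and fold() is only a rough approximation of an accent- and case-insensitive comparison, not the server's collation code. The helper names are made up.

```python
import unicodedata

def double_encode(text):
    """UTF-8 bytes misread as latin1, then converted to UTF-8 for storage."""
    raw = text.encode("utf-8")          # what a UTF-8 terminal sends
    misread = raw.decode("latin-1")     # server believes the client is latin1
    return misread, misread.encode("utf-8")

def fold(text):
    """Rough stand-in for an accent- and case-insensitive comparison key."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = (c for c in decomposed if not unicodedata.combining(c))
    return "".join(kept).upper()

for ch in ("丁", "币"):
    misread, stored = double_encode(ch)
    # Print the stored bytes in hex and the folded comparison key.
    print(stored.hex().upper(), repr(fold(misread)))
```

The stored bytes match the hex(a) column above. The misread strings differ only in their first character (ä versus å), which an accent-insensitive comparison folds to the same letter; that is a likely reason the two values compared as equal.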

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/46349

ChangeSet@1.2642, 2008-05-05 16:22:31+05:00, bar@mysql.com +16 -0
  Bug#25420 Tilde = 'Y' in Chinese
  Problems: wrong sort orders for the following characters:
  - TILDE and GRAVE ACCENT in big5_chinese_ci
  - LATIN SMALL LETTER J in cp866_general_ci
  - TILDE in gb2312_chinese_ci
  - TILDE in gbk_chinese_ci

Pushed into 6.0.6-engines.

Merged into bzr-mysql-6.0.7

Noted in 6.0.6 changelog.

Certain characters were sorted incorrectly for the following
collations: TILDE and GRAVE ACCENT in big5_chinese_ci; LATIN SMALL
LETTER J in cp866_general_ci; TILDE in gb2312_chinese_ci; and TILDE
in gbk_chinese_ci. 

As a result of this fix, any indexes on columns that use these 
collations and contain the affected characters must be rebuilt when
upgrading to 6.0.6 or higher. To do this, use ALTER TABLE to drop and
re-add the indexes, or mysqldump to dump the affected tables and
mysql to reload the dump file.

Pushed into 6.0.7-alpha  (revid:sp1r-bar@mysql.com/bar.myoffice.izhnet.ru-20080505112231-54868) (version source revid:vvaintroub@mysql.com-20080804094710-jb2qpqxpf2ir2gf3) (pib:3)