Bug #67739 GBK charset is not Fully supported in mysql
Submitted: 28 Nov 2012 9:29 Modified: 3 Dec 2012 7:28
Reporter: vin chen
Status: Verified
Category:MySQL Server: Charsets Severity:S3 (Non-critical)
Version:All OS:Any
Assigned to: CPU Architecture:Any
Tags: charset, gbk, MySQL

[28 Nov 2012 9:29] vin chen
Description:
MySQL doesn't support the GBK codes between 0xFE50 and 0xFEA0, which correspond to the Unicode code points between 0xE815 and 0xE864.

http://www.khngai.com/chinese/charmap/tblgbk.php?page=7 

How to repeat:
mysql> create table t_gbk(c1 int, c2 varchar(20)) engine=innodb default charset= utf8;
Query OK, 0 rows affected (0.05 sec)

mysql> set names gbk;
Query OK, 0 rows affected (0.00 sec)

mysql> insert into t_gbk values(1,'');
Query OK, 1 row affected, 1 warning (0.05 sec)

mysql> show warnings;
+---------+------+-------------------------------------------------------------+
| Level   | Code | Message                                                     |
+---------+------+-------------------------------------------------------------+
| Warning | 1366 | Incorrect string value: '\xFE\x80' for column 'c2' at row 1 |
+---------+------+-------------------------------------------------------------+
1 row in set (0.00 sec)

#0xFE80 is a valid gbk code

mysql> insert into t_gbk values(1,'诉');
Query OK, 1 row affected (0.06 sec)

mysql> select * from t_gbk;
+------+------+
| c1   | c2   |
+------+------+
|    1 | ?    |
|    1 | 诉   |
+------+------+
2 rows in set (0.00 sec)

Suggested fix:
Add the 0xFE50~0xFEA0 GBK codes to strings/ctype-gbk.c
[28 Nov 2012 19:33] Sveta Smirnova
Thank you for the report.

MySQL supports gbk fully, including your character:

create table t_gbk(c1 int, c2 varchar(20)) engine=innodb default charset= gbk;
set names gbk;
insert into t_gbk values(1,'��');
insert into t_gbk values(1,'诉');
select * from t_gbk;
c1	c2
1	��
1	诉
select hex(c2) from t_gbk;
hex(c2)
FE80
CBDF

In your test you insert into a table with the utf8 charset, and this is why you get the error.

So technically this is not a bug but a feature request: "Add UTF support for more Chinese characters". Do you know which Unicode code point this character corresponds to?
[29 Nov 2012 14:45] Sveta Smirnova
We discussed this internally and I got confirmation that GBK symbols 0xFE50..0xFEA0 are not converted to U+E815..U+E864 in our Unicode implementation.

What you want looks like GBK 1.0 as described at http://en.wikipedia.org/wiki/GBK :

----<q>----
In 1995, China National Information Technology Standardization
Technical Committee set down the Chinese Internal Code Specification
(Chinese: 汉字内码扩展规范(GBK); pinyin: Hànzì Nèimǎ Kuòzhǎn Guīfàn
(GBK)), Version 1.0, known as GBK 1.0, which is a slight extension of
Codepage 936. The newly added 95 characters were not found in GB
13000.1-1993, and were provisionally assigned Unicode PUA code points. 
----</q>----

We can add these conversions, but we need some kind of official document. The web site you link to is not the "China National Information Technology Standardization Technical Committee", therefore we cannot use it as such a confirmation.

But do you perhaps know where the "Chinese Internal Code Specification (GBK), Version 1.0" standard is located?
[30 Nov 2012 3:23] vin chen
Sorry, I can't find the official document.
But 0xFE50-0xFEA0 are also marked as valid in http://ff.163.com/newflyff/gbk-list/ , which was published by "全国信息技术标准化技术委员会" (China National Information Technology Standardization Technical Committee).

And http://www.fmddlmyy.cn/text24.html says:
"When GBK was drawn up, Unicode did not yet contain these characters, so code points in the Private Use Area were used; the code points of these 80 characters are 0xE815-0xE864. Later, Unicode included 52 of the Chinese characters in "CJK Unified Ideographs Extension A", and 14 of the 28 radicals were included in the "CJK Radicals Supplement". So in the figure above, these characters all have two Unicode code points."

This means that these characters had no corresponding Unicode code points when the GBK charset was formulated, and Unicode added them to "CJK Unified Ideographs Extension A" later.

Maybe MySQL hasn't synchronized with these modifications.
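
To make the range arithmetic above concrete, here is a minimal Python sketch of the conversion being requested. It is not taken from MySQL or from any standard document: the function name `gbk_fe_to_pua` is hypothetical, and it assumes the PUA code points are assigned sequentially with trail byte 0x7F skipped (0x7F is not a valid GBK trail byte). Under that assumption the 80 codes in 0xFE50..0xFEA0 line up exactly with the 80 code points 0xE815..0xE864.

```python
# Hypothetical helper, not MySQL code.
# Assumption: PUA code points are assigned in order, skipping trail byte 0x7F.

GBK_LEAD = 0xFE
PUA_START = 0xE815


def gbk_fe_to_pua(code: int) -> int:
    """Return the provisional PUA code point for a GBK code in 0xFE50..0xFEA0."""
    lead, trail = code >> 8, code & 0xFF
    if lead != GBK_LEAD or not 0x50 <= trail <= 0xA0 or trail == 0x7F:
        raise ValueError(f"0x{code:04X} is outside the range 0xFE50..0xFEA0")
    # Trail bytes above 0x7F shift down by one because 0x7F is unused.
    index = trail - 0x50 - (1 if trail > 0x7F else 0)
    return PUA_START + index


if __name__ == "__main__":
    for gbk in (0xFE50, 0xFE80, 0xFEA0):
        print(f"GBK 0x{gbk:04X} -> U+{gbk_fe_to_pua(gbk):04X}")
```

If the sequential assumption holds, the code from the warning in the report, 0xFE80, would land on U+E844.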
[30 Nov 2012 16:42] Sveta Smirnova
Thank you for the feedback.

I set the report to "Verified", so we will consider whether we can implement this.

> This means that these characters had no corresponding Unicode code points when the GBK charset was formulated, and Unicode added them to "CJK Unified Ideographs Extension A" later.
...
> Maybe MySQL hasn't synchronized with these modifications.

Yes, MySQL does not support these new additions to GBK.
[3 Dec 2012 7:28] vin chen
GBK: 0xD7FA~0xD7FE
Unicode: 0xE810~0xE814

These are space characters in GBK.

These conversions should also be added to MySQL.
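
As an illustration of this second range, a minimal Python sketch follows. The function name `gbk_d7_to_pua` is hypothetical, and the one-to-one, in-order pairing of the five codes is an assumption read off the two ranges above, not something confirmed by an official document in this thread.

```python
# Hypothetical helper, not MySQL code.
# Assumption: 0xD7FA..0xD7FE pair up in order with U+E810..U+E814.

def gbk_d7_to_pua(code: int) -> int:
    """Return the PUA code point paired with a GBK code in 0xD7FA..0xD7FE."""
    if not 0xD7FA <= code <= 0xD7FE:
        raise ValueError(f"0x{code:04X} is outside the range 0xD7FA..0xD7FE")
    return 0xE810 + (code - 0xD7FA)


if __name__ == "__main__":
    for gbk in range(0xD7FA, 0xD7FF):
        print(f"GBK 0x{gbk:04X} -> U+{gbk_d7_to_pua(gbk):04X}")
```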
[8 Sep 2013 18:18] Peter Laursen
Also see 
http://bugs.mysql.com/bug.php?id=70270
http://bugs.mysql.com/bug.php?id=70271
[8 Sep 2013 18:20] Peter Laursen
And this is DEFINITELY more than a feature request.

This means that a dump cannot be restored. This is CRITICAL (at least 'S2' in the categorizations available here).
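
One way to find out whether data is affected before converting it is to scan the raw GBK bytes for codes in the two ranges reported in this thread. The Python sketch below is only an illustration: `find_unconvertible` is a hypothetical name, the input is assumed to be well-formed GBK bytes, and the lead-byte rule (0x81..0xFE starts a two-byte code) is the usual GBK layout rather than anything taken from the MySQL source.

```python
# Hypothetical scanner, not part of MySQL or mysqldump.
# Assumption: `data` is well-formed GBK, where a byte in 0x81..0xFE
# starts a two-byte code and every other byte stands alone.

def find_unconvertible(data: bytes):
    """Yield (offset, code) for each GBK code in the ranges from this report."""
    i = 0
    while i < len(data):
        lead = data[i]
        if not 0x81 <= lead <= 0xFE or i + 1 >= len(data):
            i += 1  # single byte, or a dangling lead byte at the end
            continue
        trail = data[i + 1]
        code = (lead << 8) | trail
        in_fe_range = 0xFE50 <= code <= 0xFEA0 and trail != 0x7F
        in_d7_range = 0xD7FA <= code <= 0xD7FE
        if in_fe_range or in_d7_range:
            yield i, code
        i += 2


if __name__ == "__main__":
    sample = b"a" + b"\xcb\xdf" + b"\xfe\x80" + b"\xd7\xfa"
    for offset, code in find_unconvertible(sample):
        print(f"offset {offset}: GBK 0x{code:04X}")
```

Stepping two bytes at a time after a lead byte keeps the scan from mistaking the trail byte of one character for the lead byte of the next.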
[27 Apr 2015 7:24] Chiranjeevi Battula
http://bugs.mysql.com/bug.php?id=76822 marked as duplicate of this one.