MySQL Bugs: #76822: GBK and GB2312 charset doesn't support EUDC to unicode PUA conversion

Bug #76822	GBK and GB2312 charset doesn't support EUDC to unicode PUA conversion
Submitted:	24 Apr 2015 10:41	Modified:	27 Apr 2015 7:22
Reporter:	Xindong Su
Status:	Duplicate
Category:	MySQL Server: Charsets	Severity:	S2 (Serious)
Version:	all	OS:	Any
Assigned to:		CPU Architecture:	Any
Tags:	charset, eudc, gb2312, gbk, ODBC, pua

Description:
MySQL's GBK and GB2312 charset implementations do not support converting EUDC (End User Defined Character) codes to Unicode PUA (Private Use Area) code points.

On the server side, this causes data loss when such characters are inserted into a table.

On the client side, e.g. when using ODBC, it sends the ODBC driver into an infinite loop while parsing an SQL command that contains EUDC characters.

How to repeat:
Test case 1: use any client program whose connection to the database uses the utf8 charset, such as SQLyog.

CREATE TABLE `test_gbk` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `c1` varchar(20) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=gbk;

INSERT INTO test_gbk SET c1=""; 

The string literal in this statement contains the PUA character U+E000 (it may not display on this page). It should be converted to the GBK EUDC code AAA1, but instead the statement produces a warning:

Warning Code : 1366
Incorrect string value: '\xEE\x80\x80' for column 'c1' at row 1

Now look at the content of column c1:

SELECT HEX(c1) FROM test_gbk;

hex(c1)
3F

The character was stored as 3F ('?'), so the original data is lost.
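The bytes in the warning and in the HEX() output can be checked outside MySQL. The following is a minimal Python sketch, not part of the original report; the variable names are illustrative only. It confirms that '\xEE\x80\x80' is the UTF-8 encoding of U+E000 and that 3F is the ASCII question mark used as a replacement character.

```python
# U+E000 is the first code point of the Unicode Private Use Area.
pua_char = "\ue000"

# UTF-8 bytes of U+E000, as hex; these are the bytes quoted in warning 1366.
utf8_hex = pua_char.encode("utf-8").hex().upper()

# The single byte found in column c1 after the INSERT.
stored_char = bytes.fromhex("3F").decode("ascii")

print(utf8_hex, stored_char)
```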

Test case 2: use the MySQL ODBC ANSI Driver (any 5.x version) and any client program that can connect to the database through an ODBC driver, such as ADOQUERY (http://www.mitec.cz/adoq.html):

1. Connect to the database with a connection string, e.g. (replace ***** with the username/password for your test environment):

Provider=MSDASQL.1;Password=*****;Persist Security Info=True;User ID=*****;Extended Properties="DRIVER={MySQL ODBC 5.3 Driver};SERVER=127.0.0.1;DATABASE=test;UID=*****;PWD=*****;OPTION=2058;CHARSET=gbk;BIG_PACKETS=1;COMPRESSED_PROTO=1;NO_PROMPT=1;MULTI_STATEMENTS=1"

2. Execute the SQL command:

update test_gbk set c1="" where id=1;

The client program hangs. Note: because this bug report page is in utf8, the browser automatically converts the EUDC character to a PUA character. Please use the attached file rather than copying and pasting from the web page.

Attaching VS2010 to the client program shows that the infinite loop occurs in the function tokenize in parse.c. The cause is that the function my_mb_wc_gbk in ctype-gbk.c does not recognize the EUDC character and always returns 0, so the function step_char cannot advance to the next character.
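To illustrate why a converter that returns 0 stalls the scan, here is a small Python simulation. It is only a sketch and not the Connector/ODBC code: mb_wc_stub and tokenize_positions are hypothetical stand-ins for my_mb_wc_gbk and the tokenize/step_char loop, and the range of "known" lead bytes is an assumption made for the demonstration. The guard shows one defensive way to keep the position moving; it does not replace the conversion fix requested in this report.

```python
def mb_wc_stub(buf, pos):
    """Hypothetical decoder: return bytes consumed, or 0 if unrecognized."""
    b = buf[pos]
    if b < 0x80:
        return 1  # single-byte ASCII
    # Assumption for this demo: only lead bytes 0xB0..0xF7 are recognized,
    # so EUDC lead bytes such as 0xAA yield 0.
    if 0xB0 <= b <= 0xF7 and pos + 1 < len(buf):
        return 2
    return 0


def tokenize_positions(buf):
    """Walk the buffer and record the start offset of every step."""
    positions = []
    pos = 0
    while pos < len(buf):
        positions.append(pos)
        step = mb_wc_stub(buf, pos)
        # Without this guard, step == 0 would leave pos unchanged forever.
        pos += step if step > 0 else 1
    return positions


# 0xAA 0xA1 is the EUDC code AAA1 from this report.
print(tokenize_positions(b'a\xaa\xa1b'))
```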

Suggested fix:
Please extend ctype-gbk.c and ctype-gb2312.c to convert EUDC to PUA and vice versa. The mapping rules are:

GBK/GB2312 EUDC area <-> unicode PUA area
           AAA1~AFFE <-> E000~E233
           A140~A7A0 <-> E4C6~E765
           F8A1~FEFE <-> E234~E4C5
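The three ranges above can be turned into arithmetic. The sketch below is in Python rather than the C of ctype-gbk.c, and it rests on two assumptions that the report does not state explicitly: codes are numbered row by row (lead byte first, then trail byte), and trail byte 0x7F is skipped. Under those assumptions the ranges hold 564, 658 and 672 codes, which matches the sizes of the listed PUA ranges exactly. The function names are hypothetical.

```python
# (first GBK code, last GBK code, first PUA code point), taken from the report.
EUDC_RANGES = [
    (0xAAA1, 0xAFFE, 0xE000),
    (0xF8A1, 0xFEFE, 0xE234),
    (0xA140, 0xA7A0, 0xE4C6),
]


def _trail_bytes(lo, hi):
    """Valid trail bytes in [lo, hi]; 0x7F is assumed to be skipped."""
    return [b for b in range(lo, hi + 1) if b != 0x7F]


def gbk_eudc_to_pua(code):
    """Map a two-byte GBK EUDC code to a PUA code point, or None."""
    lead, trail = code >> 8, code & 0xFF
    for first, last, pua_base in EUDC_RANGES:
        trails = _trail_bytes(first & 0xFF, last & 0xFF)
        if first >> 8 <= lead <= last >> 8 and trail in trails:
            row = lead - (first >> 8)
            return pua_base + row * len(trails) + trails.index(trail)
    return None


def pua_to_gbk_eudc(cp):
    """Map a PUA code point back to a GBK EUDC code, or None."""
    for first, last, pua_base in EUDC_RANGES:
        trails = _trail_bytes(first & 0xFF, last & 0xFF)
        rows = (last >> 8) - (first >> 8) + 1
        index = cp - pua_base
        if 0 <= index < rows * len(trails):
            row, col = divmod(index, len(trails))
            return ((first >> 8) + row) << 8 | trails[col]
    return None


print(hex(gbk_eudc_to_pua(0xAAA1)), hex(pua_to_gbk_eudc(0xE765)))
```

Both directions reproduce the endpoints listed in the table, for example AAA1 <-> E000 and A7A0 <-> E765, and return None for codes outside the three ranges.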

sql commands in GBK charset

Attachment: sql.txt (text/plain), 669 bytes.

Hello Xindong Su,

Thank you for the bug report.
This is duplicate of http://bugs.mysql.com/bug.php?id=67739.

Thanks,
Chiranjeevi.