Bug #58320 | Character's outside BMP0 is converted to '?' in utf8/ucs2. | ||
---|---|---|---|
Submitted: | 19 Nov 2010 11:49 | Modified: | 25 May 2011 19:42 |
Reporter: | Bernt Marius Johnsen | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Server: Documentation | Severity: | S3 (Non-critical) |
Version: | 5.6 | OS: | Any |
Assigned to: | Paul DuBois | CPU Architecture: | Any |
[19 Nov 2010 11:49]
Bernt Marius Johnsen
[19 Nov 2010 12:23]
Susanne Ebrecht
create table utf16table (v varchar(10) character set utf16); create table utf8table (v varchar(10) character set utf8); insert into utf16table values (unhex('D800DF84')),('?'); select hex(v) from utf16table; insert into utf8table (select v from utf16table); select hex(v) from utf8table; This isn't working that way. What environment encoding are you using for inserting? Which client encoding is set? I am not seeing a set names somewhere? Please show output from: SHOW VARIABLES LIKE 'char%'; And show output from: $ locale
[19 Nov 2010 12:48]
Bernt Marius Johnsen
My example is written independent of locale and SET NAMES. But, as you wish: locale on client (and server): LANG=en_GB.UTF-8 LC_CTYPE="en_GB.UTF-8" LC_NUMERIC="en_GB.UTF-8" LC_TIME="en_GB.UTF-8" LC_COLLATE="en_GB.UTF-8" LC_MONETARY="en_GB.UTF-8" LC_MESSAGES="en_GB.UTF-8" LC_PAPER="en_GB.UTF-8" LC_NAME="en_GB.UTF-8" LC_ADDRESS="en_GB.UTF-8" LC_TELEPHONE="en_GB.UTF-8" LC_MEASUREMENT="en_GB.UTF-8" LC_IDENTIFICATION="en_GB.UTF-8" LC_ALL= Rerun of example: mysql> SHOW VARIABLES LIKE 'char%'; +--------------------------+------------------------------------------------------------+ | Variable_name | Value | +--------------------------+------------------------------------------------------------+ | character_set_client | utf8 | | character_set_connection | utf8 | | character_set_database | latin1 | | character_set_filesystem | binary | | character_set_results | utf8 | | character_set_server | latin1 | | character_set_system | utf8 | | character_sets_dir | /export/home/tmp/mysql/trunk-bugfixing/sql/share/charsets/ | +--------------------------+------------------------------------------------------------+ 8 rows in set (0.00 sec) mysql> create table utf16table (v varchar(10) character set utf16); Query OK, 0 rows affected (0.08 sec) mysql> create table utf8table (v varchar(10) character set utf8); Query OK, 0 rows affected (0.07 sec) mysql> insert into utf16table values (unhex('D800DF84')),('?'); Query OK, 2 rows affected (0.00 sec) Records: 2 Duplicates: 0 Warnings: 0 mysql> select hex(v) from utf16table; +----------+ | hex(v) | +----------+ | D800DF84 | | 003F | +----------+ 2 rows in set (0.00 sec) mysql> insert into utf8table (select v from utf16table); Query OK, 2 rows affected, 1 warning (0.00 sec) Records: 2 Duplicates: 0 Warnings: 1 mysql> select hex(v) from utf8table; +--------+ | hex(v) | +--------+ | 3F | | 3F | +--------+ 2 rows in set (0.00 sec)
[19 Nov 2010 15:07]
Valeriy Kravchuk
Verified as described on Mac OS X. I am not sure is it a bug or not, but I can confirm the results: macbook-pro:5.5 openxs$ bin/mysql -uroot test Reading table information for completion of table and column names You can turn off this feature to get a quicker startup with -A Welcome to the MySQL monitor. Commands end with ; or \g. Your MySQL connection id is 10 Server version: 5.5.7-rc-debug Source distribution Copyright (c) 2000, 2010, Oracle and/or its affiliates. All rights reserved. Oracle is a registered trademark of Oracle Corporation and/or its affiliates. Other names may be trademarks of their respective owners. Type 'help;' or '\h' for help. Type '\c' to clear the current input statement. mysql> create table utf16table (v varchar(10) character set utf16); Query OK, 0 rows affected (0.12 sec) mysql> create table utf8table (v varchar(10) character set utf8); Query OK, 0 rows affected (0.18 sec) mysql> insert into utf16table values (unhex('D800DF84')),('?'); Query OK, 2 rows affected (0.05 sec) Records: 2 Duplicates: 0 Warnings: 0 mysql> select hex(v) from utf16table; +----------+ | hex(v) | +----------+ | D800DF84 | | 003F | +----------+ 2 rows in set (0.00 sec) mysql> insert into utf8table (select v from utf16table); Query OK, 2 rows affected, 1 warning (0.01 sec) Records: 2 Duplicates: 0 Warnings: 1 mysql> select hex(v) from utf8table; +--------+ | hex(v) | +--------+ | 3F | | 3F | +--------+ 2 rows in set (0.00 sec) mysql> select * from utf8table; +------+ | v | +------+ | ? | | ? | +------+ 2 rows in set (0.35 sec) mysql> insert into utf8table (select v from utf16table); Query OK, 2 rows affected, 1 warning (0.01 sec) Records: 2 Duplicates: 0 Warnings: 1 mysql> show warnings\G *************************** 1. row *************************** Level: Warning Code: 1366 Message: Incorrect string value: '\xD8\x00\xDF\x84' for column 'v' at row 1 1 row in set (0.00 sec) mysql> exit Bye macbook-pro:5.5 openxs$ bin/mysql -uroot test Reading table information for completion of table and column names You can turn off this feature to get a quicker startup with -A Welcome to the MySQL monitor. Commands end with ; or \g. Your MySQL connection id is 11 Server version: 5.5.7-rc-debug Source distribution Copyright (c) 2000, 2010, Oracle and/or its affiliates. All rights reserved. Oracle is a registered trademark of Oracle Corporation and/or its affiliates. Other names may be trademarks of their respective owners. Type 'help;' or '\h' for help. Type '\c' to clear the current input statement. mysql> set names utf8; Query OK, 0 rows affected (0.00 sec) mysql> insert into utf8table (select v from utf16table); Query OK, 2 rows affected, 1 warning (0.00 sec) Records: 2 Duplicates: 0 Warnings: 1 mysql> show warnings\G *************************** 1. row *************************** Level: Warning Code: 1366 Message: Incorrect string value: '\xD8\x00\xDF\x84' for column 'v' at row 1 1 row in set (0.00 sec)
[19 Nov 2010 15:08]
Peter Gulutzan
Is there something in the MySQL documentation to support the statement Should have been converted to U+FFFD (Replacement Character)"?
[19 Nov 2010 15:10]
Bernt Marius Johnsen
Could not find anything in the docs that supports neither U+003F nor U+FFFD.
[19 Nov 2010 15:19]
Alexander Barkov
This is not a bug. We only say that in Uncode-4.0.0 collations characters outside BMP are compared as REPLACEMENT CHARACTER, but we don't say that they should be converted to REPLACEMENT CHARACTER. Impossible conversion always results in QUESTION MARK.
[19 Nov 2010 15:20]
Alexander Barkov
Consider changing category to "Documentation".
[19 Nov 2010 15:41]
Valeriy Kravchuk
So, may be we should explain this in the documentation?
[22 Nov 2010 12:28]
Bernt Marius Johnsen
Something still is wrong, because when you do the same to an ucs2 column, characters outside BMP0 is *NOT* converted to '?' (nor to U+FFFD), but just truncated: mysql> create table utf16table (v varchar(10) character set utf16); Query OK, 0 rows affected (0.08 sec) mysql> insert into utf16table values (unhex('D800DF84')),('?'); Query OK, 2 rows affected (0.00 sec) Records: 2 Duplicates: 0 Warnings: 0 mysql> create table ucs2table (v varchar(10) character set ucs2); Query OK, 0 rows affected (0.08 sec) mysql> insert into ucs2table (select v from utf16table); Query OK, 2 rows affected (0.00 sec) Records: 2 Duplicates: 0 Warnings: 0 mysql> select hex(v) from ucs2table; +--------+ | hex(v) | +--------+ | 0384 | | 003F | +--------+ 2 rows in set (0.00 sec)
[25 May 2011 19:42]
Paul DuBois
Thank you for your bug report. This issue has been addressed in the documentation. The updated documentation will appear on our website shortly, and will be included in the next release of the relevant products. Added this to http://dev.mysql.com/doc/refman/5.1/en/charset-unicode.html: Characters outside the BMP compare as REPLACEMENT CHARACTER and convert to '?' when converted to a Unicode character set.