Bug #58320 Character's outside BMP0 is converted to '?' in utf8/ucs2.
Submitted: 19 Nov 2010 11:49 Modified: 25 May 2011 19:42
Reporter: Bernt Marius Johnsen Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Documentation Severity:S3 (Non-critical)
Version:5.6 OS:Any
Assigned to: Paul DuBois CPU Architecture:Any

[19 Nov 2010 11:49] Bernt Marius Johnsen
Description:
Character's outside BMP0 is converted to '?' in utf8 or ucs2. Should have been converted to U+FFFD (Replacement Character). This applies to characters in utf16 or utf32 with a code point value larger than 0xFFFF;

mysql> create table utf16table (v varchar(10) character set utf16);
Query OK, 0 rows affected (0.07 sec)

mysql> create table utf8table (v varchar(10) character set utf8);
Query OK, 0 rows affected (0.06 sec)

mysql> insert into utf16table values (unhex('D800DF84')),('?');
Query OK, 2 rows affected (0.00 sec)
Records: 2  Duplicates: 0  Warnings: 0

mysql> select hex(v) from utf16table;
+----------+
| hex(v)   |
+----------+
| D800DF84 |
| 003F     |
+----------+
2 rows in set (0.00 sec)

mysql> insert into utf8table (select v from utf16table);
Query OK, 2 rows affected, 1 warning (0.00 sec)
Records: 2  Duplicates: 0  Warnings: 1

mysql> select hex(v) from utf8table;
+--------+
| hex(v) |
+--------+
| 3F     |
| 3F     |
+--------+
2 rows in set (0.00 sec)

How to repeat:
create table utf16table (v varchar(10) character set utf16);
create table utf8table (v varchar(10) character set utf8);
insert into utf16table values (unhex('D800DF84')),('?');
select hex(v) from utf16table;
insert into utf8table (select v from utf16table);
select hex(v) from utf8table;
[19 Nov 2010 12:23] Susanne Ebrecht
create table utf16table (v varchar(10) character set utf16);
create table utf8table (v varchar(10) character set utf8);
insert into utf16table values (unhex('D800DF84')),('?');
select hex(v) from utf16table;
insert into utf8table (select v from utf16table);
select hex(v) from utf8table;

This isn't working that way.

What environment encoding are you using for inserting?
Which client encoding is set?

I am not seeing a set names somewhere?

Please show output from:
SHOW VARIABLES LIKE 'char%';

And show output from:
$ locale
[19 Nov 2010 12:48] Bernt Marius Johnsen
My example is written independent of locale and SET NAMES. But, as you wish:

locale on client (and server):

LANG=en_GB.UTF-8
LC_CTYPE="en_GB.UTF-8"
LC_NUMERIC="en_GB.UTF-8"
LC_TIME="en_GB.UTF-8"
LC_COLLATE="en_GB.UTF-8"
LC_MONETARY="en_GB.UTF-8"
LC_MESSAGES="en_GB.UTF-8"
LC_PAPER="en_GB.UTF-8"
LC_NAME="en_GB.UTF-8"
LC_ADDRESS="en_GB.UTF-8"
LC_TELEPHONE="en_GB.UTF-8"
LC_MEASUREMENT="en_GB.UTF-8"
LC_IDENTIFICATION="en_GB.UTF-8"
LC_ALL=

Rerun of example:

mysql> SHOW VARIABLES LIKE 'char%';
+--------------------------+------------------------------------------------------------+
| Variable_name            | Value                                                      |
+--------------------------+------------------------------------------------------------+
| character_set_client     | utf8                                                       |
| character_set_connection | utf8                                                       |
| character_set_database   | latin1                                                     |
| character_set_filesystem | binary                                                     |
| character_set_results    | utf8                                                       |
| character_set_server     | latin1                                                     |
| character_set_system     | utf8                                                       |
| character_sets_dir       | /export/home/tmp/mysql/trunk-bugfixing/sql/share/charsets/ |
+--------------------------+------------------------------------------------------------+
8 rows in set (0.00 sec)

mysql> create table utf16table (v varchar(10) character set utf16);
Query OK, 0 rows affected (0.08 sec)

mysql> create table utf8table (v varchar(10) character set utf8);
Query OK, 0 rows affected (0.07 sec)

mysql> insert into utf16table values (unhex('D800DF84')),('?');
Query OK, 2 rows affected (0.00 sec)
Records: 2  Duplicates: 0  Warnings: 0

mysql> select hex(v) from utf16table;
+----------+
| hex(v)   |
+----------+
| D800DF84 |
| 003F     |
+----------+
2 rows in set (0.00 sec)

mysql> insert into utf8table (select v from utf16table);
Query OK, 2 rows affected, 1 warning (0.00 sec)
Records: 2  Duplicates: 0  Warnings: 1

mysql> select hex(v) from utf8table;
+--------+
| hex(v) |
+--------+
| 3F     |
| 3F     |
+--------+
2 rows in set (0.00 sec)
[19 Nov 2010 15:07] Valeriy Kravchuk
Verified as described on Mac OS X. I am not sure is it a bug or not, but I can confirm the results:

macbook-pro:5.5 openxs$ bin/mysql -uroot test
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 10
Server version: 5.5.7-rc-debug Source distribution

Copyright (c) 2000, 2010, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> create table utf16table (v varchar(10) character set utf16);
Query OK, 0 rows affected (0.12 sec)

mysql> create table utf8table (v varchar(10) character set utf8);
Query OK, 0 rows affected (0.18 sec)

mysql> insert into utf16table values (unhex('D800DF84')),('?');
Query OK, 2 rows affected (0.05 sec)
Records: 2  Duplicates: 0  Warnings: 0

mysql> select hex(v) from utf16table;
+----------+
| hex(v)   |
+----------+
| D800DF84 |
| 003F     |
+----------+
2 rows in set (0.00 sec)

mysql> insert into utf8table (select v from utf16table);
Query OK, 2 rows affected, 1 warning (0.01 sec)
Records: 2  Duplicates: 0  Warnings: 1

mysql> select hex(v) from utf8table;
+--------+
| hex(v) |
+--------+
| 3F     |
| 3F     |
+--------+
2 rows in set (0.00 sec)

mysql> select * from utf8table;
+------+
| v    |
+------+
| ?    |
| ?    |
+------+
2 rows in set (0.35 sec)

mysql> insert into utf8table (select v from utf16table);
Query OK, 2 rows affected, 1 warning (0.01 sec)
Records: 2  Duplicates: 0  Warnings: 1

mysql> show warnings\G
*************************** 1. row ***************************
  Level: Warning
   Code: 1366
Message: Incorrect string value: '\xD8\x00\xDF\x84' for column 'v' at row 1
1 row in set (0.00 sec)

mysql> exit
Bye
macbook-pro:5.5 openxs$ bin/mysql -uroot test
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 11
Server version: 5.5.7-rc-debug Source distribution

Copyright (c) 2000, 2010, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> set names utf8;
Query OK, 0 rows affected (0.00 sec)

mysql> insert into utf8table (select v from utf16table);
Query OK, 2 rows affected, 1 warning (0.00 sec)
Records: 2  Duplicates: 0  Warnings: 1

mysql> show warnings\G
*************************** 1. row ***************************
  Level: Warning
   Code: 1366
Message: Incorrect string value: '\xD8\x00\xDF\x84' for column 'v' at row 1
1 row in set (0.00 sec)
[19 Nov 2010 15:08] Peter Gulutzan
Is there something in the MySQL documentation
to support the statement
Should have been converted to U+FFFD (Replacement Character)"?
[19 Nov 2010 15:10] Bernt Marius Johnsen
Could not find anything in the docs that supports neither U+003F nor U+FFFD.
[19 Nov 2010 15:19] Alexander Barkov
This is not a bug.

We only say that in Uncode-4.0.0 collations characters outside BMP
are compared as REPLACEMENT CHARACTER,

but we don't say that they should be converted to REPLACEMENT CHARACTER.

Impossible conversion always results in QUESTION MARK.
[19 Nov 2010 15:20] Alexander Barkov
Consider changing category to "Documentation".
[19 Nov 2010 15:41] Valeriy Kravchuk
So, may be we should explain this in the documentation?
[22 Nov 2010 12:28] Bernt Marius Johnsen
Something still is wrong, because when you do the same to an ucs2 column, characters outside BMP0 is *NOT* converted to '?' (nor to U+FFFD), but just truncated:

mysql> create table utf16table (v varchar(10) character set utf16);
Query OK, 0 rows affected (0.08 sec)

mysql> insert into utf16table values (unhex('D800DF84')),('?');
Query OK, 2 rows affected (0.00 sec)
Records: 2  Duplicates: 0  Warnings: 0

mysql> create table ucs2table (v varchar(10) character set ucs2);
Query OK, 0 rows affected (0.08 sec)

mysql> insert into ucs2table (select v from utf16table);
Query OK, 2 rows affected (0.00 sec)
Records: 2  Duplicates: 0  Warnings: 0

mysql> select hex(v) from ucs2table;
+--------+
| hex(v) |
+--------+
| 0384   |
| 003F   |
+--------+
2 rows in set (0.00 sec)
[25 May 2011 19:42] Paul DuBois
Thank you for your bug report. This issue has been addressed in the documentation. The updated documentation will appear on our website shortly, and will be included in the next release of the relevant products.

Added this to http://dev.mysql.com/doc/refman/5.1/en/charset-unicode.html:

Characters outside the BMP compare as REPLACEMENT CHARACTER and 
convert to '?' when converted to a Unicode character set.