Bug #58321 No warning when characters outside BMP0 is converted to UCS2
Submitted: 19 Nov 2010 11:56 Modified: 22 Dec 2010 19:27
Reporter: Bernt Marius Johnsen Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Charsets Severity:S3 (Non-critical)
Version:5.6, 5.5 OS:Any
Assigned to: Alexander Barkov
Triage: Triaged: D2 (Serious)

[19 Nov 2010 11:56] Bernt Marius Johnsen
Description:
When characters outside BMP0 is converted to utf8, a warning is issued. When they are converted to ucs2, no warning is issued.

mysql> create table t1 (v varchar(10) character set utf32);
Query OK, 0 rows affected (0.11 sec)

mysql> insert into t1 values (0x00012345);
Query OK, 1 row affected (0.00 sec)

mysql> create table t2 (v varchar(10) character set utf8);
Query OK, 0 rows affected (0.10 sec)

mysql> create table t3 (v varchar(10) character set ucs2);
Query OK, 0 rows affected (0.09 sec)

mysql> insert into t2 (select * from t1);
Query OK, 1 row affected, 1 warning (0.00 sec)
Records: 1  Duplicates: 0  Warnings: 1

mysql> show warnings;
+---------+------+--------------------------------------------------------------------+
| Level   | Code | Message                                                            |
+---------+------+--------------------------------------------------------------------+
| Warning | 1366 | Incorrect string value: '\x00\x01\x23\x45' for column 'v' at row 1 |
+---------+------+--------------------------------------------------------------------+
1 row in set (0.00 sec)

mysql> insert into t3 (select * from t1);
Query OK, 1 row affected (0.00 sec)
Records: 1  Duplicates: 0  Warnings: 0

How to repeat:
create table t1 (v varchar(10) character set utf32);
insert into t1 values (0x00012345);
create table t2 (v varchar(10) character set utf8);
create table t3 (v varchar(10) character set ucs2);
insert into t2 (select * from t1);
show warnings;
insert into t3 (select * from t1);
[19 Nov 2010 12:30] Susanne Ebrecht
Many thanks for writing a bug report.

You missed to use SET NAMES here to set client encoding to your environment encoding.
[19 Nov 2010 12:40] Bernt Marius Johnsen
SET NAMES is of no relevance to my example, but since you insist:

mysql> set names utf8;
Query OK, 0 rows affected (0.00 sec)

mysql> create table t1 (v varchar(10) character set utf32);
Query OK, 0 rows affected (0.06 sec)

mysql> insert into t1 values (0x00012345);
Query OK, 1 row affected (0.00 sec)

mysql> create table t2 (v varchar(10) character set utf8);
Query OK, 0 rows affected (0.07 sec)

mysql> create table t3 (v varchar(10) character set ucs2);
Query OK, 0 rows affected (0.07 sec)

mysql> insert into t2 (select * from t1);
Query OK, 1 row affected, 1 warning (0.00 sec)
Records: 1  Duplicates: 0  Warnings: 1

mysql> show warnings;
+---------+------+--------------------------------------------------------------------+
| Level   | Code | Message                                                            |
+---------+------+--------------------------------------------------------------------+
| Warning | 1366 | Incorrect string value: '\x00\x01\x23\x45' for column 'v' at row 1 |
+---------+------+--------------------------------------------------------------------+
1 row in set (0.00 sec)

mysql> insert into t3 (select * from t1);
Query OK, 1 row affected (0.00 sec)
Records: 1  Duplicates: 0  Warnings: 0
[19 Nov 2010 12:53] Susanne Ebrecht
0x00012345 is not a valid UTF8 sign.

Take here the valid utf8 sign.
[19 Nov 2010 12:57] Bernt Marius Johnsen
Of course 0x00012345 is not a valid UTF8 "sign", it's a valid Unicode character outside BMP0 and thus stored in a table with character set UTF32. That's the whole point! When the conversion to UTF8 is attempted, I get a warning (which is correct). When conversion to UCS2 is attempted, I DO NOT get a warning. And that's the bug.
[19 Nov 2010 13:48] Susanne Ebrecht
You also not get warning in utf16.

http://www.unicode.org/faq/basic_q.html#14

http://dev.mysql.com/doc/refman/5.0/en/charset-unicode.html
http://dev.mysql.com/doc/refman/5.0/en/charset-unicode-ucs2.html
[19 Nov 2010 15:12] Bernt Marius Johnsen
Of course you dont get a warning for utf16, since 0x00012345 is supported by utf16. The problem is: YOU DO NOT GET A WARNING WHEN IT IS TRUNCATED TO UCS2 WHICH DO NOT SUPPORT CHARACTERS LARGER THAN 0x0000FFFF!!!!
[19 Nov 2010 19:25] Sveta Smirnova
Thank you for the report.

Verified as described although 0X00012345 can be successfully inserted too:

set names utf8;
create table t1 (v varchar(10) character set utf32);
insert into t1 values (0x00012345);
create table t2 (v varchar(10) character set utf8);
create table t3 (v varchar(10) character set ucs2);
insert into t2 (select * from t1);
Warnings:
Warning 1366    Incorrect string value: '\x00\x01\x23\x45' for column 'v' at row 1
show warnings;
Level   Code    Message
Warning 1366    Incorrect string value: '\x00\x01\x23\x45' for column 'v' at row 1
insert into t3 (select * from t1);
show warnings;
Level   Code    Message
insert into t3 values (0x00012345);
select hex(v) from t1;
hex(v)
00012345
select hex(v) from t2;
hex(v)
3F
select hex(v) from t3;
hex(v)
2345
00012345
[22 Nov 2010 12:22] Bernt Marius Johnsen
Well, according to the docs ch 9.1.10:

ucs2 and utf8 support BMP characters. utf8mb4, utf16, and utf32 support BMP and supplementary characters. 

So 0x0001234 should not be possible in UCS2, so here something is missing (truncation/error/warning ....)
[22 Nov 2010 12:38] Bernt Marius Johnsen
Ah, the last one is correct since 0x00012345 in ucs2 context is interpreted as a string if *TWO* UCS2 characters: 0x0001 and 0x1234.
[23 Nov 2010 18:42] Peter Gulutzan
This bug occurs in 5.5 as well as 5.6,
and occurs for conversions from utf16 as well as utf32,
and is a bit worse than 'lack of warning':
the 12345 becomes 2345.
[15 Dec 2010 10:07] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/126898

3195 Alexander Barkov	2010-12-15
      Bug#58321 No warning when characters outside BMP0 is converted to UCS2
      
      Problem: when inserting supplementary characters to an UCS2 column,
      character was silently shrinked to 16-bit value.
      
      Fix: produce a warning on attempt to insert a supplementary character,
      and convert to question mark.
      
        @ mysql-test/r/ctype_many.result
        @ mysql-test/t/ctype_many.test
        Adding tests
      
        @ strings/ctype-ucs2.c
        Check if wc is greater than the highest value supported (0xFFFF),
        return MY_CS_ILUNI if true.
[15 Dec 2010 10:41] Alexander Barkov
Pushed into mysql-5.5-bugteam (5.5.8)
Pushed into mysql-trunk-bugfixing (5.6.1-m5)
[17 Dec 2010 12:51] Bugs System
Pushed into mysql-5.5 5.5.9 (revid:georgi.kodinov@oracle.com-20101217124733-p1ivu6higouawv8l) (version source revid:bar@mysql.com-20101215095837-lkrt9d2srudod5b5) (merge vers: 5.5.8) (pib:24)
[17 Dec 2010 12:55] Bugs System
Pushed into mysql-trunk 5.6.1 (revid:georgi.kodinov@oracle.com-20101217125013-y8pb3az32rtbplc9) (version source revid:bar@mysql.com-20101215103503-jge0qg33poet3ffh) (merge vers: 5.6.1) (pib:24)
[22 Dec 2010 19:27] Paul Dubois
Noted in 5.5.9 changelog.

The ucs2 character set does not support characters outside the Basic
Multilingual Plane (BMP), but converting a string containing such
characters did not produce a conversion-failure warning.