| Bug #64180 | Mysql does not accept unicode above 0xFFFF in utf8_* collations | ||
|---|---|---|---|
| Submitted: | 31 Jan 2012 13:46 | Modified: | 6 Jul 2012 15:36 | 
| Reporter: | Jonas Wielicki | Email Updates: | |
| Status: | Not a Bug | Impact on me: | |
| Category: | MySQL Server: Charsets | Severity: | S3 (Non-critical) | 
| Version: | 5.5.19 | OS: | Any | 
| Assigned to: | CPU Architecture: | Any | |
| Tags: | unicode catface | ||
   [31 Jan 2012 14:12]
   Valeriy Kravchuk        
  Same on Windows:
C:\Program Files\MySQL\MySQL Server 5.5\bin>mysql -uroot -proot test
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 24
Server version: 5.5.19 MySQL Community Server (GPL)
Copyright (c) 2000, 2011, Oracle and/or its affiliates. All rights reserved.
Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql> set charset utf8;
Query OK, 0 rows affected (0.11 sec)
mysql> CREATE TABLE `test`.`test` (
    ->   `Test` TEXT CHARACTER SET utf8 COLLATE utf8_bin NOT NULL
    -> ) ENGINE = InnoDB CHARACTER SET utf8 COLLATE utf8_bin;
Query OK, 0 rows affected (0.44 sec)
mysql> INSERT INTO `test`.`test` (`Test`) VALUES (UNHEX('F09F98B8'));
ERROR 1366 (HY000): Incorrect string value: '\xF0\x9F\x98\xB8' for column 'Test'
 at row 1
mysql> show variables like 'char%';
+--------------------------+----------------------------------------------------
-----+
| Variable_name            | Value
     |
+--------------------------+----------------------------------------------------
-----+
| character_set_client     | utf8
     |
| character_set_connection | utf8
     |
| character_set_database   | utf8
     |
| character_set_filesystem | binary
     |
| character_set_results    | utf8
     |
| character_set_server     | utf8
     |
| character_set_system     | utf8
     |
| character_sets_dir       | C:\Program Files\MySQL\MySQL Server 5.5\share\chars
ets\ |
+--------------------------+----------------------------------------------------
-----+
8 rows in set (0.05 sec)
 
   [31 Jan 2012 15:09]
   Peter Laursen        
  Aren't you/we supposed to use utf8_mb4 charset in such cases?
   [31 Jan 2012 15:45]
   Jonas Wielicki        
  Is there any design rationale splitting the UTF-8 room into separate charsets? And what exactly will happen if we some day need five UTF-8 bytes (UTF-8 may go up to eight bytes, a 0xF7 leading byte and seven data bytes)?
   [31 Jan 2012 16:03]
   Peter Laursen        
  MySQL will have to reply to this (and for clarity I am not a MySQL employee). Actually in (the now discontinued) MySQL 6.0 (and maybe even 5.2) 'utf8' had a max length of 4 bytes per character. A utf8_mb3 charset was introduced with a max length of 3 bytes (and totally identical to utf8 in MySQL 4.1, 5.0 and 5.1). With MySQL 5.5 utf8_mb4 was introduced (identical to utf8 in MySQL 6.0) and utf8 was continues identical to utf8 in MySQL 4.1, 5.0 and 5.1 (and indentical to utf8_mb3 in MySQL 6.0). I think the basic reason for the difference in in 6.0 and 5.5/5.6 is that upgrading from 5.1 to 6.0 had some problem with utf8 data. A rationale could be that very few users will use unicode characters outside the Basic Multilingual Plane in their database and there probably is a little penalty the more bytes you allow per character. There is a funny detail more that I reported here: http://bugs.mysql.com/bug.php?id=62458 . If you have a test case with utf8_mb4 it cannot be handled properly here on bugs.mysql.com - probably becuase the database backend does not use utf8_mb4 (webservers and browsers have no problem as long as there is font support). so you have a feature for which bugs cannot be reported (unless attaching text files etc.)
   [6 Jul 2012 15:35]
   Alexander Barkov        
  This is not a bug. Character set "utf8" supports only characters in the range
U+0000..U+FFFF.
To use characters U+10000 and higher, please use character set "utf8mb4"
instead.
mysql> drop table if exists t1; CREATE TABLE t1 (a TEXT CHARACTER SET
utf8mb4); INSERT INTO t1 VALUES (UNHEX('F09F98B8')); SELECT hex(a) FROM t1;
Query OK, 0 rows affected (0.09 sec)
Query OK, 0 rows affected (0.33 sec)
Query OK, 1 row affected (0.02 sec)
+----------+
| hex(a)   |
+----------+
| F09F98B8 |
+----------+
1 row in set (0.00 sec)
 
Description: Mysql refuses utf8 characters with a unicode codepoint value above 0xFFFF. This might be related to the connection encoding which was set to utf8. We get the following warning: Warning: #1366 Incorrect string value: '\xF0\x9F\x98\xB8' for column 'Test' at row 1 for the example character (not including the character itself here): U+1F638 (128568) GRINNING CAT FACE WITH SMILING EYES The character 0xFFEE works, while 0x10000 does not. Tested with: mysql --version mysql Ver 14.14 Distrib 5.5.19, for Linux (x86_64) using readline 5.1 echo $LANG en_GB.utf8 How to repeat: 1. CREATE DATABASE `test`; 2. SET CHARSET utf8; 3. CREATE TABLE `test`.`test` ( `Test` TEXT CHARACTER SET utf8 COLLATE utf8_bin NOT NULL ) ENGINE = InnoDB CHARACTER SET utf8 COLLATE utf8_bin; 4. INSERT INTO `test`.`test` (`Test`) VALUES (UNHEX('F09F98B8')); instead of UNHEX you can also pass the character itself (can be copied from here: <https://en.wikipedia.org/wiki/List_of_emoticons#Unicode_characters>).