MySQL Bugs: #64180: Mysql does not accept unicode above 0xFFFF in utf8

Bug #64180	Mysql does not accept unicode above 0xFFFF in utf8_* collations
Submitted:	31 Jan 2012 13:46	Modified:	6 Jul 2012 15:36
Reporter:	Jonas Wielicki	Email Updates:
Status:	Not a Bug	Impact on me:	None
Category:	MySQL Server: Charsets	Severity:	S3 (Non-critical)
Version:	5.5.19	OS:	Any
Assigned to:		CPU Architecture:	Any
Tags:	unicode catface

Description:
Mysql refuses utf8 characters with a unicode codepoint value above 0xFFFF. This might be related to the connection encoding which was set to utf8.

We get the following warning:
Warning: #1366 Incorrect string value: '\xF0\x9F\x98\xB8' for column 'Test' at row 1
for the example character (not including the character itself here):
U+1F638 (128568) 	GRINNING CAT FACE WITH SMILING EYES

The character 0xFFEE works, while 0x10000 does not.

Tested with:
mysql --version
mysql  Ver 14.14 Distrib 5.5.19, for Linux (x86_64) using readline 5.1
echo $LANG
en_GB.utf8

How to repeat:
1.
CREATE DATABASE `test`;
2.
SET CHARSET utf8;
3.
CREATE TABLE `test`.`test` (
  `Test` TEXT CHARACTER SET utf8 COLLATE utf8_bin NOT NULL
) ENGINE = InnoDB CHARACTER SET utf8 COLLATE utf8_bin;
4.
INSERT INTO `test`.`test` (`Test`) VALUES (UNHEX('F09F98B8'));

instead of UNHEX you can also pass the character itself (can be copied from here: <https://en.wikipedia.org/wiki/List_of_emoticons#Unicode_characters>).

Same on Windows:

C:\Program Files\MySQL\MySQL Server 5.5\bin>mysql -uroot -proot test
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 24
Server version: 5.5.19 MySQL Community Server (GPL)

Copyright (c) 2000, 2011, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> set charset utf8;
Query OK, 0 rows affected (0.11 sec)

mysql> CREATE TABLE `test`.`test` (
    ->   `Test` TEXT CHARACTER SET utf8 COLLATE utf8_bin NOT NULL
    -> ) ENGINE = InnoDB CHARACTER SET utf8 COLLATE utf8_bin;
Query OK, 0 rows affected (0.44 sec)

mysql> INSERT INTO `test`.`test` (`Test`) VALUES (UNHEX('F09F98B8'));
ERROR 1366 (HY000): Incorrect string value: '\xF0\x9F\x98\xB8' for column 'Test'
 at row 1
mysql> show variables like 'char%';
+--------------------------+----------------------------------------------------
-----+
| Variable_name            | Value
     |
+--------------------------+----------------------------------------------------
-----+
| character_set_client     | utf8
     |
| character_set_connection | utf8
     |
| character_set_database   | utf8
     |
| character_set_filesystem | binary
     |
| character_set_results    | utf8
     |
| character_set_server     | utf8
     |
| character_set_system     | utf8
     |
| character_sets_dir       | C:\Program Files\MySQL\MySQL Server 5.5\share\chars
ets\ |
+--------------------------+----------------------------------------------------
-----+
8 rows in set (0.05 sec)

Aren't you/we supposed to use utf8_mb4 charset in such cases?

Is there any design rationale splitting the UTF-8 room into separate charsets? And what exactly will happen if we some day need five UTF-8 bytes (UTF-8 may go up to eight bytes, a 0xF7 leading byte and seven data bytes)?

MySQL will have to reply to this (and for clarity I am not a MySQL employee). 

Actually in (the now discontinued) MySQL 6.0 (and maybe even 5.2) 'utf8' had a max length of 4 bytes per character. A utf8_mb3 charset was introduced with a max length of 3 bytes (and totally identical to utf8 in MySQL 4.1, 5.0 and 5.1).  With MySQL 5.5 utf8_mb4 was introduced (identical to utf8 in MySQL 6.0) and utf8 was continues identical to utf8 in MySQL 4.1, 5.0 and 5.1 (and indentical to utf8_mb3 in MySQL 6.0).

I think the basic reason for the difference in in 6.0 and 5.5/5.6 is that upgrading from 5.1 to 6.0 had some problem with utf8 data.  A rationale could be that very few users will use unicode characters outside the Basic Multilingual Plane in their database and there probably is a little penalty the more bytes you allow per character. 

There is a funny detail more that I reported here: http://bugs.mysql.com/bug.php?id=62458 .  If you have a test case with utf8_mb4 it cannot be handled properly here on bugs.mysql.com - probably becuase the database backend does not use utf8_mb4 (webservers and browsers have no problem as long as there is font support).  so you have a feature for which bugs cannot be reported (unless attaching text files etc.)

This is not a bug. Character set "utf8" supports only characters in the range
U+0000..U+FFFF.
To use characters U+10000 and higher, please use character set "utf8mb4"
instead.

mysql> drop table if exists t1; CREATE TABLE t1 (a TEXT CHARACTER SET
utf8mb4); INSERT INTO t1 VALUES (UNHEX('F09F98B8')); SELECT hex(a) FROM t1;
Query OK, 0 rows affected (0.09 sec)

Query OK, 0 rows affected (0.33 sec)

Query OK, 1 row affected (0.02 sec)

+----------+
| hex(a)   |
+----------+
| F09F98B8 |
+----------+
1 row in set (0.00 sec)