Bug #64180 | Mysql does not accept unicode above 0xFFFF in utf8_* collations | ||
---|---|---|---|
Submitted: | 31 Jan 2012 13:46 | Modified: | 6 Jul 2012 15:36 |
Reporter: | Jonas Wielicki | Email Updates: | |
Status: | Not a Bug | Impact on me: | |
Category: | MySQL Server: Charsets | Severity: | S3 (Non-critical) |
Version: | 5.5.19 | OS: | Any |
Assigned to: | CPU Architecture: | Any | |
Tags: | unicode catface |
[31 Jan 2012 13:46]
Jonas Wielicki
[31 Jan 2012 14:12]
Valeriy Kravchuk
Same on Windows: C:\Program Files\MySQL\MySQL Server 5.5\bin>mysql -uroot -proot test Welcome to the MySQL monitor. Commands end with ; or \g. Your MySQL connection id is 24 Server version: 5.5.19 MySQL Community Server (GPL) Copyright (c) 2000, 2011, Oracle and/or its affiliates. All rights reserved. Oracle is a registered trademark of Oracle Corporation and/or its affiliates. Other names may be trademarks of their respective owners. Type 'help;' or '\h' for help. Type '\c' to clear the current input statement. mysql> set charset utf8; Query OK, 0 rows affected (0.11 sec) mysql> CREATE TABLE `test`.`test` ( -> `Test` TEXT CHARACTER SET utf8 COLLATE utf8_bin NOT NULL -> ) ENGINE = InnoDB CHARACTER SET utf8 COLLATE utf8_bin; Query OK, 0 rows affected (0.44 sec) mysql> INSERT INTO `test`.`test` (`Test`) VALUES (UNHEX('F09F98B8')); ERROR 1366 (HY000): Incorrect string value: '\xF0\x9F\x98\xB8' for column 'Test' at row 1 mysql> show variables like 'char%'; +--------------------------+---------------------------------------------------- -----+ | Variable_name | Value | +--------------------------+---------------------------------------------------- -----+ | character_set_client | utf8 | | character_set_connection | utf8 | | character_set_database | utf8 | | character_set_filesystem | binary | | character_set_results | utf8 | | character_set_server | utf8 | | character_set_system | utf8 | | character_sets_dir | C:\Program Files\MySQL\MySQL Server 5.5\share\chars ets\ | +--------------------------+---------------------------------------------------- -----+ 8 rows in set (0.05 sec)
[31 Jan 2012 15:09]
Peter Laursen
Aren't you/we supposed to use utf8_mb4 charset in such cases?
[31 Jan 2012 15:45]
Jonas Wielicki
Is there any design rationale splitting the UTF-8 room into separate charsets? And what exactly will happen if we some day need five UTF-8 bytes (UTF-8 may go up to eight bytes, a 0xF7 leading byte and seven data bytes)?
[31 Jan 2012 16:03]
Peter Laursen
MySQL will have to reply to this (and for clarity I am not a MySQL employee). Actually in (the now discontinued) MySQL 6.0 (and maybe even 5.2) 'utf8' had a max length of 4 bytes per character. A utf8_mb3 charset was introduced with a max length of 3 bytes (and totally identical to utf8 in MySQL 4.1, 5.0 and 5.1). With MySQL 5.5 utf8_mb4 was introduced (identical to utf8 in MySQL 6.0) and utf8 was continues identical to utf8 in MySQL 4.1, 5.0 and 5.1 (and indentical to utf8_mb3 in MySQL 6.0). I think the basic reason for the difference in in 6.0 and 5.5/5.6 is that upgrading from 5.1 to 6.0 had some problem with utf8 data. A rationale could be that very few users will use unicode characters outside the Basic Multilingual Plane in their database and there probably is a little penalty the more bytes you allow per character. There is a funny detail more that I reported here: http://bugs.mysql.com/bug.php?id=62458 . If you have a test case with utf8_mb4 it cannot be handled properly here on bugs.mysql.com - probably becuase the database backend does not use utf8_mb4 (webservers and browsers have no problem as long as there is font support). so you have a feature for which bugs cannot be reported (unless attaching text files etc.)
[6 Jul 2012 15:35]
Alexander Barkov
This is not a bug. Character set "utf8" supports only characters in the range U+0000..U+FFFF. To use characters U+10000 and higher, please use character set "utf8mb4" instead. mysql> drop table if exists t1; CREATE TABLE t1 (a TEXT CHARACTER SET utf8mb4); INSERT INTO t1 VALUES (UNHEX('F09F98B8')); SELECT hex(a) FROM t1; Query OK, 0 rows affected (0.09 sec) Query OK, 0 rows affected (0.33 sec) Query OK, 1 row affected (0.02 sec) +----------+ | hex(a) | +----------+ | F09F98B8 | +----------+ 1 row in set (0.00 sec)