Bug #71746 mysqldump may use utf8mb4 as default-character-set
Submitted: 17 Feb 2014 8:53
Reporter: Sadao Hiratsuka Email Updates:
Status: Open Impact on me:
None 
Category:MySQL Server: Charsets Severity:S4 (Feature request)
Version:5.6.16 OS:Any
Assigned to: CPU Architecture:Any
Tags: mysqldump

[17 Feb 2014 8:53] Sadao Hiratsuka
Description:
mysqldump uses utf8 as default-character-set by default,
but it can't dump utf8mb4 data correctly.

If utf8 is chosen for character capacity, utf8mb4 is better than utf8.

How to repeat:
mysql> CREATE TABLE mb4 (point VARCHAR(10) PRIMARY KEY, data VARCHAR(10)) CHARACTER SET utf8mb4;
mysql> INSERT INTO mb4 (point, data) VALUES ('01f300', UNHEX('f09f8c80')); <= U+1F300 CYCLONE
$ mysqldump --no-defaults -u root -p test mb4
...

/*!40101 SET NAMES utf8 */;

--
-- Dumping data for table `mb4`
--

LOCK TABLES `mb4` WRITE;
/*!40000 ALTER TABLE `mb4` DISABLE KEYS */;
INSERT INTO `mb4` VALUES ('01f300','?'); <= can't dump correctly
/*!40000 ALTER TABLE `mb4` ENABLE KEYS */;
UNLOCK TABLES;

Suggested fix:
Please use utf8mb4 as default-character-set.
[5 Mar 2015 22:32] Stephen Van Bruwaene
I'm unable to dump tables that contain utf8mb4 character sets with emoji -- dump happens, but when I try to import, emoji are lost, and I get duplicate errors
[6 Mar 2015 5:31] MySQL Verification Team
Can you dump/reload if you perform the dump with options like this?

mysqldump --hex-blob --default-character-set=binary
[6 Sep 2015 9:05] Daniël van Eeden
Test data: https://github.com/dveeden/mysqlunicodedata (this is a script to import the unicode character database into MySQL, it also has a dump of the data)

The 4-byte characters get replaced by a '?' (why not �, that would be more correct).

The solution is easy: set utf8mb4 as character set for mysqldump.

$ mysqldump --skip-extended-insert unicodedata | grep DOLPHIN
INSERT INTO `ucd` VALUES ('1F42C','?','DOLPHIN','So','0','ON','','','','','N','','','','','');
$ mysqldump --default-character-set=utf8mb4 --skip-extended-insert unicodedata | grep DOLPHIN
INSERT INTO `ucd` VALUES ('1F42C','
[6 Sep 2015 9:06] Daniël van Eeden
With workaround because bugs.mysql.com is not yet using utf8mb4 :(

Test data: https://github.com/dveeden/mysqlunicodedata (this is a script to import the unicode character database into MySQL, it also has a dump of the data)

The 4-byte characters get replaced by a '?' (why not �, that would be more correct).

The solution is easy: set utf8mb4 as character set for mysqldump.

$ mysqldump --skip-extended-insert unicodedata | grep DOLPHIN
INSERT INTO `ucd` VALUES ('1F42C','?','DOLPHIN','So','0','ON','','','','','N','','','','','');
$ mysqldump --default-character-set=utf8mb4 --skip-extended-insert unicodedata | grep DOLPHIN
INSERT INTO `ucd` VALUES ('1F42C','___ACTUAL__DOLPHIN__CHAR__','DOLPHIN','So','0','ON','','','','','N','','','','','');

This data is in a utf8mb4 column, mysqldump should detect that and
- Give an error/warning if it can't properly dump the data
- Use utf8mb4

Why not make utf8mb4 the default for mysqldump?
[6 Sep 2015 9:10] Daniël van Eeden
This also affects mysqlpump, created a separate bugreport for that
Bug #78348 	mysqlpump support for 4-byte UTF-8
[6 Sep 2015 9:12] Daniël van Eeden
For bugs.mysql.com there is this open bug...
Bug #62458 	Not possible to report issues involving utf8mb4 characters
[6 Sep 2015 9:15] Daniël van Eeden
Fixing this could be as easy as:
------------------------------------------------------------------
diff --git a/client/base/mysql_connection_options.cc b/client/base/mysql_connection_options.cc
index 4a11c89..97c78a2 100644
--- a/client/base/mysql_connection_options.cc
+++ b/client/base/mysql_connection_options.cc
@@ -161,7 +161,7 @@ MYSQL* Mysql_connection_options::create_connection()
   }
   else
   {
-    mysql_options(connection, MYSQL_SET_CHARSET_NAME, "utf8");
+    mysql_options(connection, MYSQL_SET_CHARSET_NAME, "utf8mb4");
   }
   if (this->m_plugin_dir.has_value())
     mysql_options(connection, MYSQL_PLUGIN_DIR,
------------------------------------------------------------------
[7 Sep 2015 9:57] Daniël van Eeden
I don't consider this 'S4 (Feature request)', but 'S2 (Serious)' because it silently drops the data.
[8 Sep 2015 5:30] MySQL Verification Team
it does not matter what charset is used - one cannot mysqldump text data and import it properly. You have to use hex for all char fields.
I filed this years ago, internally:

Bug 14271638 - CHARACTER SET DUPLICATE HANDLING SEEMS BROKEN 

e.g. Try dump this table and import it:
---
set names utf8;
drop table if exists t1;
create table t1(a varchar(1)
character set armscii8 collate armscii8_bin, unique key(a))engine=innodb;
insert into t1 values (0x29),(0xa4);
---
[5 Feb 2016 15:16] Brad Kent
MySQL Workbench also passes default-character-set=utf8  and the option is not configurable
[1 Jan 2018 1:31] Daniel Black
utf8mb4 made the default in ebaff9fffc958030a57d8ea7f1f2d527cac1df64 / WL#10353