Bug #46819 mysqldump without default-character-set made invalid string data.
Submitted: 20 Aug 2009 7:26 Modified: 20 Aug 2009 8:11
Reporter: Meiji KIMURA Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: mysqldump Command-line Client Severity:S3 (Non-critical)
Version:5.0, 5.1, next bzr OS:Any
Assigned to: CPU Architecture:Any

[20 Aug 2009 7:26] Meiji KIMURA
Description:
mysqldump without default-character-set assume that character set is utf8.

http://dev.mysql.com/doc/refman/5.1/en/mysqldump.html
If no character set is specified, mysqldump uses utf8, and earlier versions use latin1.

Almost of all cases it works well, but if my.conf options specified like this, mysqldump made invalid string data.

[mysqld]
default-character-set = eucjpms
skip-character-set-client-handshake

At that time, mysqldump's header display 'utf8', but all string data are encoded into eucjpms.
So it cannot recover original database data. 

How to repeat:
(1) Set my.cnf's option like these.

[mysqld]
default-character-set = eucjpms
skip-character-set-client-handshake

(2) Create DB, table, data.
CREATE DATABASE ch_check DEFAULT CHARACTER SET utf8;

CREATE TABLE t1(
`ch_utf8` char(1) character set utf8 NOT NULL,
`ch_eucjpms` char(1) NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=eucjpms

insert into t1 values('開','開');

mysql> select hex(ch_utf8), hex(ch_eucjpms) from t1;
+--------------+-----------------+
| hex(ch_utf8) | hex(ch_eucjpms) |
+--------------+-----------------+
| E9968B       | B3AB            | 
+--------------+-----------------+
1 row in set (0.00 sec)

'E9968B' means '開' in utf8.
'B3AB' means '開' in eucjpms.

(3) dump data.

mysqldump --opt --lock-tables=FALSE -t -uroot ch_check t1 > t1.dmp;

(4) check data. 
Here is a part of t1.dmp via binary editor. Values are encoded in eucjpms('B3AB').
They should be 'E9968B'.
 
00000390  41 42 4C 45 20 4B 45 59 53 20 2A 2F 3B 0A 49 4E ABLE KEYS */;.IN
000003A0  53 45 52 54 20 49 4E 54 4F 20 60 74 31 60 20 56 SERT INTO `t1` V
000003B0  41 4C 55 45 53 20 28 27 B3 AB 27 2C 27 B3 AB 27 ALUES ('..','..'
000003C0  29 3B 0A 2F 2A 21 34 30 30 30 30 20 41 4C 54 45 );./*!40000 ALTE

Here is a t1.dmp.
-- MySQL dump 10.13  Distrib 5.1.31sp1, for unknown-linux-gnu (x86_64)
--
-- Host: localhost    Database: ch_check
-- ------------------------------------------------------
-- Server version       5.1.31sp1-enterprise-gpl-advanced

/*!40101 SET @OLD_CHARACTER_SET_CLIENT=@@CHARACTER_SET_CLIENT */;
/*!40101 SET @OLD_CHARACTER_SET_RESULTS=@@CHARACTER_SET_RESULTS */;
/*!40101 SET @OLD_COLLATION_CONNECTION=@@COLLATION_CONNECTION */;
/*!40101 SET NAMES utf8 */;
/*!40103 SET @OLD_TIME_ZONE=@@TIME_ZONE */;
/*!40103 SET TIME_ZONE='+00:00' */;
/*!40014 SET @OLD_UNIQUE_CHECKS=@@UNIQUE_CHECKS, UNIQUE_CHECKS=0 */;
/*!40014 SET @OLD_FOREIGN_KEY_CHECKS=@@FOREIGN_KEY_CHECKS, FOREIGN_KEY_CHECKS=0 */;
/*!40101 SET @OLD_SQL_MODE=@@SQL_MODE, SQL_MODE='NO_AUTO_VALUE_ON_ZERO' */;
/*!40111 SET @OLD_SQL_NOTES=@@SQL_NOTES, SQL_NOTES=0 */;

--
-- Dumping data for table `t1`
--

LOCK TABLES `t1` WRITE;
/*!40000 ALTER TABLE `t1` DISABLE KEYS */;
INSERT INTO `t1` VALUES ('³«','³«');
/*!40000 ALTER TABLE `t1` ENABLE KEYS */;
UNLOCK TABLES;
/*!40103 SET TIME_ZONE=@OLD_TIME_ZONE */;

/*!40101 SET SQL_MODE=@OLD_SQL_MODE */;
/*!40014 SET FOREIGN_KEY_CHECKS=@OLD_FOREIGN_KEY_CHECKS */;
/*!40014 SET UNIQUE_CHECKS=@OLD_UNIQUE_CHECKS */;
/*!40101 SET CHARACTER_SET_CLIENT=@OLD_CHARACTER_SET_CLIENT */;
/*!40101 SET CHARACTER_SET_RESULTS=@OLD_CHARACTER_SET_RESULTS */;
/*!40101 SET COLLATION_CONNECTION=@OLD_COLLATION_CONNECTION */;
/*!40111 SET SQL_NOTES=@OLD_SQL_NOTES */;

-- Dump completed on 2009-08-20  7:19:37

Suggested fix:
[workaround]

Specify default-character-set explicitly.

mysqldump --default-character-set=eucjpms --opt --lock-tables=FALSE -t -uroot ch_check t1 > t1.dmp; 

[Suggested fix]

mysqldump use 'SET NAMES' based on default-character-set.
[20 Aug 2009 7:36] Meiji KIMURA
I'm wrong. Here is a right [suggest to fix]

mysqldump use 'SET NAMES' based on skip-character-set-client-handshake. (it's determined by default-character-set).
[20 Aug 2009 8:11] Sveta Smirnova
Thank you for the report.

Verified as described. If not specify option -t mysqldump dumps everything correctly.
[21 Aug 2009 1:02] Meiji KIMURA
As mentioned by sveta, it runs well without -t option.

mysqldump --opt --lock-tables=FALSE -uroot ch_check t1 > t2.dmp;

All string data are encoded in utf8.(E9968B).

000004E0  20 44 49 53 41 42 4C 45 20 4B 45 59 53 20 2A 2F  DISABLE KEYS */
000004F0  3B 0A 49 4E 53 45 52 54 20 49 4E 54 4F 20 60 74 ;.INSERT INTO `t
00000500  31 60 20 56 41 4C 55 45 53 20 28 27 E9 96 8B 27 1` VALUES ('...'
00000510  2C 27 E9 96 8B 27 29 3B 0A 2F 2A 21 34 30 30 30 ,'...');./*!4000
[8 Nov 2010 20:07] Sheeri Cabral
This is actually a very big problem, because often people put non US-ASCII characters in latin1.  When these characters are exported with mysqldump with a default of utf8, the characters are garbled, but there's no warnings or errors.  

This can lead to lost data, especially when upgrading a major series (e.g. 5.0 to 5.1) because you're supposed to export and import the data.  When importing a backup of an InnoDB table, if there is an error with one of the parts of the INSERT, the whole INSERT statement rolls back.  I have experienced major data loss because the garbled characters cause an error when INSERTed, and it causes perfectly fine data not to import because they're in the same INSERT statement as the garbled data.

This is a huge issue for anyone who has upgraded from previous versions where mysqldump (and the server) were latin1 by default.  Now their server is still latin1, but mysqldump is happening with utf8.

Here's an example:

First, set variables such on a MySQL server (5.0 or 5.1):
mysql> show global variables like '%char%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | latin1                     | 
| character_set_connection | latin1                     | 
| character_set_database   | latin1                     | 
| character_set_filesystem | binary                     | 
| character_set_results    | latin1                     | 
| character_set_server     | latin1                     | 
| character_set_system     | utf8                       | 
| character_sets_dir       | /usr/share/mysql/charsets/ | 
+--------------------------+----------------------------+
8 rows in set (0.00 sec)

Then create these tables with data:

CREATE TABLE `test1` (
  `kwid` int(10) unsigned NOT NULL default '0',
  `keyword` varchar(80) NOT NULL default ''
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

INSERT INTO `test1` VALUES (1,'watching'),(2,'poet'),(3,'просмотра'),(4,'Поэту');

CREATE TABLE `test2` (
  `kwid` int(10) unsigned NOT NULL default '0',
  `keyword` varchar(80) NOT NULL default ''
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

INSERT INTO `test2` VALUES (1,'watching'),(2,'poet'),(3,'просмотра'),(4,'Поэту');

Now compare:
 mysqldump -u root lj > test_export_utf8.sql
 mysqldump --default-character-set=latin1 -u root lj > test_export_latin1.sql

Note that the test export with utf8 has garbled data whereas the export with latin1 does not.  mysqldump does set the character_set_client but it is not enough.

mysqldump should be changed to set the default character set to whatever the server is set to.  There is no reason this bug has to exist.
[8 Nov 2010 20:07] Sheeri Cabral
I would say that this is in fact a serious bug, not non-critical.
[8 Nov 2010 20:28] Sheeri Cabral
I just confirmed that the same behavior in the test case is apparent even when the server charset variables are set to:

mysql> show global variables like 'char%';
+--------------------------+--------------------------------------------------------+
| Variable_name            | Value                                                  |
+--------------------------+--------------------------------------------------------+
| character_set_client     | utf8                                                   |
| character_set_connection | utf8                                                   |
| character_set_database   | utf8                                                   |
| character_set_filesystem | binary                                                 |
| character_set_results    | utf8                                                   |
| character_set_server     | utf8                                                   |
| character_set_system     | utf8                                                   |
| character_sets_dir       | /usr/local/mysql-5.1.50-osx10.6-x86_64/share/charsets/ |
+--------------------------+--------------------------------------------------------+
8 rows in set (0.00 sec)

So I'm not sure why a server with utf8 as the default requires mysqldump to use latin1 to export properly, even for a utf8 table.