Bug #52690 charsets and mysqldump
Submitted: 8 Apr 2010 12:55 Modified: 20 Apr 2010 19:09
Reporter: Susanne Ebrecht Email Updates:
Status: Not a Bug Impact on me:
None 
Category:MySQL Server: mysqldump Command-line Client Severity:S2 (Serious)
Version: OS:Any
Assigned to: CPU Architecture:Any

[8 Apr 2010 12:55] Susanne Ebrecht
Description:
Workbench -> Administrator -> DATA DUMP

This will occur a mysqldump command:

mysqldump ... ----default-character-set=utf8 ...

Ok. This means character_set_client, character_set_connection and character_set_result will get set to utf8.

This means I will get utf8 encoded data out from the database. Which means my data will be stored in file by using utf8 encoding.

All looks OK.

But know look into the dump file:

/*!40101 SET @OLD_SQL_MODE=@@SQL_MODE, SQL_MODE='NO_AUTO_VALUE_ON_ZERO' */;
/*!40111 SET @OLD_SQL_NOTES=@@SQL_NOTES, SQL_NOTES=0 */;

--
-- Table structure for table `t`
--

DROP TABLE IF EXISTS `t`;
/*!40101 SET @saved_cs_client     = @@character_set_client */;

<BREAK>

What is happen above?
What value has character_set_client?

Could be hebrew ... could be latin1 ... could be whatever the client is using when I will restore the dump.

Just say user wasn't careful and it is hebrew. 

<NEXT>

/*!40101 SET character_set_client = utf8 */;

<BREAK>

Now it charset_client is utf8 ... which is ok

<NEXT>

CREATE TABLE `t` (
  `i` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `j` int(11) DEFAULT NULL,
  UNIQUE KEY `i` (`i`)
) ENGINE=MyISAM AUTO_INCREMENT=8193 DEFAULT CHARSET=latin1;

<BREAK>

Table is created by using charset UTF8. Means
when user used e.g. German umlauts as identifiers
(table name or column name or constraint name or index name)
then it will be transfered by using utf8.

It is ok for Unix/Linux systems because most times utf8 is used here today.
But what about Windows? What about Mac?

<NEXT>

/*!40101 SET character_set_client = @saved_cs_client */;

<BREAK>

Aaaaaaaaargs ... what is happen here?

We said that saved_cs_client is hebrew ... which means now character_set_client will be set to hebrew.

Consider, the following data are stored in utf8. So character_set_client should be utf8 and nothing else for getting the data proper back into a database.

<NEXT>
--
-- Dumping data for table `t`
--
.....

<CUT>
<BREAK>

Data were correct utf8 encoded but will get stored in a total gibberish because the file is telling the server that the data are stored in hebrew.

How to repeat:
See above

Suggested fix:
Look that you can fix this charset chaos.
[15 Apr 2010 14:43] Alfredo Kojima
WB already uses --default-character-set=utf8 in mysql cmdline client when importing.
However, the problem with mysqldump itself persists, so I am recategorizing it to a Server bug,
as agreed upon by Susanne.
[20 Apr 2010 19:09] Jim Winstead
I don't see the problem in this poorly-written bug report. The first big block of variable setting by mysqldump circa 5.1 contains this:

/*!40101 SET @OLD_CHARACTER_SET_CLIENT=@@CHARACTER_SET_CLIENT */;
/*!40101 SET @OLD_CHARACTER_SET_RESULTS=@@CHARACTER_SET_RESULTS */;
/*!40101 SET @OLD_COLLATION_CONNECTION=@@COLLATION_CONNECTION */;
/*!40101 SET NAMES utf8 */;

Indeed, mysqldump by default always dumps everything using utf8 unless you specifically tell it not to do so, and does a lot of superfluous setting and resetting of the client character set, but I don't see any actual bug here.