Bug #30946 mysqldump silently ignores --default-character-set when used with --tab
Submitted: 11 Sep 2007 7:19 Modified: 18 Aug 2009 18:11
Reporter: Marek Matula Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: mysqldump Command-line Client Severity:S3 (Non-critical)
Version:5.0.45 OS:Linux
Assigned to: Gleb Shchepa CPU Architecture:Any
Tags: character set, mysqldump, tab

[11 Sep 2007 7:19] Marek Matula
Description:
When mysqldump is used with both "--default-character-set" and "--tab" the database content in tab-separated data files is always in its internal character set and the --default-character-set switch seems to be ignored. According to the documentation it should work: it is even suggested for data dumps:

"[...] if you write a data file with mysqldump -T or by issuing a SELECT ... INTO OUTFILE statement in mysql, be sure to use a --default-character-set option with mysqldump or mysql so that output is written in the character set to be used when the file is loaded with LOAD DATA INFILE."

How to repeat:
1. Create a table with non-UTF8 character set: eg. latin1 or latin2
2. insert some data containing non-ASCII characters
3. dump the table with --default-character-set=utf8 and --tab=/path/to/dump
4. The data in tab-separated file is still in latin1 or latin2 and not UTF8

Suggested fix:
mysqldump should not ignore --default-character-set when dumping data with --tab
[18 Sep 2007 23:42] MySQL Verification Team
Thank you for the bug report. Maybe a documentation bug.
[19 Sep 2007 6:30] Marek Matula
I believe this is not a documentation bug. If it was, it would be not possible to dump (and restore) tables that have more than one charset in their fields!
[26 Sep 2007 17:23] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/34566

ChangeSet@1.2531, 2007-09-26 18:22:25+02:00, gshchepa@devsrv-b.mysql.com +5 -0
  Fixed bug #30946.
  When mysqldump was used with both --default-character-set and
  --tab options the database content in tab-separated data files
  was always in its internal character set and the
  --default-character-set switch was ignored.
  
  The mysqldump client uses the "SELECT * INTO OUTFILE" statement
  to provide the --tab command line switch logic. It is supposed
  to use the symmetrical "LOAD DATA INFILE" statement to load a
  resulting data file back to a database.
  By the definition the "LOAD DATA INFILE" statement interprets
  all fields in the file as having the same character set, regardless
  of the data types of the columns into which field values are loaded.
  The character_set_database system variable or the "CHARACTER
  SET" closure of the "LOAD DATA INFILE" statement is used to define
  that character set. However, the "SELECT ... INTO OUTFILE" statement
  dumps table fields in its internal character sets, so "SELECT INTO
  OUTFILE" and "LOAD DATA INFILE" statements are not symmetrical.
  To not to modify current server semantic (only to extend it) the
  "SELECT ... INTO OUTFILE" statement syntax was extended with the
  "CHARACTER SET" statement like in the "LOAD DATA INFILE".
  
  Documentation should be updated:
  [ http://dev.mysql.com/doc/refman/5.1/en/select.html ]
  SELECT ... INTO OUTFILE is the complement of LOAD DATA INFILE;
  the syntax for the export_options part of the statement consists of
  the same CHARACTER SET, FIELDS and LINES clauses that are used with
          +^^^^^^^^^^^^^^
  the LOAD DATA INFILE statement.
[6 Oct 2007 11:09] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/35051

ChangeSet@1.2582, 2007-10-06 16:10:17+05:00, gshchepa@gleb.loc +7 -0
  Bug#30946: mysqldump silently ignores --default-character-set
  when used with --tab.
  
  Added optional CHARACTER SET clause to the SELECT ... INTO OUTFILE
  (to complement the same clause in LOAD DATA INFILE).
  mysqldump is updated to use it in --tab mode.
[26 Jan 2008 17:17] Baron Schwartz
What does the following mean?

"content in tab-separated data files was always in its internal character set"

Whose internal character set?  The MySQL server, mysqldump, the database, the table, the columns?

It would be very helpful if the documentation answered the question, "What character set is used for SELECT INTO OUTFILE?"
[30 Jan 2008 12:33] Gleb Shchepa
> What does the following mean?

> "content in tab-separated data files was always in its internal character set"

> Whose internal character set?  The MySQL server, mysqldump, the database,
> the table, the columns?

Character set of each column (charset of whole row is a mix of column charsets, or in other words it is BINARY charset).

> It would be very helpful if the documentation answered the question,
>  "What character set is used for SELECT INTO OUTFILE?"

Current implementation uses BINARY charset. Bugfix will extend SELECT INTO OUTFILE syntax to allow data dumping with a special charset value.
[5 Feb 2008 2:38] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/41685

ChangeSet@1.2594, 2008-02-05 06:37:04+04:00, gshchepa@host.loc +10 -0
  Fixed bugs #30946 and #31677.
  Bug#30946: mysqldump silently ignores --default-character-set
  when used with --tab.
  
  Added optional CHARACTER SET clause to the SELECT ... INTO OUTFILE
  (to complement the same clause in LOAD DATA INFILE).
  mysqldump is updated to use it in --tab mode.
  
  Added complete support for multibyte delimiter characters,
  support for UCS2 harsets and hexadecimal representation of
  malformed characters been added.
  
  Fixed bug with 8-bit separators of LOAD DATA INFILE statement.
  
  New error message has been provided: "Malformed ... delimiter value",
  where "..." is one of "FIELDS TERMINATED BY", "FIELDS ENCLOSED BY",
  "FIELDS ESCAPED BY", "LINES TERMINATED BY" or "LINES STARTING BY",
  to report errors on bad SELECT ... INTO OUTFILE separator values.
[19 Mar 2008 14:41] Alexander Barkov
Hi Gleb, I applied your patch and found some problems:

1. Crash with ucs2 table and big5 output:

CREATE TABLE t1 (
  a varchar(10) default NULL,
  b varchar(20) default NULL
) ENGINE=MyISAM DEFAULT CHARSET=ucs2;

delete from t1;
insert into t1 values (0xFFFF,0xFFFF);
select * into outfile 't1.txt' character set big5 from t1;

ERROR 2013 (HY000): Lost connection to MySQL server during query

2. Wrong file with ucs2 table and default (binary) character set:

drop table if exists t1;
create table t1 (a varchar(10), b varchar(20)) character set ucs2;
insert into t1 values ('a1','b1');
select * into outfile 't1.txt' from t1;
select load_File('test/t1.txt');

+--------------------------+
| load_File('test/t1.txt') |
+--------------------------+
| \0\0\01       \0\0\01
         |
+--------------------------+

This result is wrong. I would expect:
\0a\01<tab>\0b\01<tab><nl>

More comments coming
[26 Jun 2008 14:36] Susanne Ebrecht
Sergei set bug #37610 as duplicate of this bug here.

I made some comments and also debugging on that bug. Please read it also.
[23 Mar 2009 13:45] Susanne Ebrecht
Bug #43288 is related to this bug here.
[23 Jun 2009 10:09] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/76888
[22 Jul 2009 7:24] Alexander Barkov
Review comments sent by e-mail:

http://lists.mysql.com/commits/79141
[23 Jul 2009 11:49] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/79147
[29 Jul 2009 18:36] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/79554
[31 Jul 2009 17:56] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/79790
[3 Aug 2009 19:27] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/79959

3052 Gleb Shchepa	2009-08-04
      Additional post-commit fix (bug #30946): server version
      in a conditional commentary at the new SELECT ... INTO
      OUTFILE ... CHARACTER SET syntax has been updated to 5.1.38.
[4 Aug 2009 19:52] Bugs System
Pushed into 5.4.4-alpha (revid:alik@sun.com-20090804194615-h40sa098mx4z49qg) (version source revid:gshchepa@mysql.com-20090803192436-r8nqv6zi0a6did5t) (merge vers: 5.4.4-alpha) (pib:11)
[4 Aug 2009 20:45] Bugs System
Pushed into 5.1.38 (revid:davi.arnaut@sun.com-20090804204317-ggodqkik7de6nfpz) (version source revid:davi.arnaut@sun.com-20090804204317-ggodqkik7de6nfpz) (merge vers: 5.1.38) (pib:11)
[18 Aug 2009 18:11] Paul DuBois
Noted in 5.1.38, 5.4.4 changelogs.

Previously, SELECT ... INTO OUTFILE dumped column values without
character set conversion, which could produce data files that cannot
be imported without error if different columns used different 
character sets. A consequence of this is that mysqldump ignored the
--default-character-set option if the --tab option was given (which
causes SELECT ... INTO OUTFILE to be used to dump data.)

INTO OUTFILE now can be followed by a CHARACTER SET clause indicating
the character set to which dumped values should be converted. Also,
mysqldump adds a CHARACTER SET clause to the SELECT ... INTO OUTFILE
statement used to dump data, so that --default-character-set is no
longer ignored if --tab is given.

Other changes are that SELECT ... INTO OUTFILE enforces that ENCLOSED
BY and ESCAPED BY arguments must be a single character, and SELECT
... INTO OUTFILE and LOAD DATA INFILE produce warnings if non-ASCII
field or line separators are specified.
[1 Oct 2009 5:59] Bugs System
Pushed into 5.1.39-ndb-6.3.28 (revid:jonas@mysql.com-20091001055605-ap2kiaarr7p40mmv) (version source revid:jonas@mysql.com-20091001055605-ap2kiaarr7p40mmv) (merge vers: 5.1.39-ndb-6.3.28) (pib:11)
[1 Oct 2009 7:25] Bugs System
Pushed into 5.1.39-ndb-7.0.9 (revid:jonas@mysql.com-20091001072547-kv17uu06hfjhgjay) (version source revid:jonas@mysql.com-20091001071652-irejtnumzbpsbgk2) (merge vers: 5.1.39-ndb-7.0.9) (pib:11)
[1 Oct 2009 13:25] Bugs System
Pushed into 5.1.39-ndb-7.1.0 (revid:jonas@mysql.com-20091001123013-g9ob2tsyctpw6zs0) (version source revid:jonas@mysql.com-20091001123013-g9ob2tsyctpw6zs0) (merge vers: 5.1.39-ndb-7.1.0) (pib:11)
[5 Oct 2009 10:50] Bugs System
Pushed into 5.1.39-ndb-6.2.19 (revid:jonas@mysql.com-20091005103850-dwij2dojwpvf5hi6) (version source revid:jonas@mysql.com-20090930185117-bhud4ek1y0hsj1nv) (merge vers: 5.1.39-ndb-6.2.19) (pib:11)
[5 Oct 2009 18:19] Paul DuBois
This fix now has been pushed into 5.4.2.
[22 Apr 2010 22:07] Ryan Mack
I tracked a new bug back to this one (http://bugs.mysql.com/bug.php?id=53088).  It appears the new charset conversion code truncates all converted text/blob columns to MAX_FIELD_WIDTH (766) bytes.  I am not sure whether it is desirable to convert binary format columns to the default charset at all.