Bug #12123 mysqldump --tab results in text file which can't be imported
Submitted: 22 Jul 2005 20:39 Modified: 31 Oct 2005 19:46
Reporter: Peter Zaitsev (Basic Quality Contributor) Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Replication Severity:S3 (Non-critical)
Version:4.1 OS:Any (all)
Assigned to: Bugs System CPU Architecture:Any

[22 Jul 2005 20:39] Peter Zaitsev
Description:
If you have tables with different character sets in database you will have trouble dumping this table and restoring it by using myqldump --tab     (standard mysqldump works)

The reason is  LOAD DATA INFILE  only  looks at database character set while loading the data, so 
if you have  utf8  database with latin1  table    data will be attempted to load to this table using utf8 character set which  may fail or cause invalid data to be loaded

I assume even worse may happen if data is in different character sets in different columns. 

How to repeat:
1) create database having utf8 character set

2) create table   with default character set latin1 in this table.

3) Populate it with some data with high latin1 characters  (which prsented diferently in utf8)

4) attempt to dump and reload. 

Suggested fix:
I guess at least table character set should be taken into account. Full fix perhaps would be reading target table to get character sets which have to be used for each fields. 

Also some option should be added to specify character set of source/target file
[22 Jul 2005 20:50] Jorge del Conde
Thanks for your bug report Peter
[14 Oct 2005 23:49] Patrick Galbraith
I can't seem to get a problem with this. I created a utf8 db, latin1 table, inserted two rows, one with regular US text, one with German, dumped, and reloaded, and the data is the same before and after. 

mysql> create database udb character set utf8;
Query OK, 1 row affected (0.01 sec)

mysql> connect udb;
Connection id:    13
Current database: udb

mysql> create table lat(a int, b varchar(32)) CHARACTER SET=latin1;
^[[AQuery OK, 0 rows affected (0.46 sec)

mysql> insert into lat values (1, 'this is a test');
Query OK, 1 row affected (0.02 sec)

mysql> insert into lat values (2, 'über nachfaßt');
Query OK, 1 row affected (0.01 sec)

mysql> select * from lat;
+------+----------------+
| a    | b              |
+------+----------------+
|    1 | this is a test |
|    2 | über nachfaßt  |
+------+----------------+
2 rows in set (0.01 sec)

mysql> \q
Bye
patg@krsna:~/mysql-build/mysql-4.1/client> ./mysqldump -uroot -S /tmp/mysqld-4.1-5552.sock --add-drop-database --databases udb > udb.dump
patg@krsna:~/mysql-build/mysql-4.1/client> ./mysql -uroot -S /tmp/mysqld-4.1-5552.sock udb < udb.dump 
patg@krsna:~/mysql-build/mysql-4.1/client> ./mysql -u root -S /tmp/mysqld-4.1-5552.sock udb
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 16 to server version: 4.1.16-debug-log

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

mysql> select * from lat;
+------+----------------+
| a    | b              |
+------+----------------+
|    1 | this is a test |
|    2 | über nachfaßt  |
+------+----------------+
2 rows in set (0.01 sec)
[15 Oct 2005 2:20] Peter Zaitsev
Patrick,

In your case database character is utf8  while table character is latin1. 
Latin1  letters typically have the same codes in utf8 this is why you're lucky. 

You also used the phrase which could be presented in latin1 character set which is always poor test.

I would suggest you to create latin1  database and  utf8  table and test something which can't be presented in latin1, for example russian  -  "Тестовая Строка"    - if  you need some sample text.
[15 Oct 2005 3:53] Patrick Galbraith
Peter,

In the report, you said "Populate it with some data with high latin1 characters". By  "high latin", do you mean "cyrillic"? I'll give that a try. Question though: If a table is supposed to be latin1, is it suposed to transparently deal with inserting cyrillic characters, dumping, and reloading? What does making a table latin1 in a utf8 database mean - as far as moving data around?

Thanks much!
[15 Oct 2005 4:02] Peter Zaitsev
Patrick, 

Sorry I was unclear in this case.  I just repeated this bug with 4.1.14. Here is full session:

mysql> create database xx character set latin1;
Query OK, 1 row affected (0.06 sec)

mysql> use xx;
Database changed
mysql> create table  tbl(c varchar(255)) character set utf8;
Query OK, 0 rows affected (0.01 sec)

mysql> set names utf8;
Query OK, 0 rows affected (0.00 sec)

mysql> insert into tbl values ("тест тест тест");
Query OK, 1 row affected (0.00 sec)

mysql> select * from tbl;
+----------------------------+
| c                          |
+----------------------------+
| тест тест тест             |
+----------------------------+
1 row in set (0.00 sec)

mysql> select * from tbl into outfile '/tmp/bug';
Query OK, 1 row affected (0.00 sec)

mysql> load data infile '/tmp/bug' into table tbl;
Query OK, 1 row affected (0.01 sec)
Records: 1  Deleted: 0  Skipped: 0  Warnings: 0

mysql> select * from tbl;
+----------------------------------------------------------+
| c                                                        |
+----------------------------------------------------------+
| тест тест тест                                           |
| теÑÑ‚ теÑÑ‚ теÑÑ‚                               |
+----------------------------------------------------------+
2 rows in set (0.00 sec)

As you can see the data which was loaded back in the table is different than what was offloaded from this table :)
[15 Oct 2005 4:26] Patrick Galbraith
Cutting the cyrillic from the email just gives an emptry string, so I saved the file to disk, then used vim and cut and pasted between xterms:

mysql> insert into lat values (3, '–¢–µ—~A—~B–æ–≤–∞—~O –°—~B—~@–æ–∫–∞');
Query OK, 1 row affected, 1 warning (0.01 sec)

mysql> select * from lat;
+------+----------------------------------+
| a    | b                                |
+------+----------------------------------+
|    1 | this is a test                   |
|    2 | ¸ber nachfaflt                    |
|    3 | –¢–µ—~A—~B–æ–≤–∞—~O –°—~B—~@–æ–∫ |
+------+----------------------------------+
3 rows in set (0.02 sec)

mysql> \q
Bye
patg@krsna:~/mysql-build/mysql-4.1/client> mysqldump -uroot -S /tmp/mysqld-4.1-5552.sock --add-drop-database --databases udb > udb.dump
patg@krsna:~/mysql-build/mysql-4.1/client> vi udb.dump 
patg@krsna:~/mysql-build/mysql-4.1/client> ./mysql -u root -S /tmp/mysqld-4.1-5552.sock udb < udb.dump 
patg@krsna:~/mysql-build/mysql-4.1/client> ./mysql -u root -S /tmp/mysqld-4.1-5552.sock udb
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 43 to server version: 4.1.16-debug-log

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

mysql> select * from lat;
+------+----------------------------------+
| a    | b                                |
+------+----------------------------------+
|    1 | this is a test                   |
|    2 | ¸ber nachfaflt                    |
|    3 | –¢–µ—~A—~B–æ–≤–∞—~O –°—~B—~@–æ–∫ |
+------+----------------------------------+
3 rows in set (0.01 sec)

I don't know what the trick is to cutting cyrillic from an email and inserting it into the db via xterm. I do see that mysqldump handles what I pasted. 

thanks!
[15 Oct 2005 4:28] Patrick Galbraith
So, looking at how you did it, perhaps I need to figure out how to type cyrillic. I've always used cut-n-paste, and don't know the key mappings.
[15 Oct 2005 4:59] Peter Zaitsev
Here is the hint if you want to avoid typing:

select unhex('D182D0B5D181D18220D182D0B5D181D18220D182D0BAD0B5D181D182');

Just make sure to do "set names utf8" and make sure your console is utf8
[15 Oct 2005 8:27] Sergei Golubchik
It's a duplicate of BUG#6012.

And in any case it's not a bug, the manual documents it as
====
As of MySQL 4.1, the character set indicated by the `character_set_database' system variable is used to interpret the information in the file. `SET NAMES' and the setting of `character_set_client' do not affect input interpretation.
Note that it's currently not possible to load `UCS2' data files.
====
[15 Oct 2005 20:29] Sergei Golubchik
Ok, Peter explained in email that the bug here is that mysqldump --tab/mysqlimport can result in incorrect data. So this backup strategy does not work anymore.

This is trivial to fix - mysqlimport should do

set @@character_set_database=binary;

before loading the data.
[15 Oct 2005 20:39] Sergei Golubchik
Test case:

create table t1 (a text character set utf8, b text character set latin1);
insert t1 values (0x4F736E616272C3BC636B, 0x4BF66C6E);

mysqldump --tab=. test
mysql < t1.sql
mysqlimport test t1.txt
[15 Oct 2005 21:03] Patrick Galbraith
I'll work on both of these bugs per your suggested fix.
[20 Oct 2005 0:38] Patrick Galbraith
I made a typo on the bug commit, and so I can't find my patch in lists.mysql.com, however:

ChangeSet
  1.2453 05/10/19 17:27:02 patg@krsna.patg.net +1 -0
  BUG# 12123 
  
  Made change to mysqlimport to set character_set_database to binary to make importing
  various charsets/columns work correctly.

  client/mysqlimport.c
    1.52 05/10/19 17:25:56 patg@krsna.patg.net +6 -0
    BUG# 12132
    Added 'set @@character_set_database=binary' to make loading of tables with mixed 
    charset types and non-latin characters load.

# This is a BitKeeper patch.  What follows are the unified diffs for the
# set of deltas contained in the patch.  The rest of the patch, the part
# that BitKeeper cares about, is below these diffs.
# User:	patg
# Host:	krsna.patg.net
# Root:	/home/patg/mysql-build/mysql-4.1

--- 1.51/client/mysqlimport.c	2005-09-07 15:49:57 -07:00
+++ 1.52/client/mysqlimport.c	2005-10-19 17:25:56 -07:00
@@ -304,6 +304,12 @@
       fprintf(stdout, "Loading data from SERVER file: %s into %s\n",
 	      hard_path, tablename);
   }
+  if (mysql_query(sock, "set @@character_set_database=binary;"))
+  {
+    db_error_with_table(sock, tablename);
+    DBUG_RETURN(1);
+  }
+
   sprintf(sql_statement, "LOAD DATA %s %s INFILE '%s'",
 	  opt_low_priority ? "LOW_PRIORITY" : "",
 	  opt_local_file ? "LOCAL" : "", hard_path);
[25 Oct 2005 21:51] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/internals/31473
[31 Oct 2005 16:23] Patrick Galbraith
fixed in 5.0.16 bk
[31 Oct 2005 19:46] Paul DuBois
Noted in 5.0.16 changelog.