MySQL Bugs: #1994: mysqldump does not correctly dump UCS2 data

Bug #1994	mysqldump does not correctly dump UCS2 data
Submitted:	2 Dec 2003 5:39	Modified:	9 Dec 2003 5:50
Reporter:	Simon Detheridge	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: mysqldump Command-line Client	Severity:	S3 (Non-critical)
Version:	4.1.1alpha from bk	OS:	Linux (linux (gentoo))
Assigned to:	Ramil Kalimullin	CPU Architecture:	Any

Description:
mysqldump does not correctly output ucs2 data. The output created by mysqldump 
is in latin1 (presumably because this is how I compiled mysql) but when I dump 
a table that is UCS2, any letters that are not in the latin1 charset are 
output as '?' characters. 
 
So, if I have a UCS2 table containing hebrew characters, in a latin1 
installation of mysql, I can't dump the data using mysqldump. 

How to repeat:
Create a database. (mysql is compiled with latin1 as default) 
Create a ucs2 table. 
Add a hebrew word (I used the hebrew for 'maybe', I think!) 
Quit mysql 
run mysqldump to dump the database 
pass the dumped file back to mysql 
the database now contains incorrect data. 
 
The following commands show what I did in detail: 
 
$ mysql testdb 
 
mysql> CREATE TABLE test (testid INT(11) NOT NULL AUTO_INCREMENT, testdata 
VARCHAR(255) NOT NULL, PRIMARY KEY (`testid`)) DEFAULT CHARSET ucs2 ; 
 
Query OK, 0 rows affected (0.02 sec) 
 
mysql> INSERT INTO test (testdata) VALUES (_ucs2 x'05D005D505DC05D9') ; 
 
Query OK, 1 row affected (0.15 sec) 
 
mysql> quit 
 
Bye 
 
$ mysqldump testdb --add-drop-table > ~/testdb_out.sql 
 
#the above file contains '????' as the data 
 
$ cat ~/testdb_out.sql | mysql testdb 
$ mysql testdb 
 
mysql> SELECT HEX(testdata) FROM test ; 
 
+------------------+ 
| HEX(testdata)    | 
+------------------+ 
| 003F003F003F003F | 
+------------------+ 
1 row in set (0.00 sec) 
 
Note the above is 003F003F003F003F (003f == '?') instead of what I entered, 
which was '05D005D505DC05D9' 

Suggested fix:
When outputting data in a single-byte (e.g. latin1) format, and the field is 
UCS2, use hex notation as I used in my example, instead of inserting broken 
characters into the output.

I followed exactly as noted in the bug and it demostrated the same behavior with a bk from Nov 30 2003.

This is not a bug. As soon as you have fields in different character sets,
you should use utf8 to dump, i.e.:

mysqldump --default-character-set=utf8 test

Please try and give feedback.

I've checked this and it works. Sorry... I did look for a default-charset 
option in mysqldump. Must have been blind. 
 
However, I do think that this could still be classified as a bug. Certainly, 
it would be 'better' behaviour of the tool to preserve the data by converting 
it to hex notation, seeing as this is possible. - Or at least output a warning 
when it comes across a unicode field in a non-compatible charset, so the user 
knows that data may be being lost. 
 
Otherwise, I can forsee a situation where an administrator for a number of 
databases, who isn't exactly sure what his users have put there, might get 
caught out by this and end up trashing someone's data.

Thank you for your bug report. This issue has been committed to our
source repository of that product and will be incorporated into the
next release.

If necessary, you can access the source repository and build the latest
available version, including the bugfix, yourself. More information 
about accessing the source trees is available at
    http://www.mysql.com/doc/en/Installing_source_tree.html

Additional info:

Now mysqldump's default charset is utf8. One can change it using --default-character-set option.