Bug #1994 mysqldump does not correctly dump UCS2 data
Submitted: 2 Dec 2003 5:39 Modified: 9 Dec 2003 5:50
Reporter: Simon Detheridge Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: mysqldump Command-line Client Severity:S3 (Non-critical)
Version:4.1.1alpha from bk OS:Linux (linux (gentoo))
Assigned to: Ramil Kalimullin CPU Architecture:Any

[2 Dec 2003 5:39] Simon Detheridge
Description:
mysqldump does not correctly output ucs2 data. The output created by mysqldump 
is in latin1 (presumably because this is how I compiled mysql) but when I dump 
a table that is UCS2, any letters that are not in the latin1 charset are 
output as '?' characters. 
 
So, if I have a UCS2 table containing hebrew characters, in a latin1 
installation of mysql, I can't dump the data using mysqldump. 

How to repeat:
Create a database. (mysql is compiled with latin1 as default) 
Create a ucs2 table. 
Add a hebrew word (I used the hebrew for 'maybe', I think!) 
Quit mysql 
run mysqldump to dump the database 
pass the dumped file back to mysql 
the database now contains incorrect data. 
 
The following commands show what I did in detail: 
 
$ mysql testdb 
 
mysql> CREATE TABLE test (testid INT(11) NOT NULL AUTO_INCREMENT, testdata 
VARCHAR(255) NOT NULL, PRIMARY KEY (`testid`)) DEFAULT CHARSET ucs2 ; 
 
Query OK, 0 rows affected (0.02 sec) 
 
mysql> INSERT INTO test (testdata) VALUES (_ucs2 x'05D005D505DC05D9') ; 
 
Query OK, 1 row affected (0.15 sec) 
 
mysql> quit 
 
Bye 
 
$ mysqldump testdb --add-drop-table > ~/testdb_out.sql 
 
#the above file contains '????' as the data 
 
$ cat ~/testdb_out.sql | mysql testdb 
$ mysql testdb 
 
mysql> SELECT HEX(testdata) FROM test ; 
 
+------------------+ 
| HEX(testdata)    | 
+------------------+ 
| 003F003F003F003F | 
+------------------+ 
1 row in set (0.00 sec) 
 
Note the above is 003F003F003F003F (003f == '?') instead of what I entered, 
which was '05D005D505DC05D9' 

Suggested fix:
When outputting data in a single-byte (e.g. latin1) format, and the field is 
UCS2, use hex notation as I used in my example, instead of inserting broken 
characters into the output.
[2 Dec 2003 13:53] Boyd Gerber
I followed exactly as noted in the bug and it demostrated the same behavior with a bk from Nov 30 2003.
[3 Dec 2003 2:27] Alexander Barkov
This is not a bug. As soon as you have fields in different character sets,
you should use utf8 to dump, i.e.:

mysqldump --default-character-set=utf8 test

Please try and give feedback.
[3 Dec 2003 3:08] Simon Detheridge
I've checked this and it works. Sorry... I did look for a default-charset 
option in mysqldump. Must have been blind. 
 
However, I do think that this could still be classified as a bug. Certainly, 
it would be 'better' behaviour of the tool to preserve the data by converting 
it to hex notation, seeing as this is possible. - Or at least output a warning 
when it comes across a unicode field in a non-compatible charset, so the user 
knows that data may be being lost. 
 
Otherwise, I can forsee a situation where an administrator for a number of 
databases, who isn't exactly sure what his users have put there, might get 
caught out by this and end up trashing someone's data.
[9 Dec 2003 5:50] Ramil Kalimullin
Thank you for your bug report. This issue has been committed to our
source repository of that product and will be incorporated into the
next release.

If necessary, you can access the source repository and build the latest
available version, including the bugfix, yourself. More information 
about accessing the source trees is available at
    http://www.mysql.com/doc/en/Installing_source_tree.html

Additional info:

Now mysqldump's default charset is utf8. One can change it using --default-character-set option.