Bug #8199 mysqldump dumps mismached utf-8 characters
Submitted: 30 Jan 2005 0:38 Modified: 31 Jan 2005 8:55
Reporter: [ name withheld ] Email Updates:
Status: Not a Bug Impact on me:
None 
Category:MySQL Server: mysqldump Command-line Client Severity:S2 (Serious)
Version:4.1.7-4.1.8a OS:Linux (GNU/Linux)
Assigned to: Sergei Golubchik CPU Architecture:Any

[30 Jan 2005 0:38] [ name withheld ]
Description:
hello,

i found a output mismatch when i use mysqldump on utf8 tables. it might
be my stupidness and ignorance, but i couldn't find any hint how to
get the right results yet. i even checked the bug tracking database.

the following script does not meant to be smart, clever or beauty. it simply
sets up an dbi connect to an debian v4.1.8a mysql server. it writes the input
value of an cgi-form into a file /tmp/t and inserts a row into the table t.

How to repeat:
#!/usr/bin/perl
#
# CREATE TABLE `t` (
#   `h` varchar(64) default NULL
# ) ENGINE=MyISAM DEFAULT CHARSET=utf8;

use CGI;
use DBI;

CGI::ReadParse();
print "Content-type: text/html\n\n";

$db = DBI->connect("DBI:mysql:t","www-data","********");

print "<form method=\"post\"><input name=\"h\"></form>";

if($in{'h'}) {
   $st=$db->prepare("insert into t values ('$in{'h'}')");
   $st->execute();
   open(F,"> /tmp/t");
   print F "insert into t values ('$in{'h'}')\n";
   close(F);
   print "written.\n";
}

now, i dump you my analysis session. what are the commands?
215: shows the written file /tmp/t
216: shows a hex / ascii dump of file /tmp/t
217: opens a mysql session, which selects the content of table t
218: makes an mysql dump on screen and in file /tmp/F
219: hex / ascii dump of file /tmp/F

{215} thomax@borrowed ~ > cat /tmp/t
insert into t values ('\xc3\xbcbel')
{216} thomax@borrowed ~ > od -t x1 -a /tmp/t
                                   :
0000020 6c 75 65 73 20 28 27 c3 bc 62 65 6c 27 29 0a
          l   u   e   s  sp   (   '   C   <   b   e   l   '   )  nl
                                   :
{217} thomax@borrowed ~ > mysql -u www-data -p t
mysql> select * from t;
+-------+
| h     |
+-------+
| \xc3\xbcbel |
+-------+
1 row in set (0.00 sec)

mysql> \q
Bye
{218} thomax@borrowed ~ > mysqldump -u www-data -p t | tee /tmp/F
Enter password:
                                   :
INSERT INTO `t` VALUES ('\xc3\x83\xc2\xbcbel');
                                   :

{219} thomax@borrowed ~ > od -t x1 -a /tmp/F
                                   :
0001560 20 60 74 60 20 56 41 4c 55 45 53 20 28 27 c3 83
         sp   `   t   `  sp   V   A   L   U   E   S  sp   (   '   C etx
0001600 c2 bc 62 65 6c 27 29 3b 0a 55 4e 4c 4f 43 4b 20
          B   <   b   e   l   '   )   ;  nl   U   N   L   O   C   K  sp
                                   :
{220} thomax@borrowed ~ >

as you can see, if you follow the session: this 'u dieresis' or '&uuml'
changed from 0xc3,0xbc to 0xc3,0x83,0xc2,0xbc.

but it's not only the single character, it seems to me as a misunderstood
between mysql and mysqldump or between me and the documentation. please
let me know.
[30 Jan 2005 12:40] Aleksey Kishkin
Was not be able to reproduce it. 

I attached ("Files" section) a small perl script that I used for testing. Could you please to test it on your computer?
[30 Jan 2005 12:41] Aleksey Kishkin
test case. Actually I used sv_SE.utf8 locale when tested it.

Attachment: bug8199.pl (application/octet-stream, text), 309 bytes.

[30 Jan 2005 15:01] [ name withheld ]
i tried the script and here my analysis session:

{232} root@borrowed /tmp # mysqldump t | od -t x1 -a
0001560 60 74 60 20 56 41 4c 55 45 53 20 28 27 c3 85 6e
          `   t   `  sp   V   A   L   U   E   S  sp   (   '   C enq   n
0001600 67 73 74 72 c3 b6 6d 27 29 3b 0a 55 4e 4c 4f 43
          g   s   t   r   C   6   m   '   )   ;  nl   U   N   L   O   C

which is correct. but:

{233} root@borrowed /tmp # mysql -e "select * from t" t | od -t x1 -a
0000000 68 0a c5 6e 67 73 74 72 f6 6d 0a
          h  nl   E   n   g   s   t   r   v   m  nl

which is not correct.

i reversed your script to read things out from table t:

#!/usr/bin/perl

use utf8;
use open ":utf8";
use DBI;

$db = DBI->connect("dbi:mysql:test","root","") or die "Cannot connect";
$st = $db->prepare("SELECT * FROM t");
$st->execute();
while(@row=$st->fetchrow()) {
        print "$row[0]\n";
}
$st->finish();

and executed it:

{242} root@borrowed /tmp # perl n | od -t x1 -a
0000000 c5 6e 67 73 74 72 f6 6d 0a
          E   n   g   s   t   r   v   m  nl

i have no idea what is, 0xc5,0x6e but i know the
translation of latin-1 ascii 0xc5 (Å) into utf-8
should be 0xc3,0x85.

in any case this is a serious problem. utf-8 handling
must be consistent.

thomax
[31 Jan 2005 8:55] Sergei Golubchik
Thank you for taking the time to write to us, but this is not
a bug. Please double-check the documentation available at
http://www.mysql.com/documentation/ and the instructions on
how to report a bug at http://bugs.mysql.com/how-to-report.php

Additional info:

This is expected behaviour.

Your client's character set is latin1 (default), so when you insert 0xc3,0xbc, MySQL thinks it's two characters in latin1 charset, and converts them to utf8 to store in the table.

Use  SET NAMES utf8
before connecting (or see the manual, as there are many ways to make your script working)