Bug #26711 Binary content 0x00 sometimes becomes 0x5C 0x00 after dump/load
Submitted: 28 Feb 2007 5:40 Modified: 30 Jul 2007 10:44
Reporter: Yuan WANG Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Charsets Severity:S2 (Serious)
Version:5.0.30, 4.1, 5.1, 5.2-falcon OS:Linux (linux)
Assigned to: Alexander Barkov CPU Architecture:Any

[28 Feb 2007 5:40] Yuan WANG
Description:
If some bytes in a Blob is 0x00, then if you dump its content and load into the database again, these 0x00 bytes will becomes two bytes 0x5C and 0x00 (literal '\0') if its preceeding byte is greater than 0x80.

This problem only occurs when the charset of database is set to gbk, if it is latin1, there will be no problem. However then I can not import Chinese content.

How to repeat:
CREATE TABLE `test_blob` (
  `a` blob
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

insert into test_blob values(0xEE00);

Then use mysqldump to dump the content of test_blob into a text file using the -T option, and load this file into test_blob again using 'LOAD DATA INFILE' command, the content of test_blob will be 0xEE5C30.
[28 Feb 2007 9:01] Sveta Smirnova
Please do not submit the same bug more than once. An existing bug report already describes this very problem. Even if you feel that your issue is somewhat different, the resolution is likely
to be the same. Because of this, we hope you add your comments to the original bug instead.

Thank you for your interest in MySQL.

Duplicate of Bug #23619
Will be available in next release.
[5 Mar 2007 10:59] Sveta Smirnova
test case

Attachment: bug23619.test (application/octet-stream, text), 395 bytes.

[5 Mar 2007 11:01] Sveta Smirnova
Thank you for the additional information.

Verified using attached test case.
[12 Mar 2007 10:31] Alexander Barkov
This is a long known problem, the source of it is that
the REVERSE SOLIDUS "\" (0x5C), which is used in
MySQL as an escape character, can also be a part
of a multibyte sequence, like 0xEE5C, which encodes
a Chinese ideogram in GBK.

When loading a dump, "mysql" misinterprets EE5C as
as Chinese character, then loads "0" as a DIGIT ZERO 0x30,
instead of the null character 0x00.

There's a --hex-blob parameter which makes "mysqldump"
use HEX encoding for BINARY, VARBINARY and BLOB datatypes.
Please try "mysqldump --hex-blob" and give feedback if
it works for you this way. This parameter is off by default.

But anyway, although the --hex-blob workaround exists,
this is still as a bug. "mysqldump" side must be fixed 
to produce a dump in a safe way, which should further be understood
by "mysql" without misinterpretring the REVERS SOLIDUS character,
even if it occasionally appears after a "multibyte head" byte,
like 0xEE is.
[12 Mar 2007 10:56] Yuan WANG
--hex-blob doesn't work when the -T option is used, i.e. the data is dumped into a data file, not INSERT statements in SQL.
[12 Mar 2007 11:17] Alexander Barkov
Does it work with --hex-blob, without -T ?
[13 Mar 2007 5:57] Yuan WANG
If --hex-blob is used and -T is not used, then no problems.
[9 Apr 2007 9:03] Yuan WANG
Any plan for this bug?
[4 Jun 2007 13:06] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/28057

ChangeSet@1.2513, 2007-06-04 18:06:39+05:00, bar@mysql.com +6 -0
  Bug#26711 "Binary content 0x00 sometimes becomes 0x5C 0x00 after dump/load"
  Problem: "SELECT INTO OUTFILE" created incorrect dumps for BLOBs,
  so "LOAD DATA" later incorrectly interpreted 0x5C as the second
  byte of a multi-byte sequence, instead of escape character.
  Fix: adding escaping of multi-byte heads.
[4 Jun 2007 13:12] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/28058

ChangeSet@1.2513, 2007-06-04 18:12:28+05:00, bar@mysql.com +6 -0
  Bug#26711 "Binary content 0x00 sometimes becomes 0x5C 0x00 after dump/load"
  Problem: "SELECT INTO OUTFILE" created incorrect dumps for BLOBs,
  so "LOAD DATA" later incorrectly interpreted 0x5C as the second
  byte of a multi-byte sequence, instead of escape character.
  Fix: adding escaping of multi-byte heads.
[7 Jun 2007 7:06] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/28257

ChangeSet@1.2513, 2007-06-07 12:05:43+05:00, bar@mysql.com +6 -0
  Bug#26711 "Binary content 0x00 sometimes becomes 0x5C 0x00 after dump/load"
  Problem: "SELECT INTO OUTFILE" created incorrect dumps for BLOBs,
  so "LOAD DATA" later incorrectly interpreted 0x5C as the second
  byte of a multi-byte sequence, instead of escape character.
  Fix: adding escaping of multi-byte heads.
[7 Jun 2007 8:17] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/28268

ChangeSet@1.2513, 2007-06-07 13:16:49+05:00, bar@mysql.com +6 -0
  Bug#26711 "Binary content 0x00 sometimes becomes 0x5C 0x00 after dump/load"
  Problem: "SELECT INTO OUTFILE" created incorrect dumps for BLOBs,
  so "LOAD DATA" later incorrectly interpreted 0x5C as the second
  byte of a multi-byte sequence, instead of escape character.
  Fix: adding escaping of multi-byte heads.
[7 Jun 2007 9:04] Alexander Barkov
Pushed into 5.0.44-rpl
Pushed into 5.1.20-rpl
[21 Jun 2007 20:11] Bugs System
Pushed into 5.0.46
[21 Jun 2007 20:15] Bugs System
Pushed into 5.1.20-beta
[23 Jun 2007 8:14] Jon Stephens
Thank you for your bug report. This issue has been committed to our source repository of that product and will be incorporated into the next release.

If necessary, you can access the source repository and build the latest available version, including the bug fix. More information about accessing the source trees is available at

    http://dev.mysql.com/doc/en/installing-source.html

Bugfix documented as follows in the 5.0.46 and 5.1.20 changelogs:

          Binary content <literal>0x00</literal> in a
          <literal>BLOB</literal> column sometimes became <literal>0x5C
          0x00</literal> following a dump and reload, which could cause
          problems with data using multi-byte character sets such as
          <literal>GBK</literal> (Chinese). This was due to a problem
          with <literal>SELECT INTO OUTFILE</literal> whereby
          <literal>LOAD DATA</literal> later incorrectly interpreted
          <literal>0x5C</literal> as the second byte of a multi-byte
          sequence rather than as the <literal>SOLIDUS</literal>
          (<quote>\</quote>) character, used by MySQL as the escape
          character. (Bug #26711)
[30 Jul 2007 7:34] Sveta Smirnova
Bug is still repeatable with current versions.
[30 Jul 2007 7:35] Sveta Smirnova
better test case

Attachment: bug26711.test (application/octet-stream, text), 381 bytes.

[30 Jul 2007 7:35] Sveta Smirnova
Wrong result:

drop table if exists t1;
set character_set_database=gbk;
show variables like 'char%';
Variable_name   Value
character_set_client    latin1
character_set_connection        latin1
character_set_database  gbk
character_set_filesystem        binary
character_set_results   latin1
character_set_server    latin1
character_set_system    utf8
character_sets_dir      /users/ssmirnova/src/mysql-5.0/sql/share/charsets/
create table `t1` (
`a` blob
) engine=myisam default charset=latin1;
insert into t1 values(0xee00);
select hex(a) from t1;
hex(a)
EE00
truncate table t1;
load data local infile 'var/t1.txt' into table t1;
select hex(a) from t1;
hex(a)
EE5C30
[30 Jul 2007 10:43] Alexander Barkov
Sveta, this wrong result is expected where character set is latin1.

You must fix the test to use GBK with mysqldump to get correct results:

--exec $MYSQL_DUMP --default-character-set=gbk -T var test
[31 Jul 2007 13:26] Alexander Barkov
> I am a bit confused why character set affects BLOB field. Especcially
> if I don't use option -T with mysqldump no need to specify
> default-character-set.

Because MySQL "LOAD DATA INFILE" processor considers the entire file as having
a single character set.
It always un-escapes all values before inserting them into fields according to
this character set, and field types do no matter.

So, if you need to load a file into GBK database,
then you must make sure that this file is GBK-compatible.
I.e. when running "mysqldump -T" you must specify --default-character-set=gbk.

> May be note in documentation needed?

Yes, I agree, it should be noted.
[31 Jul 2007 15:49] Paul DuBois
I have added this to the LOAD DATA INFILE section:

LOAD DATA INFILE interprets all fields in the file as having the same
character set, regardless of the data types of the columns into which
field values are loaded. For proper interpretation of file contents,
you must ensure that it was written with the correct character set.
For example, if you write a data file with mysqldump -T or by issuing
a SELECT ... INTO OUTFILE statement in mysql, be sure to use a
--default-character-set option with mysqldump or mysql so that output
is written in the character set to be used when the file is loaded
with LOAD DATA INFILE.
[5 Dec 2007 15:43] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/39323

ChangeSet@1.2651, 2007-12-05 16:43:44+01:00, msvensson@pilot.mysql.com +1 -0
  Remove disabling of ctype_big5, bug#26711 has nothing to do with it
[26 Mar 2008 19:00] Bugs System
Pushed into 6.0.5-alpha
[31 Mar 2008 7:56] Bugs System
Pushed into 5.1.24-rc