Bug #22719 Dataloss during conversion from varbinary to utf8 varchar
Submitted: 27 Sep 2006 0:09 Modified: 1 Nov 2006 17:36
Reporter: Max Kanat-Alexander Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Documentation Severity:S2 (Serious)
Version:4.1.20 OS:Linux (Linux (RHEL4))
Assigned to: Paul DuBois CPU Architecture:Any

[27 Sep 2006 0:09] Max Kanat-Alexander
Description:
I have a varbinary(255) that contains ISO-8859-7 (Greek) encoded bytes. It also contains some ASCII bytes.

When I convert it to utf8, MySQL actually *deletes* the ISO-8859-7 bytes. The ASCII bytes are still there.

I expected it to keep the bytes, even if they were invalid UTF-8, since that's what the MySQL docs say will happen.

How to repeat:
CREATE TABLE test_table ( test_column varbinary(64) );

--The following are the valid greek characters: ΐΑΒ in iso-8859-7
INSERT INTO test_table (test_column) VALUES (CONCAT('ISO-8859-7: ', CHAR(0xC0), CHAR(0xC1), CHAR(0xC2)));

SELECT * FROM test_table;

ALTER TABLE test_table CHANGE COLUMN test_column test_column varchar(64) CHARACTER SET utf8;

SELECT * FROM test_table;

Note that the Greek bytes were *deleted*.
[28 Sep 2006 7:58] Sveta Smirnova
Thank you for the report.

I think there is no way to indicate if symbols should be greek or not if type of the column is binary. But it is not properly documentd.

There is workaround:

ALTER TABLE test_table CHANGE COLUMN test_column test_column varchar(64)
CHARACTER SET greek;

ALTER TABLE test_table CHANGE COLUMN test_column test_column varchar(64)
CHARACTER SET utf8;
[28 Sep 2006 9:18] Max Kanat-Alexander
Okay. The only problem is that I'm converting a database where the columns have various encodings in the varbinary, not just iso-8859-7. So there's no way to maintain that data and still convert to UTF-8? I don't know the encodings of each value, so I just wanted to preserve them as binary bytes.
[1 Nov 2006 17:34] Paul DuBois
If a column contains data in a mix of encodings, you cannot convert
it properly. MySQL cannot tell which encodings are used for which
parts of the data.
[1 Nov 2006 17:36] Paul DuBois
Thank you for your bug report. This issue has been addressed in the documentation. The updated documentation will appear on our website shortly, and will be included in the next release of the relevant products.

I am adding a section on charset to conversion to the manual. It
will appear here:

http://dev.mysql.com/doc/refman/5.0/en/charset-conversion.html

The section will say:

To convert a binary or non-binary string column to use a particular
character set, use ALTER TABLE. For successful conversion to occur,
one of the following conditions must apply:

- If the column has a binary data type (BINARY, VARBINARY, BLOB),
all the data it contains must be encoded using a single character
set (the character set you're converting the column to).  If you
use a binary column to store information in multiple character sets,
MySQL has no way to know which values use which character set and
cannot convert the data properly.

- If the column has a non-binary data type (CHAR, VARCHAR, TEXT),
its contents should be encoded in the column's character set, not
some other character set.  If the contents are encoded in a different
character set, you can convert the column to use a binary data type
first, and then to a non-binary column with the desired character
set.

Suppose that a table t has a binary column named col1 defined as
BINARY(50).  Assuming that the information in the column is encoded 
using a single character set, you can convert it to a non-binary
column that has that character set. For example, if col1 contains
binary data representing characters in the greek character set, you
can convert it as follows:

ALTER TABLE t MODIFY col1 CHAR(50) CHARACTER SET greek;

Suppose that table t has a non-binary column named col1 defined as
CHAR(50) CHARACTER SET latin1 but you want to convert it to use
utf8 so that you can store values from many languages. The following
statement accomplishes this:

ALTER TABLE t MODIFY col1 CHAR(50) CHARACTER SET utf8;

Conversion may be lossy if the column contains characters that are
not in both character sets.

A special case occurs if if you have old tables from MySQL 4.0 or
earlier where a non-binary column contains values that actually are
encoded in a character set different from the server's default
character set.  For example, an application might have stored utf8
values in a column, even though MySQL's default character set was
latin1.  It is possible to convert the column to use the proper 
character set but an additional step is required.  Suppose that the
server's default character set was latin1 and col1 is defined as
CHAR(50) but its contents are sjis values.  The first step is to
convert the column to a binary data type, which removes the existing
character set information without performing any character conversion:

ALTER TABLE t MODIFY col1 BINARY(50);

The next step is to convert the column to a non-binary data type with
the proper character set:

ALTER TABLE t MODIFY col1 CHAR(50) CHARACTER SET sjis;

This procedure requires that the table not have been modified already
with statements such as INSERT or UPDATE after an upgrade to MySQL
4.1 or later. In that case, MySQL would store new values in the
column using latin1, and the column will contain a mix of sjis and
latin1 values and cannot be converted properly.
[27 Jan 2007 23:38] Andy Staudacher
We are deploying a software upgrade and need to detect which of the two (or three) cases we deal with such that we choose the correct upgrade path.

Case A: "old tables from MySQL 4.0 or earlier where a non-binary column contains values that actually are encoded in a character set different from the server's default character set."
-> convert VARCHAR to VARBINARY -> convert to VARCHAR UTF8

Case B: "same but MySQL 4.1 or later"
-> convert VARCHAR to VARCHAR UTF8 directly

Case C: "starting with case A but there have been INSERTs/UPDATEs after the MySQL upgrade to 4.1"
-> what to do here to minimize data loss?

We tried to store UTF8 in the database even in pre MySQL 4.1 databases. What we do know during a software upgrade is converting the db, tables and columns to UTF8 applying the method from case A in all cases. But this leads to truncation in some cases (probably for case B) and that's why we need to improve the upgrade path.

Questions:
1. How can we detect the 3 cases?
We don't know the history of the specific installations since all we do is offer the software. The software should auto-detect the scenario and run the appropriate upgrade code.

2. Do you have any advice for case C?

Thank you very much!