Bug #40201 Data truncation while converting from latin1 to binary to utf8
Submitted: 21 Oct 2008 10:23 Modified: 22 Oct 2008 15:30
Reporter: Rakesh Kumar Email Updates:
Status: Not a Bug Impact on me:
None 
Category:MySQL Server: Charsets Severity:S2 (Serious)
Version:5.0.45,5.0.51,5.1.23-ndb-6.2.15-log OS:Linux
Assigned to: CPU Architecture:Any
Tags: characterset issue, data truncation

[21 Oct 2008 10:23] Rakesh Kumar
Description:
I was working to correct the issue of incorrect charset used while data import. column charset was utf8, and application reads using 'utf8', but data was imported using latin1. Because of this, some junk characters were being displayed on screen.

To correct, I changed the column character set to latin1, then to binary and then re-converted to utf8.

This corrected the issue of Junk characters being displayed, but this caused the data to be truncated for some rows...

How to repeat:
Here are the steps to repeat that:

mysql> set names utf8;
Query OK, 0 rows affected (0.00 sec)

mysql> create schema rakesh;
Query OK, 1 row affected (0.00 sec)

mysql> use rakesh;
Database changed
mysql> CREATE TABLE `test` (
    -> `id` int(10) DEFAULT NULL,
    -> `name` text
    -> )ENGINE=InnoDB DEFAULT CHARSET=utf8;
Query OK, 0 rows affected (0.16 sec)

mysql> insert into test values (1,"Abc® Core™2 processor with XYZ™ technology");
Query OK, 1 row affected (0.02 sec)

mysql> select * from test;
+------+-------------------------------------------------+
| id   | name                                            |
+------+-------------------------------------------------+
|    1 | Abc® Core™2 processor with XYZ™ technology | 
+------+-------------------------------------------------+
1 row in set (0.00 sec)

mysql> alter table test modify name text charset 'latin1';
Query OK, 1 row affected (0.33 sec)
Records: 1  Duplicates: 0  Warnings: 0

--data is fine till there...

mysql> alter table test modify name blob;
Query OK, 1 row affected (0.32 sec)
Records: 1  Duplicates: 0  Warnings: 0

--data is fine till here...

mysql> alter table test modify name text charset 'utf8';
Query OK, 1 row affected, 1 warning (0.33 sec)
Records: 1  Duplicates: 0  Warnings: 0

-- data truncation happened here...

mysql> show warnings;
+---------+------+-------------------------------------------------------------------+
| Level   | Code | Message                                                           |
+---------+------+-------------------------------------------------------------------+
| Warning | 1366 | Incorrect string value: '\xAE Core...' for column 'name' at row 1 | 
+---------+------+-------------------------------------------------------------------+
1 row in set (0.00 sec)

mysql> select * from test;
+------+------+
| id   | name |
+------+------+
|    1 | Abc  | 
+------+------+
1 row in set (0.00 sec)
[21 Oct 2008 18:16] Sveta Smirnova
Thank you for taking the time to write to us, but this is not a bug. Please double-check the documentation available at http://dev.mysql.com/doc/ and the instructions on
how to report a bug at http://bugs.mysql.com/how-to-report.php

BLOB fields don't have character set information, so MySQL has no knowledge which character set is used in the name column. You can see it if run SELECT query while name field is BLOB:

CREATE TABLE `test` (
`id` int(10) DEFAULT NULL,
`name` text
)ENGINE=InnoDB DEFAULT CHARSET=utf8;
insert into test values (1,"Abc® Core™2 processor with XYZ™ technology");
select * from test;
id      name
1       Abc® Core™2 processor with XYZ™ technology
alter table test modify name text charset 'latin1';
select * from test;
id      name
1       Abc® Core™2 processor with XYZ™ technology
alter table test modify name blob;
select * from test;
id      name
1       Abc# Core#2 processor with XYZ# technology
select id, convert(name using latin1) from test;
id      convert(name using latin1)
1       Abc® Core™2 processor with XYZ™ technology
alter table test modify name text charset 'utf8';
Warnings:
Warning 1366    Incorrect string value: '\xAE Core...' for column 'name' at row 1
select * from test;
id      name
1       Abc
[22 Oct 2008 10:21] Rakesh Kumar
Thanks for the response. I already went through the documentation...

However, this indeed looks like a bug.

Your response talks about display part, whereas, I am talking about physical data truncation in table.

Conversion from "blob" to "text charset 'utf8'" truncates data. 
Conversion from "blob" to "text charset 'latin1'" works just fine...

In the same step to reproduce:

mysql> alter table test modify name blob;
mysql> alter table test modify name text charset 'latin1'; -- This works
mysql> alter table test modify name blob;
mysql> alter table test modify name text charset 'utf8'; --This does not work and truncates data
mysql> show warnings;

+---------+------+-------------------------------------------------------------------+
| Level   | Code | Message                                                           |
+---------+------+-------------------------------------------------------------------+
| Warning | 1366 | Incorrect string value: '\xAE Core...' for column 'name' at row 1 | 
+---------+------+-------------------------------------------------------------------+
1 row in set (0.00 sec)

Looking at the warning...\xAE is a valid character in both latin1 and utf8 character set.

I understand that binary column does not have any character set information, so the conversion may not be successful in case of characters displayed using higher bit (ascii value > 127) being present in the original content. But why would the data get truncated?

Thanks.
[22 Oct 2008 11:09] Rakesh Kumar
To add further...when I alter the table to use "memory" storage engine and use varchar/varbinary instead of text/blob, the same conversion works just fine...

Looks like the issue is only with Disc based tables (I tried with myisam and innodb...).

Thanks
[22 Oct 2008 11:11] Rakesh Kumar
Looks like the issue was closed prematurely...re-opening it.

Thanks.
[22 Oct 2008 11:46] Rakesh Kumar
My apologies...the issue comes on memory table also...please ignore my comment on issue not coming on memory tables and only on disc  based tables.
[22 Oct 2008 15:30] Sveta Smirnova
Thank you for the feedback.

Please look at the output:

alter table test modify name blob;
select * from test;
id      name
1       Abc# Core#2 processor with XYZ# technology

You can see field name is not   "Abc® Core™2 processor with XYZ™ technology", but contains garbage characters. These characters are rejected when you convert this field to utf8.

So I close the report as "Not a Bug" again.