MySQL Bugs: #77105: Lost data after ALTER TABLE with CHARACTER SET utf8

Bug #77105	Lost data after ALTER TABLE with CHARACTER SET utf8
Submitted:	20 May 2015 13:37	Modified:	6 Apr 2018 8:11
Reporter:	Vlad Safronov	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server	Severity:	S3 (Non-critical)
Version:	5.6.22	OS:	Linux
Assigned to:		CPU Architecture:	Any
Tags:	utf8

Description:
Hi, 

I tried to convert utf8 characters saved in latin1 table using advice from https://dev.mysql.com/doc/refman/5.6/en/alter-table.html and lost data on conversion. 

How to repeat:
See this test case. 

mysql> create table test1 (t1 text) ENGINE=MyISAM DEFAULT CHARSET=latin1;  
Query OK, 0 rows affected (0.00 sec)    

mysql> insert into test1 values('characters ‘’↦ got garbled');
Query OK, 1 row affected, 1 warning (0.00 sec)

mysql> select * from test1;
+--------------------------------+
| t1                             |
+--------------------------------+  
| characters ‘’? got garbled     |
+--------------------------------+
1 row in set (0.00 sec)
  
mysql> ALTER TABLE test1 CHANGE t1 t1 BLOB;
Query OK, 1 row affected (0.00 sec)
Records: 1  Duplicates: 0  Warnings: 0

mysql> ALTER TABLE test1 CHANGE t1 t1 TEXT CHARACTER SET utf8;
Query OK, 1 row affected, 1 warning (0.01 sec)
Records: 1  Duplicates: 0  Warnings: 1

mysql> select * from test1;
+-------------+
| t1          |
+-------------+
| characters  |
+-------------+
1 row in set (0.01 sec)

mysql>

mysql> show warnings;
+---------+------+--------------------------------------------------------------------+
| Level   | Code | Message                                                            |
+---------+------+--------------------------------------------------------------------+
| Warning | 1366 | Incorrect string value: '\x91\x92? go...' for column 't1' at row 1 |
+---------+------+--------------------------------------------------------------------+
1 row in set (0.00 sec)

First of all, I believe that what MySQL calls latin1 is not what the rest of the world calls latin1 (ISO Latin 1 aka ISO 8859-1). I think that it actually tries to mimic Microsoft Code Page 1252, which Microsoft also calls "Ansi", as opposed to "OEM" which would be known as Code Page 437, the original ROM character set of the IBM PC 5150 and derivatives.

I believe that you were trying to insert the following UTF-8 encoded byte sequences:

0xe28098 U+2018 LEFT SINGLE QUOTATION MARK
0xe28099 U+2019 RIGHT SINGLE QUOTATION MARK
0xe286a6 U+21A6 RIGHTWARDS ARROW FROM BAR

The byte range 0x80 through 0x9f is reserved for control characters in ISO 8859-1. Only the bytes 0xe2 and 0xa6 would be valid ISO Latin 1:

U+00E2 LATIN SMALL LETTER A WITH CIRCUMFLEX
U+00A6 BROKEN BAR

Apparently, the MySQL "latin1" is not exactly the same what GNU Recode thinks is cp1252. GNU Recode namely maps each of the bytes that would be invalid in ISO 8859-1, as follows:

0x80 U+20AC EURO SIGN
0x98 U+02DC SMALL TILDE
0x99 U+2122 TRADE MARK SIGN
0x86 U+2020 DAGGER

This looks reasonable to me, based on what I remember from using Windows.

Did some bytes get garbled already when the data was inserted?
Can you try the following right after the INSERT, to see the raw bytes:

SELECT HEX(t1) FROM test1;

yes, it did:

mysql> insert into test1 values('‘’↦');
Query OK, 1 row affected, 1 warning (0.02 sec)

mysql> select hex(t1) from test1;
+---------+
| hex(t1) |
+---------+
| 91923F  |
+---------+
1 row in set (0.01 sec)

mysql> select * from test1;
+---------+
| t1      |
+---------+
| ‘’?     |
+---------+

unlike when insertion was made into table with utf8 charset:

mysql> insert into test1 values('‘’↦');
Query OK, 1 row affected (0.00 sec)

mysql> select hex(t1) from test1;                                  
+------------------------------------------------------------------+
| hex(t1)                                                          |
+------------------------------------------------------------------+
..
| E28098E28099E286A6                                               |
+------------------------------------------------------------------+
3 rows in set (0.00 sec)

Vlad...  There is no one-size-fits-all for messed up characters.

First, there are about 6 "best practices" to doing it right in the first place, as enumerated here:  http://stackoverflow.com/a/38363567/1766831 .  You have only partially stated whether you followed those practices.

Second, there are at least 4 visible symptoms for 5 underlying problems, as enumerated here:  http://stackoverflow.com/questions/38363566/trouble-with-utf8-characters-what-i-see-is-not...  You got a question mark without truncation, so you have narrowed it down to the first in that list.

Third, there are 5 possible ways to fix the data.  You tried one and it failed, probably further corrupting the data.  See http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases .

I agree with Vlad that the documentation is skimpy and incomplete.  The 3 links above are may latest attempt at rectifying such.  I would be happy to help Oracle fix the documentation.  You are welcome to adapt what I have written.

No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".

Yes, please close this report. I missed the email notice.