Bug #11212 Bugs in the conversion-into-utf8 process
Submitted: 9 Jun 2005 16:11   Modified: 1 Dec 2005 8:54
Reporter: Jose Alejandro Guizar
Status: Closed   Impact on me: None
Category: MySQL Server   Severity: S4 (Feature request)
Version: 4.1.x   OS: Linux (Debian Linux 3.0)
Assigned to: Alexander Barkov   CPU Architecture: Any

[9 Jun 2005 16:11] Jose Alejandro Guizar
Description:
There have been several bug reports about the way mysql corrupts data entered into a utf8 table, but only for *some* characters, not all. Most of these reports have been classified as not-a-bug, since there's a workaround: manipulating the character_set_* variables. But it's only a workaround, and there really is a bug. Most people run into it because of the terribly understated importance of the character_set_* variables and of client-server charset dynamics; mostly, that the settings you set are not always respected.
So then, what's the point of having the character_set_* variables if the client
*always* gets to override them? They'd be better off being upgraded to
forced_character_set_* variables, so the server doesn't get coerced into doing
something the users couldn't have seen coming (at least not before they
put 12-20 man-hours of research into it).
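
(If you want to see what your session actually ended up with, the server will show you; this is standard syntax, though the values you get depend entirely on your own setup:)

 -- lists character_set_client, character_set_connection,
 -- character_set_results, character_set_server and friends
 show variables like 'character_set%';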

At least we have a workaround with SET NAMES, so we can force the connection
charset client-side; that matters because most users will be connecting and
making their queries with something other than the standard client (which, by
the looks of it during this conversation, doesn't always play nice either),
like perl DBI, PHP or what-have-you.

So with said workaround we can go back to doing our development, but that still
leaves the BIG-ASS-ELEPHANT-SIZED bug within mysql's internal encoding
conversion engine (or whatever is in charge of doing the conversions between
charsets). The point of having a utf-8 table (or DB) is that you can store data
in it from different languages and different legacy charsets (cp1250, latin1,
Shift-JIS, Greek, Czech, Cyrillic, whatever), right? So, SUPPOSEDLY, mysql can
convert from any of those other charsets into utf-8 right before it puts said
data into the table or DB (granted, of course, that *you* tell it what charset
the data you are sending is encoded in). And it sounds perfect, but that's
where the bugs come in.
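
(Aside: besides the per-connection settings, there's also the charset introducer syntax for tagging an individual literal; a minimal sketch, where some_utf8_table is just a placeholder name:)

 -- the _latin1 introducer marks this one literal as latin1 data,
 -- independently of the connection charset
 insert into some_utf8_table values (_latin1 0xC1);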

How to repeat:
A simple example. Create a table (I have 4.1.11-4 on Debian, but I think
all 4.1.x have it, maybe earlier versions too):

 create table utf ( data varchar(10) ) charset=utf8;

and then (using the regular client) we tell the server we will be sending in
latin1 data (so it knows what to convert from on its way into utf8):

 set names 'latin1';
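
(As far as I can tell, SET NAMES is just shorthand for setting three session variables at once, so the above should be roughly equivalent to:)

 set character_set_client = latin1;
 set character_set_connection = latin1;
 set character_set_results = latin1;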

now, just for reference, we insert a couple of simple chars:

 insert into utf values(char(0x5b));

 insert into utf values(char(0x7f));

and check what actually got into the db:

 select data,hex(data) from utf;

+------+-----------+
| data | hex(data) |
+------+-----------+
| [    | 5B        |
|      | 7F        |
+------+-----------+

OK, that was flawless, since those codepoints are in the ASCII range and are
encoded identically in latin1 and utf8.
Now let's try some chars where the server will actually have to re-encode them:

 insert into utf values(char(0xc1));

 insert into utf values(char(0xf9));

 select data,hex(data) from utf;

+------+-----------+
| data | hex(data) |
+------+-----------+
| [    | 5B        |
|      | 7F        |
| Á    | C381      |
| ù    | C3B9      |
+------+-----------+

(we're not interested in the 'data' column right now, since you'll see
different chars or none at all depending also on your xterm configuration)

BEAUTIFUL! We see that mysql did the right encoding for
 0xC1 (latin1) ---> 0xC381 (utf8)   and
 0xF9 (latin1) ---> 0xC3B9 (utf8)

(if you want to verify this, here's one reference to save you the trip to
Google: http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8 ;
sorry I didn't hyperlink it)
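
(For the record, the first one also checks out by hand; latin1 codepoints map straight onto U+00xx, and UTF-8 encodes everything from U+0080 to U+07FF in two bytes:)

 U+00C1 = 000 1100 0001 (11 bits, zero-padded)
 two-byte template: 110xxxxx 10yyyyyy -> xxxxx = 00011, yyyyyy = 000001
 11000011 10000001 = 0xC3 0x81 = 0xC381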

Ok, now let's try another one:

 insert into utf values(char(0x81));

 select data,hex(data) from utf;
+------+-----------+
| data | hex(data) |
+------+-----------+
| [    | 5B        |
|      | 7F        |
| Á    | C381      |
| ù    | C3B9      |
| ?    | 3F        |
+------+-----------+

What the...?
0x81 (latin1) ---> 0x3F (utf8)??? And not 0xC280 like Perl says?

OK, surely that was a fluke; let's try another one:

 insert into utf values(char(0x89));

 select data,hex(data) from utf;

+------+-----------+
| data | hex(data) |
+------+-----------+
| [    | 5B        |
|      | 7F        |
| Á    | C381      |
| ù    | C3B9      |
| ?    | 3F        |
|      | E280B0    |
+------+-----------+

Now that last one, I can't even begin to fathom what the hell the server
_tried_ to do.

Do you see now?

Suggested fix:
And that's the reason for all the 'some (not ALL) utf-8 chars getting
corrupted' bugs from different people with different charsets, including bugs
#8672, #8730, #8943, #8949, #8973, #9046, #9091, #9146, #9269, #10812 and this
one (I stopped looking for them around the time of #9269, once I'd figured the
problem out; but, by the looks of it, about one bug report in every 200 is
about this). What was happening was this: when I tried inserting the char 'Á'
into the DB through a perl app (hope you can see it, it's an 'A' with an acute
accent), perl sent the binary data 0xC381 (that char encoded in utf-8) to
mysql. But because of the character_set_* variables "issue", mysql thought it
was receiving latin1 data through that connection, so it proceeded to encode
the already-encoded utf-8 data into utf-8 again. It took the first byte (0xC3)
and encoded it -correctly- into 0xC383, followed by the second byte (0x81)
which, as I showed you about two examples ago, gets incorrectly encoded into
0x3F instead of 0xC280. Now the data has been corrupted and there's no way in
hell we're gonna get back from the DB what we put in there.
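
(You can reproduce the double encoding in one line with an introducer; a sketch of what I'd expect, where the fate of the second byte depends on whether your server has the 0x81 problem:)

 -- feed the utf-8 encoding of 'Á' (0xC381) in as if it were latin1
 select hex(convert(_latin1 0xC381 using utf8));
 -- 0xC3 (latin1 'Ã') correctly becomes C383; the 0x81 byte comes back
 -- as 3F on affected servers (it should be C281)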

Had the mysql server not messed up this conversion, and instead transparently
encoded to and from utf8, it would have taken a very curious individual to
figure out that they were in fact using almost twice the space for their data
inside the DB, and why (doubly-utf8-encoded data!). And everybody would've
kept on going with their development, never realizing that having the right
settings during a connection actually matters.
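
(A quick way to spot doubly-encoded rows, for what it's worth; both functions are standard, and the table is the one from the example above:)

 -- length() counts bytes, char_length() counts characters;
 -- doubly-encoded non-ASCII text shows roughly 2 bytes per character
 select data, length(data), char_length(data) from utf;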

Now, just so I can get some peace of mind: is it or is it not a BUG?

It has to be. Even though in all of the bugs about this that I checked, all of
us were having the same issue (some chars getting garbled/eaten/corrupted)
because we were sending utf8 data into a utf8 table, suppose I *wanted* an app
in which I send latin1 data to a utf8 table, with every setting properly
configured, and, say, tried to store every char with a value from 0x00 to
0xff: it would blow up in my face.

Thx for reading this far, hope you don't ignore it like most of the other bug
reports I mentioned.

And please, for the love of god, don't come back with "well, latin1 chars with
values between 0x80 and 0xA0 aren't REALLY chars, so that's why the server
doesn't encode them properly", 'cause that's also the case for latin1 chars
with values between 0x00 and 0x1f, and those don't get to take a trip to the
twilight zone when entered into the db.
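
(Easy enough to check with a one-liner; expected result in the comment, since U+001B is still a single byte in utf-8:)

 select hex(convert(_latin1 0x1b using utf8));  -- expected: 1B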

[/quasi-rant]

heh
[1 Dec 2005 8:54] Alexander Barkov
This problem was fixed in one of the previous releases.
The latin1 -> utf8 -> latin1 round trip is now safe.

For example, 

mysql> select hex(@a:=convert(_latin1 0x818D8F909D using utf8)) as to_utf8, hex(convert(@a using latin1)) as from_utf8;
+----------------------+------------+
| to_utf8              | from_utf8  |
+----------------------+------------+
| C281C28DC28FC290C29D | 818D8F909D |
+----------------------+------------+
1 row in set (0.00 sec)
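
The same check works for the other bytes from the original report; given the mapping shown there (0x89 -> E280B0), the expected round trip is:

 select hex(@b:=convert(_latin1 0x89 using utf8)) as to_utf8,
        hex(convert(@b using latin1)) as from_utf8;
 -- expected: to_utf8 = E280B0, from_utf8 = 89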