Bug #10812 MySQL internal problem with UTF-8 encoding of some characters (č)
Submitted: 23 May 2005 23:49  Modified: 6 Jun 2005 11:02
Reporter: James Tomek
Status: Not a Bug
Category: MySQL Server: Compiling  Severity: S2 (Serious)
Version: 4.1.12-nt  OS: Windows (WinXP)
CPU Architecture: Any

[23 May 2005 23:49] James Tomek
Description:
MySQL seems to default incoming strings to latin1 even though you specify default-character-set=utf8. This, or something else (?), causes problems when inserting UTF-8 characters into the database.

When you insert Czech-language characters into a MySQL database, all characters work fine except two (the Czech letters c and g with accents). These two characters get stored in the database incorrectly even though you set all system variables to utf8.

Here is some info about it:

Czech character c with accent:
unicode:
 decimal 269
 hex 010D
utf-8:
 decimal: 196 141
 hex C48D
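These numbers are easy to double-check; for instance, in Python:

```python
# Quick check of the numbers above for the Czech c with accent
ch = "\u010d"                       # č, Unicode decimal 269, hex 010D
print(ord(ch))                      # 269
print(ch.encode("utf-8").hex())     # c48d  (decimal bytes 196 141)
```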

If you take this character and post it into your utf-8 MySQL database, it gets saved as hex C43F. The 3F byte is decimal 63, a question mark, which is a totally different character from the Czech c with accent.

(This problem also occurs with the Czech character g with accent: ğ)

Here is a related issue:
http://bugs.mysql.com/bug.php?id=9091

How to repeat:
Create a script in a UTF-8 document and use it to insert these two characters, č and ğ, into a utf-8 MySQL database. Then look at both characters in a hex editor, or retrieve them back to a website; you won't get the original characters anymore.

Here is my setup:
mysql 4.1.12-nt running on WinXP
default-character-set=utf8
character_set_server=utf8 
character_set_system=utf8 
character_set_database=utf8 
character_set_client=utf8 
character_set_connection=utf8 
character_sets_dir=C:\mysql\share\charsets
character_set_results=utf8

Suggested fix:
I do not know how this should be fixed, but I know that if you issue SET NAMES 'utf8' before each and every call (i.e. SELECT or INSERT statement) from a script, everything works OK. A somewhat cumbersome solution. :)
[4 Jun 2005 15:33] Jorge del Conde
Hi,

Can you please give me the output of the following command on the system where you reproduced this bug:

mysql> show variables like 'char%';

This will basically tell us what character sets you are using for the client and server. If the server charsets are utf8 but the client's are not, then calling SET NAMES utf8 will fix this problem, as it will make the client charset utf8.

Thanks.
[5 Jun 2005 19:29] James Tomek
With the help of others I found out what the problem is. Specifying default-character-set=utf8 in my.ini does not necessarily mean that MySQL will use it. Some versions of the my.ini file have the following remark next to the variable: "...only client applications shipped by MySQL are guaranteed to read this section." However, the my.ini in some packages does not include this remark (I had one of them)!

Why do applications not shipped by MySQL ignore this setting in my.ini??? It really sucks. I spent three or four days before I found out this crucial detail; I was almost ready to throw the whole thing out the window and get some other database system.
[6 Jun 2005 11:02] Sergei Golubchik
There's a dedicated API call to read my.cnf, and some MySQL client applications may have decided *not* to use it and *not* to read my.cnf.

The text in the manual means that we know that our clients are written correctly, use the API the way it should be used, and read my.cnf/my.ini.

But we have no control over what other application developers do; if they choose to ignore my.cnf, that's their decision, and we cannot help it.
[6 Jun 2005 16:48] James Tomek
Just to note:

I got a version that I downloaded from MySQL web at this address:

http://dev.mysql.com/downloads/mysql/4.1.html
Windows Essentials (x86)
mysql-essential-4.1.12-win32.msi

So it looks like even a package distributed by MySQL does not read the setting.
(Sorry, I do not know what API is.)
[7 Jun 2005 20:12] Jose Alejandro Guizar
So then what's the point of having the character_set* variables if the client *always* gets to override them? They'd be better off being upgraded to forced_character_set* variables, so the server doesn't get coerced into doing something the users couldn't have seen coming (at least not before they put 12-20 man-hours of research into it).

At least we have a workaround with SET NAMES so we can force the connection charset client-side; that matters because most users will be using something other than the standard client (which, by the looks of this conversation, doesn't always play nice either) to connect and make their queries with (like Perl DBI, PHP, or what-have-you).

So with said workaround we can go back to doing our development, but that still leaves the BIG-ASS-ELEPHANT-SIZED bug within mysql's internal encoding conversion engine (or whatever is in charge of doing the conversions between charsets). The point of having a utf-8 table (or DB) is that you can store data in it from different languages and different legacy charsets (cp1250, latin1, Shift-JIS, Greek, Czech, Cyrillic, whatever), right? So, SUPPOSEDLY, mysql can convert from any of those other charsets into utf-8 right before it puts said data into the table or DB (granted, of course, that *you* tell it what charset the data you are sending is encoded in). That sounds perfect, but that's where the bugs come in.

Simple example: create a table (I have 4.1.11-4 on Debian, but I think all 4.1.x have it, maybe the ones before that, too):

 create table utf ( data varchar(10) ) charset=utf8;

and then (using the regular client) we tell the server we will be sending latin1 data (so it knows what to convert from and into utf8):

 set names 'latin1';

now, just for reference, we insert a couple of simple chars:

 insert into utf values(char(0x5b));

 insert into utf values(char(0x7f));

and check what actually got into the db:

 select data,hex(data) from utf;

+------+-----------+
| data | hex(data) |
+------+-----------+
| [    | 5B        |
|     | 7F        |
+------+-----------+

Ok, that was flawless, since those codepoints are encoded the same in both latin1 and utf8.
Now let's try some chars where the server will actually have to re-encode them:

 insert into utf values(char(0xc1));

 insert into utf values(char(0xf9));

 select data,hex(data) from utf;

+------+-----------+
| data | hex(data) |
+------+-----------+
| [    | 5B        |
|     | 7F        |
| Á    | C381      |
| ù    | C3B9      |
+------+-----------+

(we are not interested in the 'data' column right now, since you'll see different chars or none at all depending on your xterm configuration)

BEAUTIFUL! we see that mysql did the right encoding for 
0xC1 (latin1)  --->  0xC381 (utf8)     and
0xF9 (latin1)  --->   0xC3B9 (utf8)

(if you want to verify this, here's one reference to save you the trip to google
 http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8 
sorry I didn't hyperlink it)
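These two conversions can be reproduced outside MySQL; here's a quick Python check (with "latin1" meaning ISO-8859-1):

```python
# Reproduce the two conversions the server performed:
# decode the latin1 byte to a character, then re-encode it as UTF-8
for latin1_byte in (b"\xc1", b"\xf9"):          # Á and ù in latin1
    ch = latin1_byte.decode("latin-1")
    print(latin1_byte.hex(), "->", ch.encode("utf-8").hex())
# c1 -> c381
# f9 -> c3b9
```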

Ok, now let's try another one:

 insert into utf values(char(0x81));

 select data,hex(data) from utf;
+------+-----------+
| data | hex(data) |
+------+-----------+
| [    | 5B        |
|     | 7F        |
| Á    | C381      |
| ù    | C3B9      |
| ?    | 3F        |
+------+-----------+

what the...? 
0x81 (latin1)  --->  0x3F (utf8) ??? and not 0xC280 like Perl says?

ok, surely it was a fluke, let's try another one:

 insert into utf values(char(0x89));

 select data,hex(data) from utf;

+------+-----------+
| data | hex(data) |
+------+-----------+
| [    | 5B        |
|     | 7F        |
| Á    | C381      |
| ù    | C3B9      |
| ?    | 3F        |
| ‰    | E280B0    |
+------+-----------+

Now that last one I can't even begin to fathom what the hell the server _tried_ to do.
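For what it's worth, one plausible explanation (a guess, assuming MySQL's "latin1" really behaves like Windows cp1252 rather than ISO-8859-1): in cp1252 the byte 0x89 is the per-mille sign ‰ (U+2030), whose UTF-8 encoding is exactly E2 80 B0, while 0x81 is simply unmapped, so the converter has to substitute something like '?'. A Python sketch of that assumption:

```python
# Assumption: the server's "latin1" is really Windows cp1252.
# 0x89 is the per-mille sign in cp1252 -> U+2030 -> UTF-8 E2 80 B0
print(b"\x89".decode("cp1252"))                  # ‰
print("\u2030".encode("utf-8").hex())            # e280b0

# 0x81 has no character assigned in cp1252, so a converter must
# substitute something; MySQL apparently substitutes '?' (0x3F)
try:
    b"\x81".decode("cp1252")
except UnicodeDecodeError:
    print("0x81 is unmapped in cp1252")
```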

Do you see now?

And that's the reason for all the 'some (not ALL) utf-8 chars getting corrupted' bugs from different people with different charsets, including bugs #8672, 8730, 8943, 8949, 8973, 9046, 9091, 9146, 9269, and this one (and only because around the time of #9269 I figured the problem out and stopped looking for them; but, by the looks of it, there's one bug report about this for about every 100 bug reports or so).

What was happening was that, for example, when I tried inserting into the DB (through a Perl app) the char 'Á' (hope you see it, it's an 'A' with an acute accent), Perl sent the binary data 0xC381 (this char encoded in utf-8) to mysql; but due to the character_set* variables "issue", mysql thought it was receiving latin1 data through that connection, so it proceeded to utf-8-encode the already-encoded utf-8 data. It would take the first byte (0xC3) and encode it -correctly- into 0xC383, followed by the encoding of the second byte (0x81) which, as I just showed you about two examples ago, gets incorrectly encoded into 0x3F instead of 0xC280. Now the data has been corrupted and there's no way in hell we're gonna get back from the DB what we put in there.
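The double-encoding pass can be sketched in Python (using ISO-8859-1 for the bytes-to-chars step, where every byte is defined, so this shows the bogus second pass rather than the 0x81 corruption itself):

```python
# One pass: what the client actually sent
utf8_once = "\u00c1".encode("utf-8")             # Á -> b'\xc3\x81'

# Second, bogus pass: the server thinks those bytes are latin1 text,
# so it decodes each byte as a character and re-encodes as UTF-8
utf8_twice = utf8_once.decode("latin-1").encode("utf-8")
print(utf8_once.hex(), "->", utf8_twice.hex())   # c381 -> c383c281
```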

Had the mysql server not messed up this conversion and instead transparently encoded to and from utf8, it would have taken a very curious individual to figure out that they were in fact using almost twice the space for their data inside the DB, and why (doubly-utf8-encoded data!). And everybody would've kept on going with their development, never realizing that having the right settings during a connection does matter.

Now, so I can get some peace of mind, is it or is it not a BUG? 

It has to be. Even though in all of the bugs I checked we were all having the same issue (some chars getting garbled/eaten/corrupted) because we were sending utf8 data into a utf8 table, suppose I *wanted* an app in which I send latin1 data to a utf8 table, with every setting properly configured, and tried to store, say, every char with a value from 0x00 to 0xff: it would blow up in my face.

Thx for reading this far, hope you don't ignore it like most of the other bug reports I mentioned.

And please, for the love of god, don't come back with "well, latin1 chars with values between 0x80 and 0xA0 aren't REALLY chars, so that's why the server doesn't encode them properly" 'cause that's also the case for latin1 chars with values between 0x00 and 0x1f, and those don't get to take a trip to the twilight zone when entered into the db.

[/quasi-rant]

heh
[7 Jun 2005 20:22] Jose Alejandro Guizar
Just wanted to add, you only need to do a 
SET NAMES 'utf8' 
for each *connection*, not each insert or update or what-have-you.

So you can have a script with
ONE Connection
ONE SET NAMES 'utf8'
345645608 INSERTS
[7 Jun 2005 22:02] James Tomek
Hi Jose, thanks for your explanation. This issue is already closed, so I am not sure the dev people will take a look at it anymore. Shall we open a new bug report with all your findings to convince them? I agree that this is a bug.
[8 Jun 2005 0:44] Jose Alejandro Guizar
Sure thing. It wouldn't hurt, anyway, since I think a lot of people are having this problem and the solution is not that obvious. Me, I just want them to acknowledge that there *is* a bug there.

Cheers
[29 Jul 2009 5:38] Dan Padolsky
Was this issue ever resolved? I'm trying to enter Czech characters into a MySQL database and am getting strange results. I don't understand why changing the collation to utf8_czech_ci doesn't work. Shouldn't it? Is there anything I can do? Thanks.
[27 Feb 2011 17:51] Dan Padolsky
I'm still waiting for an answer. When I use Czech chars I have problems entering them into the database and retrieving them from it. And it's not just two characters; it's four of them: r, c, and e with haceks, and u with the little dot over it. Can you guys please help?