Bug #36043 UTF-8 string is broken
Submitted: 13 Apr 2008 22:03 Modified: 16 Apr 2008 17:09
Reporter: Victor Porton Email Updates:
Status: Not a Bug Impact on me:
None 
Category:MySQL Server: General Severity:S2 (Serious)
Version:5.0.51a OS:Any
Assigned to: CPU Architecture:Any

[13 Apr 2008 22:03] Victor Porton
Description:
An UTF-8 string saved to the TEXT field of the DB and then retrieved is broken (some characters are replaced with nonsense).

I'm not sure whether it is bug of MySQL or of Perl's DBI.

How to repeat:
Use the table and the Perl program from the following archive.
http://portonvictor.org/misc/sqlbug.zip

Note that Perl code is in UTF-8 encoding.
[16 Apr 2008 16:32] Susanne Ebrecht
Many thanks for reporting a bug.

This is not a bug. Lemme try to explain:

Your database: UTF8

Let's say as example:

mysql> show variables like '%char%'
character_set_client Latin1

Now you want to insert data:

Your terminal or your environment use UTF8 encoding.

You type some text here. Because character_set_client is set to Latin1 the mysql daemon know thinks that it will get Latin1 data and will recode them to UTF8.

You can see if this happens by using length().

UTF8 needs 1-4 bytes for one sign and Latin1 always just need one sign.

Just insert into a text field letters like 'ä' or 'ö'.

Then do: SELECT LENGTH(my_text_field) FROM my_database;

These letter need 2 bytes in UTF8. When you now won't get 2 bytes for one of these letters then your data are stored wrong in your database.

When you will get 1 or 4 bytes then this is wrong.

When this is happened, that the data already stored wrong at the database then you don't have a chance to repair it easily. You have to dump the data to recode them manually and to upload them again.

To avoid this: Look that your environment encoding (terminal encoding or whatever) will match to the character_set_client variable. Means on Unix/Linux: Use ISO-8859-15 for Latin1 and UTF8 for UTF8.

The other way to avoid it is:
Look what you use as environment encoding and then set the server variables to that environment encoding.
For example: when your environment is using utf8 then make a "SET NAMES UTF8" before you will start inserting data.

The set names command will set some other needful variables too.

When your data are stored right at the database, then the problem isn't so difficult to solve.

When you will get curious stuff out of the database then you have to look if character_set_client and your environment encoding will match.

If character_set_client is Latin1 and your environment is UTF8 then switch your environment to ISO-8859-15.
The other possiblitiy is here to make a "SET NAMES UTF8" to change character_set_client and some other necessary variables to UTF8.
[16 Apr 2008 16:54] Peter Laursen
@both of you ..

to this 

"When this is happened, that the data already stored wrong at the
database then you don't have a chance to repair it easily. You have to
dump the data to recode them manually and to upload them again."

there IS an easier way

refer to: http://webyog.com/faq/content/34/152/en/my-special-characters-display-as-_noisy_-latin-cha...

"So basically what you need to do is to perform changes (ALTER TABLE) in two operations like: latin1 varchar >> varbinary (or BLOB) >> utf8 varchar (and similar for char and TEXT types)... for every encoded column" And the same the other way around (utf8 >> latin1 or whatever) of course!

@Susanne: how would you *in practice* do this recoding of a dump?  
Why not use the *knowledge* of the MySQL server about encodings? 'pushing in' a non-encoded step is the solution!
[16 Apr 2008 17:09] Victor Porton
I already have UTF-8 set database, indeed it does not work.

character_set_client  	utf8
character_set_connection 	utf8
character_set_database 	binary
character_set_filesystem 	binary
character_set_results 	utf8
character_set_server 	cp1251
character_set_system 	utf8
[28 Apr 2008 15:20] Susanne Ebrecht
To figure out, which character set your database is using you should use the command:

SHOW CREATE DATABASE dbname;

Also SHOW CREATE TABLE table_name;