MySQL Bugs: #8730: Unicode characters garbled

Bug #8730	Unicode characters garbled
Submitted:	23 Feb 2005 13:57	Modified:	28 Jun 2005 6:28
Reporter:	Denis Wernert	Email Updates:
Status:	Not a Bug	Impact on me:	None
Category:	MySQL Server: MyISAM storage engine	Severity:	S2 (Serious)
Version:	4.1.9	OS:	Linux (Linux)
Assigned to:	Alexander Barkov	CPU Architecture:	Any

Description:
When attempting to store UTF-8 encoded contents in a MyISAM table using utf8_general_ci collation, some characters get corrupted. Strangely, this does not happen when using the Latin1 collation.

How to repeat:
Create a table with UTF8 collation including a text field. Create a new record using the following data: "&#1053;&#1072; &#1073;&#1077;&#1088;&#1077;&#1075;&#1091; &#1087;&#1091;&#1089;&#1090;&#1099;&#1085;&#1085;&#1099;&#1093; &#1074;&#1086;&#1083;&#1085;", and then issue a select statement on the data. Returned data does not match input, at least 2 characters are corrupted.
CREATE TABLE `foo` (`bar` text) ENGINE=MyISAM DEFAULT CHARACTER SET=utf8 ;
INSERT INTO `foo` (`bar`) values ('&#1053;&#1072; &#1073;&#1077;&#1088;&#1077;&#1075;&#1091; &#1087;&#1091;&#1089;&#1090;&#1099;&#1085;&#1085;&#1099;&#1093; &#1074;&#1086;&#1083;&#1085;') ;
SELECT * FROM `foo` ;

Could you please to provide your output of
show variables like 'char%';

?

Before all, thanks for your interest and quick reply.
Here it comes :

+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | latin1                     |
| character_set_connection | latin1                     |
| character_set_database   | latin1                     |
| character_set_results    | NULL                       |
| character_set_server     | latin1                     |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+

Denis, I'am afraid your client side wasn't setup properly.  You need to setup charset of client side before you send any queries to server. Could you please check your example but this way:

set names koi8r;
create table foo (bar text) CHARACTER SET=utf8 ;
insert into foo values (' &#1085;&#1072; &#1073;&#1077;&#1088;&#1077;&#1075;&#1091; &#1087;&#1091;&#1089;&#1090;&#1099;&#1085;&#1085;&#1099;&#1093; &#1074;&#1086;&#1083;&#1085; ');

Well. Of course if you console was set to (for instance) to cp1251, you must set cp1251 instead of koi8r in this example;

Hi Aleksey,
I did try your solution and it did not help.
Guess I should mention here that the characters have been escaped by the bugtracker, since I'm not trying to insert entity characters but the actual characters.
The russian text was just a random pick, the table is expected to hold data from many different languages, such as chinese, hebrew, french, etc.
I also tried to set the different character_set_* variables to different values, which did not help much.
Trying to store UTF-8 data in a table I would always get crippled data, except when using something else than UTF for the table encoding. Maybe it is helpfull to say that UCS-2 works like a charm.
I also double-checked the input encoding, and I'm pretty sure all input has the expected encoding (btw, the data comes from PHP, but I don't think that makes much of a difference).
Yet, I've read through the documentation once again, and I stumbled upon this :
"Currently, the utf8_unicode_ci collation has only partial support for the Unicode Collation Algorithm. Some characters are not supported yet. Also, combining marks are not fully supported. This affects primarily Vietnamese and some minority languages in Russia such as Udmurt, Tatar, Bashkir, and Mari."
Could that possibly be the reason ?

There really is a problem with utf-8 encoding. I had been scratching my head for 3 days straight till I decided to try every variant and test I could think of. The reason you see no problem when using latin1 encoding is that, for mysql at least, they're all valid 8-bit characters, and it just spews them back to you when queried for, the proper interpretation is being done by your end-point (your terminal or maybe a browser). I have the same problem with just *some* utf-8 characters getting garbled. From the latin1 group (U+00FF), but in utf-8, the chars that go into MySQL but don't come out the on the other side (at least in my version, 4.1.9) are:

U+00C1, U+00CD, U+00CF, U+00D0 and U+00DD. All other codepoints in the range go in and out with no problem at all. 

I made a perl script to test all of the codepoints below 300 decimal, to see, in bytes, what perl was putting into mysql and what it was immediately getting back from it, check part of the output (hope you get to see the special chars, explanation is below it, cause it will look like garbage to you at first):

Antes: ¼*2%1!c2bc&   DESPUES: ¼*2%!c2bc&
Antes: ½*2%1!c2bd&   DESPUES: ½*2%!c2bd&
Antes: ¾*2%1!c2be&   DESPUES: ¾*2%!c2be&
Antes: ¿*2%1!c2bf&   DESPUES: ¿*2%!c2bf&
Antes: À*2%1!c380&   DESPUES: À*2%!c380&
Antes: Á*2%1!c381&   DESPUES: &#65533;?*2%!c33f&
Antes: Â*2%1!c382&   DESPUES: Â*2%!c382&
Antes: Ã*2%1!c383&   DESPUES: Ã*2%!c383&
Antes: Ä*2%1!c384&   DESPUES: Ä*2%!c384&
Antes: Å*2%1!c385&   DESPUES: Å*2%!c385&
Antes: Æ*2%1!c386&   DESPUES: Æ*2%!c386&

These were codepoints U+00BC thru U+00C6. There's some more diagnostics info in there, but the important part is between the '!' and '&', signs; on each line, the first hex number between these symbols represents what perl put into the MySQL DB, and the second one (far right) represents what it got back immediately afterwards. I selected this group of chars cause, right in the middle, in the sixth line, you can see that that character got garbled up by mysql. It happens every time, only with that char and the other ones I mentioned above (and, no doubt, lots others still higher up in unicode number).  

I'm guessing if there's some files where MySQL maps diferent codepages to unicode there have to be some mistakes in the mapping. If it were a configuration problem, no unicode char would make it through, but, alas, most of them do, only a few don't make it.

All character_set* vars are utf-8 and the 3 collation vars are utf8_unicode_ci (also tried utf8_general_ci).

TO REPLICATE:

use this perl script, just replace the DB and connection data with yours, and make sure the test database has default charset utf-8 and a varchar column named 'nombre' and that said column is the second one (or modify the script a little).

use DBI;
use strict;
use utf8;
use encoding 'utf8';

$|=1;

sub long {
use bytes;
my $cad = shift(@_);
return '*'.length($cad).'%'.utf8::is_utf8($cad).'!'.unpack("H*",$cad)."&";
}

binmode (STDOUT,':utf8');

my $basedatos="db";
my $tabla="table";
my $usuario="user";
my $host="localhost";
my $puerto=3306;
my $passw='pass';
my $comando;
my $dbparams="DBI:mysql:database=$basedatos;host=$host;port=$puerto";
my $dbhandle=DBI->connect($dbparams,$usuario,$passw,{ RaiseError => 0}) || die localtime() . " No se puede conectar a MySQL: " . DBI->errstr . "\n";
my $i;
while($i++ < 300) {
	my $cr = chr($i);
	$dbhandle->do("DELETE FROM $tabla");
	print "\nAntes: ".$cr.long($cr)."   ";
	$comando = "INSERT INTO $tabla(nombre) VALUES(".$dbhandle->quote($cr).");";
	my $cmd=$dbhandle->prepare("$comando") || die localtime() . " Error al preparar el comando SQL: " . $dbhandle->errstr . "\n";
	$cmd->execute() || die localtime() . " Error al ejecutar el comando 1: " . $cmd->errstr . "\n";
	$comando = "SELECT * FROM $tabla";
	$cmd=$dbhandle->prepare("$comando") || die localtime() . " Error al preparar el comando SQL: " . $dbhandle->errstr . "\n";
	$cmd->execute() || warn localtime() . " Error al ejecutar el comando 2: " . $cmd->errstr . "\n";	
	my $arrcap=$cmd->fetchrow_arrayref();
	print "DESPUES: ".@$arrcap[1].long(@$arrcap[1]);
	}
$dbhandle->disconnect();
print "\n\nFIN\n";

I hope you can nail it down. The good news is I can use latin1 in mysql tables and put utf-8 chars into it and it won't notice. It works just as I thought utf-8 should be working.

I am having the same problem with unicode character '\u30cd' when
converted to UTF8 (hex e3 83 8d in utf8).  The data is put into the
database with "load data infile" - same values for %_set_% variables as
described below - everything is latin1 (default, not explicitly set, my
understanding is by setting things to latin1, you get different behaviour than if you just default everything).

Just before the "LOAD DATA INFILE ...." I do a SET NAMES 'utf8';

The character gets corrupted - it is stored corrupted so looks like somthing on the insert side of the logic.

An additional thing I noticed is that U+30CD maps into the UTF8
hex E3 83 8D.  8D as a 7-bit character - is an ascii <CR>.

My theory is that the LOAD command is perhaps treating this as special before it looks to see what the character set is set to?   Something to do with escape character processing in the load command which seems to be done independent of character set?

The logic in READ_INFO::next_line() where the
#ifdef USE_MB logic is located looks suspect - this is not
the exact code, but it sort of gets the idea across:

   for(int i=1; chr != my_b_EOF....) chr=GET;
   if (chr == escape_char) continue;

I am having trouble exactly following the code, but it seems like a likely place
to start looking.

Additional testing shows that an SQL Insert from Python using MySQLdb works and the utf-8 data can be extracted with a select and all is well.

Looks like the problem is specific to LOAD DATA INFILE in my case.  This may be a similar but different bug.  Let me know if I should open a new defect.

Some additional information.  If you do not specify:
   SET NAMES 'utf8';
before the load - just leave everything latin1 defaults,
then the load works just fine.

Seems to be some interaction
where if the client is utf-8 and server is latin1, but the column
is defined as utf-8, data is corrupted.  In this case, the connection,
client and result's were all set to utf-8, the system was utf8 but
the database and server were still latin1.

Should this work or is this another case of operator error?

Hi,

Thanks for your bug report.

I was able to reproduce this bug using Jose Alejandro's PERL script.

Thank you for taking the time to write to us, but this is not
a bug. Please double-check the documentation available at
http://www.mysql.com/documentation/ and the instructions on
how to report a bug at http://bugs.mysql.com/how-to-report.php

Additional info:

I cannot reproduce the problem. Everything is fine on MySQL side.
Both latin1 -> utf8 and utf -> latin1 conversion works fine.

Please check these scripts:
1. Using UTF8 column:

SET NAMES utf8;
SELECT HEX('Á'),HEX('Í'),HEX('Ð'),HEX('Ý');
DROP TABLE IF EXISTS t1;
CREATE TABLE t1 (a char(10) character set utf8);
INSERT INTO t1 VALUES ('Á'),('Í'),('Ð'),('Ý');
SELECT hex(a), a FROM t1;

You should get these results:

HEX('Á')        HEX('Í')        HEX('Ð')        HEX('Ý')
C381    C38D    C390    C39D
hex(a)  a
C381    Á
C38D    Í
C390    Ð
C39D    Ý

2. Using latin1 column:

SET NAMES utf8;
SELECT HEX('Á'),HEX('Í'),HEX('Ð'),HEX('Ý');
DROP TABLE IF EXISTS t1;
CREATE TABLE t1 (a char(10) character set latin1);
INSERT INTO t1 VALUES ('Á'),('Í'),('Ð'),('Ý');
SELECT hex(a), a FROM t1;

You should get these results:

HEX('Á')        HEX('Í')        HEX('Ð')        HEX('Ý')
C381    C38D    C390    C39D
hex(a)  a
C1      Á
CD      Í
D0      Ð
DD      Ý

I believe everything should be fixed if you add "SET NAMES utf8"
into your Perl script.