MySQL Bugs: #8949: MySQL doesn't recognize some UTF-8 chars

Bug #8949	MySQL doesn't recognize some UTF-8 chars
Submitted:	4 Mar 2005 17:08	Modified:	4 Mar 2005 17:20
Reporter:	Jose Alejandro Guizar	Email Updates:
Status:	Duplicate	Impact on me:	None
Category:	MySQL Server	Severity:	S2 (Serious)
Version:	4.1.9	OS:	Linux (Debian Linux 3.3)
Assigned to:		CPU Architecture:	Any

Description:
I had included the following text under bug #8730, but since I wasn't the one to open it, I can't change the [Feedback needed] flag, so I'm gonna copy it here:

There really is a problem with utf-8 encoding. I had been scratching my head for
3 days straight till I decided to try every variant and test I could think of.
The reason you see no problem when using latin1 encoding is that, for mysql at
least, they're all valid 8-bit characters, and it just spews them back to you
when queried for, the proper interpretation is being done by your end-point
(your terminal or maybe a browser). I have the same problem with just *some*
utf-8 characters getting garbled. From the latin1 group (U+00FF), but in utf-8,
the chars that go into MySQL but don't come out the on the other side (at least
in my version, 4.1.9) are:

U+00C1, U+00CD, U+00CF, U+00D0 and U+00DD. All other codepoints in the range go
in and out with no problem at all. 

I made a perl script to test all of the codepoints below 300 decimal, to see, in
bytes, what perl was putting into mysql and what it was immediately getting back
from it, check part of the output (hope you get to see the special chars,
explanation is below it, cause it will look like garbage to you at first):

Antes: ¼*2%1!c2bc&   DESPUES: ¼*2%!c2bc&
Antes: ½*2%1!c2bd&   DESPUES: ½*2%!c2bd&
Antes: ¾*2%1!c2be&   DESPUES: ¾*2%!c2be&
Antes: ¿*2%1!c2bf&   DESPUES: ¿*2%!c2bf&
Antes: À*2%1!c380&   DESPUES: À*2%!c380&
Antes: Á*2%1!c381&   DESPUES: &#65533;?*2%!c33f&
Antes: Â*2%1!c382&   DESPUES: Â*2%!c382&
Antes: Ã*2%1!c383&   DESPUES: Ã*2%!c383&
Antes: Ä*2%1!c384&   DESPUES: Ä*2%!c384&
Antes: Å*2%1!c385&   DESPUES: Å*2%!c385&
Antes: Æ*2%1!c386&   DESPUES: Æ*2%!c386&

These were codepoints U+00BC thru U+00C6. There's some more diagnostics info in
there, but the important part is between the '!' and '&', signs; on each line,
the first hex number between these symbols represents what perl put into the
MySQL DB, and the second one (far right) represents what it got back immediately
afterwards. I selected this group of chars cause, right in the middle, in the
sixth line, you can see that that character got garbled up by mysql. It happens
every time, only with that char and the other ones I mentioned above (and, no
doubt, lots others still higher up in unicode number).  

I'm guessing if there's some files where MySQL maps diferent codepages to
unicode there have to be some mistakes in the mapping. If it were a
configuration problem, no unicode char would make it through, but, alas, most of
them do, only a few don't make it.

All character_set* vars are utf-8 and the 3 collation vars are utf8_unicode_ci
(also tried utf8_general_ci).

How to repeat:
use this perl script, just replace the DB and connection data with yours, and
make sure the test database has default charset utf-8 and a varchar column named
'nombre' and that said column is the second one (or modify the script a
little).

use DBI;
use strict;
use utf8;
use encoding 'utf8';

$|=1;

sub long {
use bytes;
my $cad = shift(@_);
return '*'.length($cad).'%'.utf8::is_utf8($cad).'!'.unpack("H*",$cad)."&";
}

binmode (STDOUT,':utf8');

my $basedatos="db";
my $tabla="table";
my $usuario="user";
my $host="localhost";
my $puerto=3306;
my $passw='pass';
my $comando;
my $dbparams="DBI:mysql:database=$basedatos;host=$host;port=$puerto";
my $dbhandle=DBI->connect($dbparams,$usuario,$passw,{ RaiseError => 0}) || die
localtime() . " No se puede conectar a MySQL: " . DBI->errstr . "\n";
my $i;
while($i++ < 300) {
	my $cr = chr($i);
	$dbhandle->do("DELETE FROM $tabla");
	print "\nAntes: ".$cr.long($cr)."   ";
	$comando = "INSERT INTO $tabla(nombre) VALUES(".$dbhandle->quote($cr).");";
	my $cmd=$dbhandle->prepare("$comando") || die localtime() . " Error al preparar
el comando SQL: " . $dbhandle->errstr . "\n";
	$cmd->execute() || die localtime() . " Error al ejecutar el comando 1: " .
$cmd->errstr . "\n";
	$comando = "SELECT * FROM $tabla";
	$cmd=$dbhandle->prepare("$comando") || die localtime() . " Error al preparar el
comando SQL: " . $dbhandle->errstr . "\n";
	$cmd->execute() || warn localtime() . " Error al ejecutar el comando 2: " .
$cmd->errstr . "\n";	
	my $arrcap=$cmd->fetchrow_arrayref();
	print "DESPUES: ".@$arrcap[1].long(@$arrcap[1]);
	}
$dbhandle->disconnect();
print "\n\nFIN\n";

Suggested fix:
I hope you can nail it down. The good news is I can use latin1 in mysql tables
and put utf-8 chars into it and it won't notice. It works just as I thought
utf-8 should be working.

Please try to submit your new comments in http://bugs.mysql.com/bug.php?id=8730

and not open a duplicate or related bug report.

Thanks in advance.