MySQL Bugs: #9091: Characters 'č' and 'ğ' do not work in 4.1

Bug #9091	Characters 'č' and 'ğ' do not work in 4.1 - they worked in 4.0
Submitted:	10 Mar 2005 12:54	Modified:	12 Mar 2005 8:50
Reporter:	Jan Novak	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server	Severity:	S2 (Serious)
Version:	4.1	OS:	Linux (linux, w2k)
Assigned to:		CPU Architecture:	Any

Description:
I work on a project which has 50 languages in sql tables. Everything worked fine with 4.0 but in 4.1 "&#269;" czech character c with accent does not work. I use exactly the same methods to fill the table all other things are same - apache, php all scripts and input data are also the same. Most of other characters work OK for example "š&#345;žáí&#357;&#271;" work OK. But &#269; does not work. MySQL select returns ?? in firefox and something like "L." in IE. You can see the problem at:
http://dicts.info/di1.php?k1=1&k2=21&search=speech
On the left there should be:
speech = &#345;e&#269;
instead of
speech = &#345;e&#65533;?
also g with accent searches for &#351; instead for g with accent
try: 
http://dicts.info/di1.php?k1=97&k2=214&search=%C4%9F 
or
http://dicts.info/di1.php?k1=97&k2=214&search=&#287;
you can see 4.1 mysql server status at:
http://dicts.info/sqsta.php
there are some encodings set to latin1 but I have tried to install 4.1 with utf-8 enabled everywhere and it did the same thing.

Another thing is that indexes did not work as they used to work in 4.0 I had to delete them all to be able to use select. If I searched for "av%" I got only first part of results for "where word1 like av% or word2 like av%"
There were no matches for word2 found even if there were many. I used 2 character "word" index for word1 and word2 column.

How to repeat:
Create file input.txt with UTF-8 encoding, copy some lines with "&#269;"  "&#287;" (c and g with accent) into them and load them with PHP into mysql 4.1 (careful everything works in 4.0! it does not work in 4.1) with fopen and then insert them into mysql table. Yes utf-8 works in PHP - it can read and insert even chinese or japanese chars without problems. Then take a look at the table in mysql I have "speech  	Å™eÄ?  	" there in 4.1 in 4.0 there is normally "speech     &#345;e&#269;". Select * from the table and display it in the browser while you set header of the page to utf-8 you will see that all characters except for &#269; are displayed. 4.1 is not able to search for &#287;

Suggested fix:
There is something different in 4.1. Everything worked in 4.0. Take a look at the difference how &#269; is handled in 4.0 and in 4.1 and compare it to for example &#345;. &#345; works &#269; does not work. Take a look at how mysql 4.1 selects strings with &#287;. You know what exactly is different in 4.1. There is some problem in how these chars are handled with.

Hi!

If you are using UTF-8, then those characters are not valid single-byte UTF-8 char codes, and that may be the cause of the problem. I guess the Czech c has a 2-byte code in UTF-8.

Regards,

Heikki

That's pretty much the standard reply for utf8 trouble around here, don't expect much help. I was having the same kind of troubles and I was even following several similar bugs, but none has been resolved. Look at bugs #8672, 8730, 8943, 8949, 8973 and 9046.

Looking at all those bugs it's evident that mysql (4.1.x) is having utf8 trouble only with some chars, but not the majority.

After wading through the charset hell a while longer, I think I've found out why this is:

the character_set_* variables tell mysql to encode and/or decode data upon receipt from the client or just before posting results (this much is standard knowledge, bear with me). SUPPOSEDLY, if you set these global variables, every connection afterwards will respect these values. Thing is, that's not happening (or wasn't happening in my case), so mysql is receiving your utf8 encoded strings and then REENCODING them again internally as utf8, decoding them from utf8-utf8 to plain utf8 on the way out. This process is buggy and some byte values are not correctly encoded and the n can't be decoded later. There are two ways to verify if this is happening to you:

1) create a table with utf8 charset and insert (using PHP, this is very important, you won't see the effect from the standard client) some chars you know work, one per insert, and other chars you know DON'T work, now do a SELECT col_name,hex(col_name) FROM table; and see what exactly is in there. In my case, a single char, which in utf-8 is encoded as two bytes, was encoded as 4 bytes inside mysql. If that's the case with you, the way to fix it is:

2) I use DBD::mysql, but it should work with PHP also. I'm not sure how PHP<->mysql connections are handled within PHP, but right after you connect and just before you do any inserts or selects on your tables, do a SET NAMES 'utf8', which will ensure that mysql doesn't reencode already encoded utf8.

Hope that helps.

Ok, I see the confusion now.

"SUPPOSEDLY, if you set these global variables, every connection afterwards will respect these values"

In fact, server sets clien/connection/results character sets to global values ONLY if the client did not provide its preferred character set (or client's value was invalid). That is client's --default-character-set option override server's global variables.

Without --default-characters-set client's default is usually latin1.
So if you send a utf8 string, a server THINKS it's latin1 and "converts" its to utf8, which is, of course, not what you wanted.

SET NAMES utf8 in the beginning of every section is correct solution.
Another solution is to set client's defaults with --default-character-set

Hi,

a was having the same problem with the &#269; (Czech letter c with accent) as well. This suggestion: "Another solution is to set client's defaults with --default-character-set" did not work for me. However, I was successful with SET NAMES 'utf8' in front of every sql query. Thanks Jose!

I want to also add to the fact that this really is a bug in mysql. Here is some info about it:

Czech character c with accent:
unicode:
 decimal 269
 hex 010D
utf-8:
 decimal: 196 141
 hex C48D

If you take this character and post it into your utf-8 MySql database, it gets saved as hex C43F. This translates into unicode as decimal 79, hex 4F which is a totally different character from Czech c with accent.

I am running MySql version 4.1.12-nt.

Ps: I feel so sorry for all the folks that are trying to have their Czech web in MySQl utf-8. I am quite new to all this and spent tree days on this problem.

Pps: This is a good resource for char set conversions.
http://mlha.cz/unicode/utf8.php

I want to add to this one:

"...So if you send a utf8 string, a server THINKS it's latin1 and "converts" its to
utf8, which is, of course, not what you wanted. ...Another solution is to set client's defaults with --default-character-set."

The problems exests even if you specify the default-character-set. The following are my variables which I had set up and still was experiencing the problem.

default-character-set=utf8
character_set_server=utf8 
character_set_system=utf8 
character_set_database=utf8 
character_set_client=utf8 
character_set_connectionu=tf8 
character_sets_dir=C:\mysql\share\charsets
character_set_results=utf8

The only solution that seems to help is to use set names 'utf8' in front of every and each call from a script. A little cumbersome solution. :)

I also have this problem, as we have such letter in Lithuanian too. Using mysql_query("SET NAMES 'utf8'"); in php code helped, but it's only a workaround, not a correct sollution. Why this bug is closed without any info?

See BUG#9948

I experienced the same bug on Linux 4.1.13 with Russian с, н and few other chars as well. It was not the case on win32 4.1.16.

Thanks for your comment Jose, "SET NAMES 'utf8'" worked out just fine.