Bug #37739 | search query does not recognize German Umlauts | ||
---|---|---|---|
Submitted: | 30 Jun 2008 12:34 | Modified: | 30 Jun 2008 12:44 |
Reporter: | Sulakshana Deshmukh | Email Updates: | |
Status: | Not a Bug | Impact on me: | |
Category: | MySQL Server: FULLTEXT search | Severity: | S2 (Serious) |
Version: | mysql 5 | OS: | Any |
Assigned to: | CPU Architecture: | Any | |
Tags: | German Umlauts in search query |
[30 Jun 2008 12:34]
Sulakshana Deshmukh
[30 Jun 2008 12:44]
Susanne Ebrecht
Many thanks for writing a bug report. This is not a bug and let me explain in next statement why.
[30 Jun 2008 13:21]
Susanne Ebrecht
You have to consider lots of character set settings here. First of all, please test if your data are stored right at your database. What character set is your column in which you store the Umlauts are using? UTF8 or Latin1? Then please make a SELECT LENGTH() for the column in which you stored on or more Umlauts. One example for the SELECT: SELECT LENGTH(your_umlaut_column) from your database where id_of_your_row=id_from_row_where_column_with_umlaut; Now you get back how many bytes the system is using for your text. Letter from a-z and A-Z should be counted as 1 for a single letter. It's important what your system counted for ä,ö,ü,ß. If it counted 2 here for an 'ä' or another Umlaut then it's UTF8. If it counted 1 then it is LATIN1. If your column is LATIN1 and it counted 2 here. Then your data are stored wrong. If your column is UTF8 and it counted 1 here. Then your data are stored wrong. If this is your problem, then you have to repair this first. Means: dumping the data, recode them and store them back. Or just use the expected "SET NAMES" parameter here. You have three variables: character_set_client, character_set_result and character_set_connection. These three will be set by the command: "SET NAMES character_set_value" Most times character_set_value is: utf8 or latin1 or on German Windows: cp850 You can display the values of these variables by using: SHOW VARIABLES LIKE 'char%'; It doesn't matter which character set your database or your column is using. These three variables are the variables of your interest. The rest will be converted transparent and automatically. Just to have an example let's say your column is UTF8. First of all you know have to look, what encoding your environment is using. On Windows: open a command Window and type: CHCP ... usually result is CP850 On Linux: if you use Gnome terminal: there is a bar on the top of your terminal. Click there on terminal and then "encoding settings" or in German: "Zeichenkodierung festlegen". Look what encoding is marked there. if you are using a KDE terminal: there is a bar on the top of your terminal. Click on "Settings->Encoding" or in German: "Einstellungen->Kodierung" and look what is set there. If it is set to "standard" or "default" then you have to type at the Shell: $locale Usually most of Linux software will just use "default" encoding. Means, when you can't find an encoding setting in your environment it should use the value from: $ locale UTF8 will match to UTF8 and ISO-8859-15 will match to LATIN1. After you figured out which encoding your environment is using you have to change the variables to the encoding of your environment. If your environment is using UTF8 then: SET NAMES UTF8; If it is using ISO-8859-15 then: SET NAMES LATIN1; If it is using code page 850 then: SET NAMES CP850; Now you can start inserting data with Umlauten. For inserting only character_set_client is interesting. And of course character_set_connection. The character_set_client value tells the system, what encoding you are using. If you are using ISO-8859-15 and has set the character_set_client to latin1 and the column in which you want to store data expects utf8 then the system will automatically and transparent convert your inserts into utf8. Means i.e. making 2 bytes out of your 'ä'. If you are using UTF8 and has set the character_set_client to utf8 and the column in which you want to store data expects latin1 then the system will automatically and transparent convert your inserts into latin1. Means i.e. making a 1 byte 'ö' out of your inserted 2 byte 'ö'. For results ... means for all what you want to display on your computer you need also the character_set_result. This should also be set to your environment encoding, if you want to read the signs correct. When you have an ISO-8859-15 environment and the column is UTF8 and you made a "set names latin1;" then the system nows: "ahhh the 'ä' from the 'where a='ä'' is coded in latin1 and I have to convert it to utf8 before comparing. Then it compares and let's say it will find an 'ä'. But this 'ä' is utf8 encoded. Now the system is looking to character_set_result and will convert the 'ä' to the value of character_set_result. After that it will give it back to your client and you can read it proper. Same of course vice versa with UTF8 and LATIN1 or also with CP850.