Bug #6040 can't retrieve records with umlaut characters in case insensitive manner
Submitted: 12 Oct 2004 8:22 Modified: 18 Oct 2004 10:28
Reporter: Denitsa Pelova Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server Severity:S2 (Serious)
Version:4.1.5 OS:Linux (Red Hat 9)
Assigned to: Alexander Barkov CPU Architecture:Any

[12 Oct 2004 8:22] Denitsa Pelova
Description:
Hello, 
the problem is the following: 
I have a database with some texts containig umlaut characters like ü,ä,Ü ... 
If I make a query: 
 
mysql> select name from my_table where name like 'Übertragung%' limit 1; 
+----------------------------------------------------+ 
| name                                               | 
+----------------------------------------------------+ 
| Übertragung aus dem Na Litavce Stadion in Pribram | 
+----------------------------------------------------+ 
1 row in set (0.00 sec) 
 
So I get the result, but if I search like this: 
 
mysql> select name from BroadcastTextInter where name like 'übertragung%' 
limit 1; 
Empty set (0.73 sec) 
 
The second query should be working, as the default character set and collation 
allow case insensitive search, but it doesn't. 
This is working if the query doesn't have umlaut characters in it. 
 
Also if I do: 
mysql> select name from BroadcastTextInter where name like 
'Führerstands-Mitfahrten' limit 1; 
+--------------------------+ 
| name                     | 
+--------------------------+ 
| Führerstands-Mitfahrten | 
+--------------------------+ 
1 row in set (0.00 sec) 
, but: 
mysql> select upper(name) from BroadcastTextInter where name like 
'Führerstands-Mitfahrten' limit 1; 
+--------------------------+ 
| upper(name)              | 
+--------------------------+ 
| FüHRERSTANDS-MITFAHRTEN | 
+--------------------------+ 
1 row in set (0.00 sec) 
 
So the upper() (and the lower() too) functions also don't work with umlauts. 
This is the same on versions 3.23 and 5. 

How to repeat:
To try this, you need a table with a field containig umlaut charactres and 
then try to select it using the opposite case(upper/lower) in the LIKE part of 
the query, or just use upper() or lower() functions on it. 

Suggested fix:
I read in the manual, that when MySQL is comparing in case insensitive manner 
it converts my search string to upper case and then compares the ASCII codes 
so that 'a' and 'A' letters for example, are treated as the same letter. 
On the other hand mysql uses C to do the conversion, but in the 
tolower()/toupper () functions manual (man 3 tolower) it is said that the 
'default locale' doesn't recognize umlaut characters, so when the conversion 
can't be done the input character is returned. This is the same when you try 
to use upper() in the select statement(all umlauts stay in lower case). So 
there is this setlocale() functions which can set the 'locale' to something 
else then the default, and for example using it like this: 
setlocale(LC_CTYPE, "de_DE.iso88591"); 
Allows toupper() and tolower() functions to work correctly
[12 Oct 2004 22:20] MySQL Verification Team
I wasn't able to repeat with a simple table I built. However please
verify if the bug:

http://bugs.mysql.com/bug.php?id=6043

can applied to your case. Otherwise can you send a test case with your
my.cnf ?

Thanks
[13 Oct 2004 10:45] Denitsa Pelova
Hello,  
 
here's an easy test that will help you reproduce the problem: 
 
CREATE TABLE Person ( 
  id bigint(20) NOT NULL, 
  name varchar(255) default NULL 
) ENGINE=MyISAM DEFAULT CHARSET=latin1; 
 
INSERT INTO Person VALUES (1, 'Günter André'); 
INSERT INTO Person VALUES (2, 'Don Überlingen'); 
INSERT INTO Person VALUES (3, 'Annika Kruse'); 
 
SELECT LOWER(name) FROM Person; 
SELECT UPPER(name) FROM Person; 
SELECT name FROM Person WHERE LOWER(name) LIKE lower('annika%'); 
SELECT name FROM Person WHERE LOWER(name) LIKE lower('GÜNTER%'); 
SELECT name FROM Person WHERE UPPER(name) LIKE UPPER('GÜNTER%'); 
SELECT UPPER(name) FROM Person WHERE UPPER(name) LIKE 
UPPER('Günter%'); 
 
- The result of each query  is as follows: 
 
mysql> SELECT LOWER(name) FROM Person; 
+-----------------+ 
| LOWER(name)     | 
+-----------------+ 
| günter andré  | 
| don Überlingen | 
| annika kruse    | 
+-----------------+ 
3 rows in set (0.00 sec) 
- Observe line 2. No lower on  Ü applied. 
 
mysql> SELECT UPPER(name) FROM Person; 
+-----------------+ 
| UPPER(name)     | 
+-----------------+ 
| GüNTER ANDRé  | 
| DON ÜBERLINGEN | 
| ANNIKA KRUSE    | 
+-----------------+ 
3 rows in set (0.00 sec) 
- Observe line 1. No upper on letters ü & é 
 
mysql> SELECT name FROM Person WHERE LOWER(name) LIKE lower('annika%'); 
+--------------+ 
| name         | 
+--------------+ 
| Annika Kruse | 
+--------------+ 
1 row in set (0.00 sec) 
- OK 
 
mysql> SELECT name FROM Person WHERE LOWER(name) LIKE 
lower('GÜNTER%'); 
Empty set (0.00 sec) 
- no match as lower cannot turn Ü into ü in the LIKE part 
 
mysql> SELECT name FROM Person WHERE UPPER(name) LIKE 
UPPER('GÜNTER%'); 
Empty set (0.00 sec) 
- no match as lower cannot turn ü into Ü in the UPPER(name) part 
 
mysql> SELECT UPPER(name) FROM Person WHERE UPPER(name) LIKE 
UPPER('Günter%'); 
+----------------+ 
| UPPER(name)    | 
+----------------+ 
| GüNTER ANDRé | 
+----------------+ 
1 row in set (0.00 sec) 
- worked only because both UPPER(name) and  LIKE UPPER('Günter%') couldn't turn 
ü into Ü and therefore the query looked like: where 'GüNTER ANDRé' like 'GüNTER%' 
 
 
Now, no matter if I have a my.cnf or not, the problem persist.  
But may be I am overlooking something. Here's my my.cnf content 
cat /etc/my.cnf 
----------------------- 
[mysqld] 
datadir=/var/lib/mysql 
socket=/var/lib/mysql/mysql.sock 
 
[mysql.server] 
user=mysql 
basedir=/var/lib 
 
[safe_mysqld] 
err-log=/var/log/mysqld.log 
pid-file=/var/run/mysqld/mysqld.pid 
------------------------------ 
 
The tests have been done on Red Hat 9 and on Fedora Core 2 using mysql v. 3.23,  
4.1.4,  4.1.5,  5.0.1 
 
To your question: 
no, this problem is not same as http://bugs.mysql.com/bug.php?id=6043.  
Changing database and table charset to utf8 doen't help. 
 
Regards 
Denitsa Pelova
[15 Oct 2004 7:37] Alexander Keremidarski
Äåíèöà,

By default 4.1 and 5.0 set all character sets to latin1 with latin1_swedish_ci
What you need is either one of german collations latin1_german1_ci or latin1_german2_ci or utf8 with utf8_general_ci (ci suffix stands for Case Insensitive)

Fedora itself is all utf8 as you know.

> Changing database and table charset to utf8 doen't help. 

Database charset is just a default as well as table charset. How did you changed teh table charset?

ALTER TABLE Person DEFAULT CHARSET utf8;  
this syntax only changes the default and  does *not* convert exisitng data. You can see that with:

mysql> SHOW CREATE TABLE Person;
CREATE TABLE `Person` (
  `id` bigint(20) NOT NULL default '0',
  `name` varchar(255) character set latin1 default NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

If you want to convert the data to a new charset the proper syntax is:
 ALTER TABLE Person CONVERT TO CHARSET utf8;

Now it becomes:
CREATE TABLE `Person` (
  `id` bigint(20) NOT NULL default '0',
  `name` varchar(255) default NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

Now if client charsets are also utf8 UPPER() and LOWER() will work (change all client character sets to utf8 with SET NAMES utf8) :

mysql> SELECT name FROM Person WHERE LOWER(name) LIKE
    -> lower('GÜNTER%');
+----------------+
| name           |
+----------------+
| Günter André   |
| Günter André   |
+----------------+
2 rows in set (0.00 sec)
 
mysql> SELECT name FROM Person WHERE UPPER(name) LIKE
    -> UPPER('GÜNTER%');
+----------------+
| name           |
+----------------+
| Günter André   |
| Günter André   |
+----------------+
2 rows in set (0.00 sec)

However case insensitive match still does not work:

mysql> SELECT name FROM Person WHERE name LIKE 'GÜNTER%';
Empty set (0.00 sec)

This is because both name and 'GÜNTER%' have charset utf8, collation utf8_general_ci

Let's make it working:
mysql> SELECT name FROM Person WHERE name LIKE 'GÜNTER%' COLLATE utf8_unicode_ci;
+----------------+
| name           |
+----------------+
| Günter André   |
| Günter André   |
+----------------+
2 rows in set (0.00 sec)

Whether it's a bug with utf8_general_co collation is to be investigated by us. Meanwhile I suggest you reading about Collations and set utf8 + utf8_unicode_ci as defaults in both mysqld and client sections of your my.cnf

mysql> ALTER TABLE Person CONVERT TO CHARACTER SET latin1 COLLATE latin1_german1_ci;
Query OK, 6 rows affected (0.04 sec)
Records: 6  Duplicates: 0  Warnings: 0
 
mysql> SET NAMES latin1;
Query OK, 0 rows affected (0.00 sec)
 
mysql> SELECT name FROM Person WHERE name LIKE 'GÜNTER%' COLLATE latin1_german1_ci;
Empty set (0.00 sec)

However when setting all character sets to latin1 with both collations latin1_german1_ci and latin1_german2_ci I get the same results as yours:

mysql> ALTER TABLE Person CONVERT TO CHARACTER SET latin1 COLLATE latin1_german1_ci;

mysql> select * from Person;
+----+-----------------+
| id | name            |
+----+-----------------+
|  1 | Günter André    |
|  2 | Don Überlingen  |
|  3 | Annika Kruse    |
+----+-----------------+
3 rows in set (0.00 sec)
 
mysql> SELECT LOWER(name) FROM Person;
+-----------------+
| LOWER(name)     |
+-----------------+
| g��nter andr��      |
| don ��berlingen   |
| annika kruse    |
+-----------------+
3 rows in set (0.00 sec)
 
mysql> SELECT UPPER(name) FROM Person;
+-----------------+
| UPPER(name)     |
+-----------------+
| GüNTER ANDRé    |
| DON ÜBERLINGEN  |
| ANNIKA KRUSE    |
+-----------------+
3 rows in set (0.00 sec)
 
mysql> SELECT name FROM Person WHERE LOWER(name) LIKE lower('annika%');
+--------------+
| name         |
+--------------+
| Annika Kruse |
+--------------+
1 row in set (0.02 sec)
 
mysql> SELECT name FROM Person WHERE LOWER(name) LIKE lower('GÜNTER%');
Empty set (0.02 sec)
 
mysql> SELECT name FROM Person WHERE UPPER(name) LIKE UPPER('GÜNTER%');
Empty set (0.01 sec)
 
mysql> SELECT UPPER(name) FROM Person WHERE UPPER(name) LIKE
    -> UPPER('Günter%');
+----------------+
| UPPER(name)    |
+----------------+
| GüNTER ANDRé   |
+----------------+
1 row in set (0.00 sec)

This definitely seems wrong.

Thank you a lot for this perfect test case!

Easiest way
[16 Oct 2004 18:26] Alexander Barkov
Denitsa, can you please start 4.1.x again, create data
in "Person" table from the beginning using the same script,
and then run these queries:

SELECT HEX(name) FROM Person;
SHOW VARIABLES LIKE 'character_set%'

Thank you very much!
[16 Oct 2004 18:40] Denitsa Pelova
Hello Alexander, 
here are the queries: 
 
mysql> SELECT HEX(name) FROM Person; 
+--------------------------------+ 
| HEX(name)                      | 
+--------------------------------+ 
| 47C3BC6E74657220416E6472C3A9   | 
| 446F6E20C39C6265726C696E67656E | 
| 416E6E696B61204B72757365       | 
+--------------------------------+ 
3 rows in set (0.00 sec) 
 
mysql> SHOW VARIABLES LIKE 'character_set%'; 
+--------------------------+----------------------------+ 
| Variable_name            | Value                      | 
+--------------------------+----------------------------+ 
| character_set_client     | latin1                     | 
| character_set_connection | latin1                     | 
| character_set_database   | latin1                     | 
| character_set_results    | latin1                     | 
| character_set_server     | latin1                     | 
| character_set_system     | utf8                       | 
| character_sets_dir       | /usr/share/mysql/charsets/ | 
+--------------------------+----------------------------+ 
7 rows in set (0.11 sec)
[16 Oct 2004 19:09] Alexander Barkov
It looks like the problem is that the client sends
everything in UTF8, but the server thinks it is latin1.

Can you please add  "SET NAMES utf8;" in the beginning
of the script, then run it from the beginning once again.

Do SELECTs now return the expected results?

If you run SELECTs in a separate session, please execute
"SET NAMES utf8;" in this session too.

thanks.
[16 Oct 2004 19:30] Denitsa Pelova
Hi again :), here are the results: 
 
mysql> set names utf8; 
Query OK, 0 rows affected (0.00 sec) 
 
mysql> SELECT HEX(name) FROM Person; 
+--------------------------------+ 
| HEX(name)                      | 
+--------------------------------+ 
| 47C3BC6E74657220416E6472C3A9   | 
| 446F6E20C39C6265726C696E67656E | 
| 416E6E696B61204B72757365       | 
+--------------------------------+ 
3 rows in set (0.00 sec) 
 
mysql> SHOW VARIABLES LIKE 'character_set%'; 
+--------------------------+----------------------------+ 
| Variable_name            | Value                      | 
+--------------------------+----------------------------+ 
| character_set_client     | utf8                       | 
| character_set_connection | utf8                       | 
| character_set_database   | latin1                     | 
| character_set_results    | utf8                       | 
| character_set_server     | latin1                     | 
| character_set_system     | utf8                       | 
| character_sets_dir       | /usr/share/mysql/charsets/ | 
+--------------------------+----------------------------+ 
7 rows in set (0.01 sec) 
 
mysql> select upper(name) from Person; 
+--------------------+ 
| upper(name)        | 
+--------------------+ 
| GüNTER ANDRé | 
| DON Ã?BERLINGEN  | 
| ANNIKA KRUSE       | 
+--------------------+ 
3 rows in set (0.00 sec) 
 
mysql> select lower(name) from Person; 
+--------------------+ 
| lower(name)        | 
+--------------------+ 
| gã¼nter andrã© | 
| don ã?berlingen  | 
| annika kruse       | 
+--------------------+ 
3 rows in set (0.00 sec) 
 
mysql> select name from Person; 
+--------------------+ 
| name               | 
+--------------------+ 
| Günter André | 
| Don Ã?berlingen  | 
| Annika Kruse       | 
+--------------------+ 
3 rows in set (0.00 sec) 
 
This time I get these strange looking symbols...
[16 Oct 2004 19:45] Denitsa Pelova
I also tried that: 
 
mysql> set character_set_database= utf8; 
Query OK, 0 rows affected (0.00 sec) 
 
mysql> set character_set_server= utf8; 
Query OK, 0 rows affected (0.00 sec) 
mysql> SHOW VARIABLES LIKE 'character_set%'; 
+--------------------------+----------------------------+ 
| Variable_name            | Value                      | 
+--------------------------+----------------------------+ 
| character_set_client     | utf8                       | 
| character_set_connection | utf8                       | 
| character_set_database   | utf8                       | 
| character_set_results    | utf8                       | 
| character_set_server     | utf8                       | 
| character_set_system     | utf8                       | 
| character_sets_dir       | /usr/share/mysql/charsets/ | 
+--------------------------+----------------------------+ 
7 rows in set (0.00 sec) 
 
mysql> select lower(name) from Person; 
+--------------------+ 
| lower(name)        | 
+--------------------+ 
| gã¼nter andrã© | 
| don ã?berlingen  | 
| annika kruse       | 
+--------------------+ 
3 rows in set (0.00 sec) 
 
mysql> select upper(name) from Person; 
+--------------------+ 
| upper(name)        | 
+--------------------+ 
| GüNTER ANDRé | 
| DON Ã?BERLINGEN  | 
| ANNIKA KRUSE       | 
+--------------------+ 
3 rows in set (0.00 sec) 
 
But still not working..
[16 Oct 2004 19:46] Alexander Barkov
Denitsa, 

You added "SET NAMES utf8;" before running SELECTs and SHOW.
This is done correctly.

But you need to recreate data with "SET NAMES utf8;" too.

Please add "SET NAMES utf8;" in the beginnin of the script
which create the table and inserts the records. And run it 
again.

The run SELECTs once again, with "SET NAMES utf8;" in the beginning.
[17 Oct 2004 7:52] Denitsa Pelova
Hi Alexander, I did it again the way you said and here are the results: 
 
mysql> set names utf8; 
Query OK, 0 rows affected (0.07 sec) 
 
mysql> create table Person (name VARCHAR(255) DEFAULT NULL) ENGINE=MyISAM 
DEFAULT CHARSET=utf8; 
Query OK, 0 rows affected (0.09 sec) 
 
mysql> INSERT INTO Person VALUES ('Günter André'); 
Query OK, 1 row affected (0.00 sec) 
 
mysql> INSERT INTO Person VALUES ('Don Überlingen'); 
Query OK, 1 row affected (0.00 sec) 
 
mysql> INSERT INTO Person VALUES ('Annika Kruse'); 
Query OK, 1 row affected (0.00 sec) 
 
mysql> SELECT LOWER(name) FROM Person; 
+-----------------+ 
| LOWER(name)     | 
+-----------------+ 
| günter andré  | 
| don überlingen | 
| annika kruse    | 
+-----------------+ 
3 rows in set (0.06 sec) 
 
mysql> SELECT UPPER(name) FROM Person; 
+-----------------+ 
| UPPER(name)     | 
+-----------------+ 
| GÜNTER ANDRÉ  | 
| DON ÜBERLINGEN | 
| ANNIKA KRUSE    | 
+-----------------+ 
3 rows in set (0.03 sec) 
 
mysql> SELECT name FROM Person WHERE LOWER(name) LIKE lower('annika%'); 
+--------------+ 
| name         | 
+--------------+ 
| Annika Kruse | 
+--------------+ 
1 row in set (0.00 sec) 
 
mysql> SELECT name FROM Person WHERE LOWER(name) LIKE lower('GÜNTER%'); 
+----------------+ 
| name           | 
+----------------+ 
| Günter André | 
+----------------+ 
1 row in set (0.00 sec) 
 
mysql> SELECT name FROM Person WHERE UPPER(name) LIKE UPPER('günter%'); 
+----------------+ 
| name           | 
+----------------+ 
| Günter André | 
+----------------+ 
1 row in set (0.00 sec) 
 
So far everything is OK, but the case insensitive search still won't return 
anything: 
 
mysql> SELECT name FROM Person WHERE name like 'don ü%'; 
Empty set (0.01 sec) 
 
mysql> select name from Person where name like 'GÜNTER%'; 
Empty set (0.00 sec) 
mysql> select name from Person where name like 'günter%'; 
+----------------+ 
| name           | 
+----------------+ 
| Günter André | 
+----------------+ 
1 row in set (0.00 sec) 
 
Anyway the lower() and upper() functions will solve my problem, so thank you 
very much for your help! 
If you need me to do any other tests I'll be glad to help you :) 
 
Best regards, 
Denitsa
[17 Oct 2004 7:58] Denitsa Pelova
I forgot to try with that: 
mysql> SELECT name FROM Person WHERE name like 'don ü%' collate 
utf8_unicode_ci; 
+-----------------+ 
| name            | 
+-----------------+ 
| Don Überlingen | 
+-----------------+ 
1 row in set (0.10 sec) 
 
mysql> select name from Person where name like 'günter%' collate 
utf8_unicode_ci; 
+----------------+ 
| name           | 
+----------------+ 
| Günter André | 
+----------------+ 
1 row in set (0.00 sec) 
 
So to make the case insensitive search work I have to add the collation at the 
end of the query... 
Thanks again!
[17 Oct 2004 15:30] Alexander Barkov
Denitsa, thank you very much for your tests. 
It is something wring with utf8_general_ci,
whith is the default collation for utf8. I'll try
to fix it on Monday. Sorry for inconvenience.

As a temporary solution, not to write COLLATE in each SELECT,
you can specify collation during CREATE TABLE, like this:

create table Person (name VARCHAR(255) DEFAULT NULL)
ENGINE=MyISAM  DEFAULT CHARSET=utf8 COLLATE utf8_unicode_ci;

After that, all comparisons and LIKEs between column "name"
and a string constant will be done according to utf8_unicode_ci,
which seems to work fine with diactrics.
[18 Oct 2004 6:45] Alexander Barkov
Denitsa, I found the reason of this bug in the code, and will try to fix it.

Did COLLATE utf8_unicode_ci in CREATE TABLE help you?

I can suggest another solution. If you need ONLY German language,
then UTF8 is not really necessary and it is possible to use latin1
in the database. latin1 is usually faster. However, if you want
to use many languages at the same time, then latin1 will not work.

If you are interested, we can tell how to configure to store data
in latin1, but at the same time use utf8 for client-server comminication.
[18 Oct 2004 10:28] Alexander Barkov
Thank you for your bug report. This issue has been committed to our
source repository of that product and will be incorporated into the
next release.

If necessary, you can access the source repository and build the latest
available version, including the bugfix, yourself. More information 
about accessing the source trees is available at
    http://www.mysql.com/doc/en/Installing_source_tree.html
[18 Oct 2004 11:38] Denitsa Pelova
Hello Alexander, 
actualy I never wanted to use utf-8, I did it because the latin1 didn't work. 
Before that I was working with the default character-set/collation but because 
of the problem and after the bug report, Miguel Solorzano answered to me and 
proposed to check bug #6043 if it is the same as mine, where utf8 was used, 
and then you answered me with explanations on how to do it with utf-8.  
I will be happy to know how it works with latin1. Now I have configured my 
my.cnf file so that the mysqld and mysql are always started using utf8. That 
way I don't have to change the CREATE TABLE statements one by one, because 
none of them specifies character set, but they just use the default. 
Regards, 
Denitsa