Bug #12227 select * from x where col like '%ä%' doesn't work with UTF8
Submitted: 27 Jul 2005 20:54 Modified: 29 Jul 2005 6:57
Reporter: Markus Popp Email Updates:
Status: Not a Bug Impact on me:
None 
Category:MySQL Server Severity:S4 (Feature request)
Version:4.1.13, 5.0.10 OS:Windows (Windows, Linux)
Assigned to: CPU Architecture:Any

[27 Jul 2005 20:54] Markus Popp
Description:
This feature request is based on bug report
http://bugs.mysql.com/bug.php?id=12191, which has been marked as 'not a bug':

If I've got a table with charset UTF8, I can't use the LIKE clause for special
characters like e.g. German umlauts.

The server default is utf8 and collation utf8_general_ci.

There's a second problem with PHP (don't know whether it's really a MySQL bug,
but it might relate to this problem). I've tried following script against the
same data in the 'how to repeat' section:

<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" 
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf8" />
</head>
<body>
<?php
$link = mysql_connect('db4free.org', 'xxx', 'yyy');
mysql_select_db('zzz', $link) or die ('Keine Verbindung!');

$sql = "select land, waehrung from waehrungen where land = 'Ägypten' order by
land";

$query = mysql_query($sql, $link);

while ($data = mysql_fetch_row($query))
{
echo $data[0].": ".$data[1]."<br />\n";
}

mysql_close($link);
?>
</body>
</html>

On my local server with PHP 5.0.4 and API client version 4.1.7 I get the correct
result: 'Ägypten'. If I run the same script on my webhoster's server, who uses
PHP 5.0.2 and API client version 4.0.18, I don't get any result. However, I do
get the correct results, if I leave the WHERE clause away. The MySQL server was
in both cases a remote server, running 5.0.10 on Linux.

How to repeat:
Create following table:

CREATE TABLE `waehrungen` (
  `bezeichnung` varchar(50) NOT NULL default '',
  `land` varchar(50) NOT NULL default '',
  `iso` varchar(3) NOT NULL default '',
  `waehrung` varchar(100) NOT NULL default '',
  PRIMARY KEY  (`bezeichnung`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

INSERT INTO `waehrungen` VALUES ('afghanistan', 'Afghanistan', 'AFA',
'Afghani');
INSERT INTO `waehrungen` VALUES ('albania', 'Albanien', 'ALL', 'Lek');
INSERT INTO `waehrungen` VALUES ('egypt', 'Ägypten', 'EGP', 'Ägyptisches
Pfund');
INSERT INTO `waehrungen` VALUES ('estonia', 'Estland', 'EEK', 'Estnische
Krone');
INSERT INTO `waehrungen` VALUES ('ethiopia', 'Äthiopien', 'ETB', 'Birr');
INSERT INTO `waehrungen` VALUES ('euro', 'Europäische Union', 'EUR', 'Euro');
INSERT INTO `waehrungen` VALUES ('fiji', 'Fidschi', 'FJD', 'Fidschi Dollar');

The query

select land, waehrung from waehrungen where land like '%x%' order by land

returns correctly 'Mexiko'. However, the query 

select land, waehrung from waehrungen where land like '%ä%' order by land

returns all rows except the one with 'Fidschi'.

Suggested fix:
Queries with special characters like umlauts should work the same way as with
'normal' characters.

Suggested fix:
Client and server character sets are both set to utf8. In this case, I would expect, that e.g. an 'ü' (umlaut u) and other special characters is treated as such. MySQL seems to treat an 'ü' like a 'u'.

If I've got 2 rows with the values 'Tunesien' and 'Türkei' and I query against 'Tü%', I get both records as result, although I would expect to get only 'Türkei' returned.
[28 Jul 2005 12:14] Marko Mäkelä
Are you sure that the non-ASCII characters in the script are encoded in utf8 and not e.g. latin1? Do something like SELECT HEX('ü') in the script and compare the result against a hex dump (od -t x1) of the script.
[28 Jul 2005 12:25] Markus Popp
I've tested the following in MySQL query browser.

When I make 'show variables', all character settings are 'utf8' and collation is generally 'utf8_general_ci'. Also my 'waehrungen' table is set to utf8, so nothing else should be involved here anywhere.

select hex('ü');

returns 'C3BC'. If I extract the 'ü' from 'Türkei' (from the 'waehrungen' table) and put it into the hex-function:

SELECT hex(substring(land, 2, 1)) FROM waehrungen

I also get 'C3BC'.
[28 Jul 2005 13:04] Marko Mäkelä
If you submit the query to the query browser, will it then work properly? If yes, then it is probable that the character set has been misconfigured on the PHP side. Can you execute those queries in a PHP script?
[28 Jul 2005 15:14] Markus Popp
No, I've tested those examples with QueryBrowser and it returns both rows, although I should only get 'Türkei'.
[29 Jul 2005 6:57] Marko Mäkelä
Sorry, I didn't read the original description carefully enough. It wasn't a character encoding issue after all.

In the default collation of utf8, the umlauted characters are equivalent to the non-umlaut ones. You can force an umlaut-sensitive collation, e.g., like this:

CREATE TABLE `waehrungen` (
  `bezeichnung` varchar(50) NOT NULL default '',
  `land` varchar(50) NOT NULL default '' collate utf8_bin,
  `iso` varchar(3) NOT NULL default '',
  `waehrung` varchar(100) NOT NULL default '',
  PRIMARY KEY  (`bezeichnung`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

See the documentation for possible collations. utf8_bin is most probably not what you really want.