MySQL Bugs: #21061: Wrong rows returned when using unicode

Bug #21061	Wrong rows returned when using unicode
Submitted:	14 Jul 2006 14:07	Modified:	19 Jul 2006 11:02
Reporter:	[ name withheld ]	Email Updates:
Status:	Not a Bug	Impact on me:	None
Category:	MySQL Server	Severity:	S3 (Non-critical)
Version:	5.0.22	OS:	Windows (Windows XP)
Assigned to:		CPU Architecture:	Any
Tags:	Unicode, utf8

Description:
When a table contains rows with the same text, except for Umlauts, a SELECT returns incorrect data.

How to repeat:
Create a table:

  CREATE TABLE `person` (
    `name` text NOT NULL
  ) ENGINE=InnoDB DEFAULT CHARSET=utf8;

Insert some data:

  INSERT INTO `person` (`name`) VALUES
   ('Hugo'),
   ('Hügö');

Perform the following select:

  SELECT * FROM person WHERE name='Hugo';

Bug: The returned resultset contains Hugo AND Hügü.  The same resultset is returned when searching for 'Hügö', 'Hügo', and 'Hugö'

Thank you for the bug report. Please try to use set names, i.e:

C:\mysql\bin>mysql -uroot test
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 6 to server version: 5.0.23-community-nt

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

mysql> set names cp1251;
Query OK, 0 rows affected (0.11 sec)

mysql> CREATE TABLE `person` (
    ->     `name` text NOT NULL
    ->   ) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Query OK, 0 rows affected (0.61 sec)

mysql> INSERT INTO `person` (`name`) VALUES
    ->    ('Hugo'),
    ->    ('Hügö');
Query OK, 2 rows affected (0.11 sec)
Records: 2  Duplicates: 0  Warnings: 0

mysql> SELECT * FROM person WHERE name='Hugo';
+------+
| name |
+------+
| Hugo |
+------+
1 row in set (0.05 sec)

mysql> SELECT * FROM person WHERE name='Hügö';
+------+
| name |
+------+
| Hügö |
+------+
1 row in set (0.00 sec)

mysql>

Thanks in advance.

MySQL Query Browser Screen Shot

Attachment: QB.png (image/png, text), 10.63 KiB.

Hmm, but I do not want to use cp1251.  I would like to use utf8, and when using
  SET NAMES utf8;
it does not work again.

When I try your suggestion cp1251, MySQL Query Browser shows badly encoded characters (see attachment QB.png).

Am I right, that when using utf8, ü and u are indistinguishable in SELECTs when using MySQL?

Thank you for the feedback. What I tried to show you that isn't a server
issue instead a terminal issue which not accepts the correct encoding for
to send to the server. Below using a terminal which supports utf8 I don't
have problems even I don't need to use set names.

miguel@hegel:~/dbs/5.0> bin/mysql -umiguel -p -h192.168.0.33
Enter password:
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 2 to server version: 5.0.23-community-nt

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

mysql> create database dbw;
Query OK, 1 row affected (0.04 sec)

mysql> use dbw
Database changed
mysql> CREATE TABLE `person` (
    ->     `name` text NOT NULL
    ->   ) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Query OK, 0 rows affected (0.13 sec)

mysql>
mysql> INSERT INTO `person` (`name`) VALUES
    ->    ('Hugo'),
    ->    ('Hügö');
Query OK, 2 rows affected (0.07 sec)
Records: 2  Duplicates: 0  Warnings: 0

mysql> SELECT * FROM person WHERE name='Hugo';
+------+
| name |
+------+
| Hugo |
+------+
1 row in set (0.00 sec)

mysql> SELECT * FROM person WHERE name='Hügö';
+--------+
| name   |
+--------+
| Hügö |
+--------+
1 row in set (0.00 sec)

mysql>

Thank you for your response.

When I use the GUI client "MySQL Query Browser" I get the resultset shown in the attachment QB_utf8.png.  I thought this client is fully utf8 aware.  Was this assumption wrong?

Screen Shot of MySQL Query Browser

Attachment: QB_utf8.png (image/png, text), 11.51 KiB.

Problem solved:

* First, you are right that it is not a bug.
* Second, I do not think the problem was related to using a terminal which is or is not able so send unicode, since the Windows GUI program "MySQL Query Browser" is IMO perfectly unicode aware.  The problem of treating e.g. 'ü' and 'u' as the same characters in SELECTs is related to the COLLATE used.  When SELECTing using a collate, it works ok:

SELECT * FROM person WHERE name='Hugo' COLLATE utf8_swedish_ci;

The same can be achived when giving a collate at table creation:

CREATE TABLE `person` (
  `name` text NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE utf8_swedish_ci;

I found, that the following utf8 collations treat 'ü' and 'u' as different characters:
* utf8_danish_ci
* utf8_estonian_ci
* utf8_hungarian_ci
* utf8_icelandic_ci
* utf8_roman_ci
* utf8_swedish_ci
* utf8_turkish_ci

Note: Please do not think I write this down to teach you.  I am posting my findings so that others who might read this report have a solution.

Kind regards, and be well,
Markus

P.S.: Worth to read: http://dev.mysql.com/doc/refman/5.0/en/charset.html