MySQL Bugs: #27345: Incorrect data returned when range-read from utf8_danish

Bug #27345	Incorrect data returned when range-read from utf8_danish_ci indexes
Submitted:	21 Mar 2007 17:03	Modified:	3 Jul 2007 19:38
Reporter:	Domas Mituzas	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Charsets	Severity:	S2 (Serious)
Version:	5.0-bk, 5.1-bk	OS:	Any
Assigned to:	Alexander Barkov	CPU Architecture:	Any
Tags:	character sets

Description:
Entries disappear if index-based LIKE on utf8_danish_ci column is executed

How to repeat:
mysql> create table aaa ( a varchar(255), key aaaaa(a) ) character set utf8 collate utf8_danish_ci;
Query OK, 0 rows affected (0.03 sec)

mysql> set names utf8;
Query OK, 0 rows affected (0.00 sec)

mysql>  insert into aaa values ('åaaaa'),('ååaaa'),('aaaaa');
Query OK, 3 rows affected (0.03 sec)
Records: 3  Duplicates: 0  Warnings: 0

mysql> select * from aaa where a like 'aa%';
+-------+
| a     |
+-------+
| aaaaa | 
+-------+
1 row in set (0.00 sec)

mysql> select * from aaa where a like 'aaa%';
Empty set (0.00 sec)

mysql> select * from aaa where a like 'aaaa%';
+-------+
| a     |
+-------+
| aaaaa | 
+-------+
1 row in set (0.00 sec)

mysql> select * from aaa where a like '%';
+---------+
| a       |
+---------+
| aaaaa   | 
| åaaaa  | 
| ååaaa | 
+---------+
3 rows in set (0.00 sec)

mysql> select * from aaa where a like 'a%';
Empty set (0.00 sec)

mysql> select * from aaa ignore index(aaaaa) where a like 'a%';
+-------+
| a     |
+-------+
| aaaaa | 
+-------+
1 row in set (0.01 sec)

Suggested fix:
http://uncyclopedia.org/wiki/AAAAAAAAA!

Is it really bug?
"å" (U+00A5) is a character of swedish alphabet too, and Domas' test works same with COLLATE utf8_swedish_ci. For collate utf8_general_ci the result is different:

mysql> set names utf8;
Query OK, 0 rows affected (0.00 sec)

mysql> create table t1 (c1 char(1) character set utf8 collate utf8_general_ci);
Query OK, 0 rows affected (0.00 sec)

mysql> insert into t1 values ('å'),('a');
Query OK, 2 rows affected (0.00 sec)
Records: 2  Duplicates: 0  Warnings: 0

mysql> select * from t1 where c1 like 'a';
+------+
| c1   |
+------+
| å   | 
| a    | 
+------+
2 rows in set (0.00 sec)

mysql> alter table t1 modify c1 char(1) character set utf8 collate utf8_swedish_ci;
Query OK, 2 rows affected (0.00 sec)
Records: 2  Duplicates: 0  Warnings: 0

mysql> select * from t1 where c1 like 'a';
+------+
| c1   |
+------+
| a    | 
+------+
1 row in set (0.00 sec)

The biggest problem here apparently is that 'å' is same as 'aa', what makes situations very very ambiguous (like index lookups become failing). 

instead of 'å' being expanded to 'aa', 'aa' is being contracted to 'å', hence string 'aaaaa' is missed by LIKE 'a%' (as the search string is not å%, but resulted index string starts with å).

This could be written off as a feature, though one would have to note, that any string containing double-a will have troubles if looked up in any way where range scan based on single-a is made. 

Oh, and now I'm lost and don't know what would be better way to handle it.

In a discussion between Gleb and Bar and Peter, it was agreed
that this is a bug. There is a comparable situation:
in some German collations, sharp s ('ß') = 'ss', so
there is a precedent. An extract from one of the emails:
"
Now, it is true that LIKE 's%' should not find 'ß'
and we have discussed that often. In order to bore
you, I have copied emails from a recent conversation
about that to the end of this email. But of course
LIKE 's%' should find 'sssss' in German, and it does.

Domas is (I think) (I hope) only complaining that
LIKE 'a%' should find 'aaaaa', and I have to agree
that it should. Using indexes shouldn't change results.

So it's a bug.
"

I have forwarded one of the emails to Domas.

The same problem happens with ucs2_danish_ci

The same problem happens with Czech:

set names utf8;
drop table if exists t1;
create table t1 (
  a varchar(255),
  key a(a)
) character set utf8 collate utf8_czech_ci;
-- Czech rules: & H < ch <<< Ch <<< CH
-- 'ch' is a single letter between 'h' and 'i'
insert into t1 values
('b'),('c'),('d'),('e'),('f'),('g'),('h'),('ch'),('i'),('j');
select * from t1 ignore index(a) where a like 'c%';
select * from t1 where a like 'c%';
drop table t1;

Select with 'ignore index' finds both 'c' and 'ch'.
Select using index finds only 'c' and does not find 'ch', which is wrong.

The same problem happens with Traditional Spanish 'll',
which is a single letter between 'l' and 'm':

set names utf8;
drop table if exists t1;
create table t1 (
  a varchar(255),
  key a(a)
) character set utf8 collate utf8_spanish2_ci;
insert into t1 values ('aaaaa'),('lllll'),('zzzzz');
select a as like_l from t1 where a like 'l%';         -- doesn't return results
select a as like_ll from t1 where a like 'll%';
select a as like_lll from t1 where a like 'lll%';     -- doesn't return results
select a as like_llll from t1 where a like 'llll%';
select a as like_lllll from t1 where a like 'lllll%'; -- doesn't return results
drop table t1;

If you drop index 'a', then all selects return results.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/29815

ChangeSet@1.2499, 2007-06-28 13:34:44+05:00, bar@mysql.com +5 -0
  Bug#27345 Incorrect data returned when range-read from utf8_danish_ci indexes
  Problem: like_range() returned wrong ranges for contractions (like 'ch' in Czech').
  Fix: adding a special code to handle tricky cases:
  - contraction head followed by a wild character
  - full contraction
  - contraction part followed by another contraction part,
    but they are not a contraction together.

Pushed into 5.0.46-rpl
Pushed into 5.1.21-rpl

Pushed into 5.0.46

Pushed into 5.1.21-beta

Noted in 5.0.46, 5.1.21 changelogs.

Changelog entry:

Index-based range reads could fail for comparisons that involved
contraction characters (such as ch in Czech or ll in Spanish).

I have a problem. When I select 'vår' from a database (utf8, utf8_danish_ci) it returns both 'var' and 'vår', which are completely different. Also selection using 'order by', mixes the characters 'a' and 'å'. In the norwegian alphabeth a is the first character and å is the last one. 

Å is sometimes written AA in Norway, e.g. in some people's names, but should be separate. Otherwise AA is almost only used as Å if people don't have norwegian keys on their keyboard. You will never, ever see anyone write AA instead of Å in a newspaper or a book.

And just one 'a' is never the same as 'å', even if sometime names have 'aa' instead of 'å' and people without norwegian keyboards sometimes write 'aa' where it should have been 'å'.

To give an example where I compare to english characters:
If you have a database with product names, and select the product-id for "ball", then you don't want the product-id for "bell" back. It's a huge difference between a ball and a bell. Such cases could happen if 'a' and 'å' are treated as the same character.

The problem I reported, where single 'å' are being mixed with 'a' could be caused by something else. 

When I used a select inside a stored procedure, it only returned 'å' when selecting 'å'.

When I used a "normal" select, it returned three values: 'a', 'à' and 'å'.

All tables, database and server are set to UTF8 and utf8_danish_ci. I also used "SET NAMES 'utf8' COLLATE 'utf8_danish_ci';" to make sure that also the connection used the same settings. I used "show variables like '%coll%';" to check the actual collations and also used "charset utf8" to make sure that utf8 was used from mysql command line. Also got the same results from MySql Query Browser on Linux and also when using Mono (C#) code.

Any ideas why?

Changing the queries by adding 'collate utf8_danish_ci' after all varchar-fields worked, but it shouldn't be necessary when everything else (including the connection) already was configured to use utf_danish_ci.

Hello Jo,

I can't reproduce the problem.
I tried this test script and I believe it return
correct results:

mysql> set names utf8;
Query OK, 0 rows affected (0.00 sec)

mysql> create table t1 (a varchar(1) character set utf8 collate utf8_danish_ci);
Query OK, 0 rows affected (0.02 sec)

mysql> insert into t1 values ('a'),('å'),('z'),('à');
Query OK, 4 rows affected (0.00 sec)
Records: 4  Duplicates: 0  Warnings: 0

mysql> select hex(a), a from t1;
+--------+------+
| hex(a) | a    |
+--------+------+
| 61     | a    |
| C3A5   | å    |
| 7A     | z    |
| C3A0   | à    |
+--------+------+
4 rows in set (0.01 sec)

mysql> select a from t1 order by a;
+------+
| a    |
+------+
| a    |
| à    |
| z    |
| å    |
+------+
4 rows in set (0.00 sec)

mysql> select a from t1 where a='a';
+------+
| a    |
+------+
| a    |
| à    |
+------+
2 rows in set (0.00 sec)

mysql> select a from t1 where a='å';
+------+
| a    |
+------+
| å    |
+------+
1 row in set (0.00 sec)

Please post a script which reproduces wrong behavior.

Thank you for answering so soon, Alexander Barkov.

I have found a solution. The problem was that the collation wasn't set for the column, only the table, server, connection and so on. By using 'alter table' and changing the specific column to use utf8_danish_ci the problem disappeared, and I didn't have to use collate-keyword in the queries.

Using the collate-keyword for varchar/char-columns in the queries prevented the correct index to be used and made the queries extremely slow. After altering the column to the correct collation, I no longer had to use the keyword in the queries to get the correct results.

I get wrong sorting. should be a,b,c,d,...,æ,ø,å... but looks like it sorts, ø,æ,å,a,b,c,d,.....

my database:
mysql> show create database raymonds;
+----------+--------------------------------------------------------------------
----------------------+
| Database | Create Database
                       |
+----------+--------------------------------------------------------------------
----------------------+
| raymonds | CREATE DATABASE `raymonds` /*!40100 DEFAULT CHARACTER SET utf8 COLL
ATE utf8_danish_ci */ |
+----------+--------------------------------------------------------------------
----------------------+
1 row in set (0.00 sec)

My tables are all like this:
mysql> show create table Fylke;
+-------+-----------------------------------------------------------------------
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
---------------+
| Table | Create Table

                |
+-------+-----------------------------------------------------------------------
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
---------------+
| Fylke | CREATE TABLE `Fylke` (
   `fylkesnr` char(2) collate utf8_danish_ci NOT NULL,
   `f_navn` varchar(30) collate utf8_danish_ci default NULL,
   `info` varchar(2500) collate utf8_danish_ci default NULL,
   `internettside` varchar(100) collate utf8_danish_ci default NULL,
   `fylkesmann` varchar(50) collate utf8_danish_ci default NULL,
   `fylkesordforer` varchar(50) collate utf8_danish_ci default NULL,
   PRIMARY KEY  (`fylkesnr`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_danish_ci |
+-------+-----------------------------------------------------------------------
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
---------------+
1 row in set (0.00 sec)

A query gets like this:
mysql> SELECT f_navn FROM Fylke ORDER BY f_navn;
+------------------+
| f_navn           |
+------------------+
| Østfold         |
| Akershus         |
| Aust-Agder       |
| Buskerud         |
| Finnmark         |
| Hedmark          |
| Hordaland        |
| Møre og Romsdal |
| Nord-Trøndelag  |
| Nordland         |
| Oppland          |
| Oslo             |
| Rogaland         |
| Sør-Trøndelag  |
| Sogn og Fjordane |
| Telemark         |
| Troms            |
| Vest-Agder       |
| Vestfold         |
+------------------+
19 rows in set (0.00 sec)

The only way I have found to get correct sorting is to ORDER f_navn COLLATE utf8_bin.
Like this:
mysql> SELECT f_navn FROM Fylke ORDER BY f_navn COLLATE utf8_bin;
+------------------+
| f_navn           |
+------------------+
| Akershus         |
| Aust-Agder       |
| Buskerud         |
| Finnmark         |
| Hedmark          |
| Hordaland        |
| Møre og Romsdal |
| Nord-Trøndelag  |
| Nordland         |
| Oppland          |
| Oslo             |
| Rogaland         |
| Sogn og Fjordane |
| Sør-Trøndelag  |
| Telemark         |
| Troms            |
| Vest-Agder       |
| Vestfold         |
| Østfold         |
+------------------+
19 rows in set (0.00 sec)