MySQL Bugs: #57877: Wrong search result on å character with utf-8

Bug #57877	Wrong search result on å character with utf-8
Submitted:	31 Oct 2010 23:07	Modified:	9 Nov 2010 22:39
Reporter:	Vegard Hanssen	Email Updates:
Status:	Unsupported	Impact on me:	None
Category:	MySQL Server: Charsets	Severity:	S3 (Non-critical)
Version:	5.1.51	OS:	Linux
Assigned to:		CPU Architecture:	Any
Tags:	utf8

Description:
If you set a table/database to utf-8 and do a search on character 'å', you will also get the character 'a'.

This is tested on version 5.1.51 and 5.0.77

How to repeat:
mysql> set names utf8;
Query OK, 0 rows affected (0.00 sec)

mysql> create table test(id int auto_increment, vers text, primary key(id)) character set utf8 collate utf8_general_ci;
Query OK, 0 rows affected (0.17 sec)

mysql> insert into test (vers) values ('små');
Query OK, 1 row affected (0.18 sec)

mysql> insert into test (vers) values ('sma');
Query OK, 1 row affected (0.00 sec)

mysql> select * from test where vers like '%små%';
+----+------+
| id | vers |
+----+------+
|  1 | små |
|  2 | sma  |
+----+------+
2 rows in set (0.00 sec)

mysql> select * from test where vers like '%sma%';
+----+------+
| id | vers |
+----+------+
|  1 | små |
|  2 | sma  |
+----+------+
2 rows in set (0.00 sec)

Suggested fix:
Only workaround is to go back to Latin-1 for character set. That works correct.

You will need to use a Danish collation to handle Dansih/Norwegina 'å' properly! Refer:

SELECT STRCMP((SELECT 'a' COLLATE utf8_general_ci),(SELECT 'å' COLLATE utf8_general_ci)); -- '0' means identical
SELECT STRCMP((SELECT 'a' COLLATE utf8_danish_ci),(SELECT 'å' COLLATE utf8_danish_ci)); -- '-1' different
SELECT STRCMP((SELECT 'aa' COLLATE utf8_danish_ci),(SELECT 'å' COLLATE utf8_danish_ci)); -- '0' means identical

The utf8_general_ci and utf8_unicode_ci collations does not distinguish a and å (and also not ä and more accented a-variants):

SELECT STRCMP((SELECT 'ä' COLLATE utf8_general_ci),(SELECT 'a' COLLATE utf8_general_ci)); -- '0' means identical
SELECT STRCMP((SELECT 'ä' COLLATE utf8_danish_ci),(SELECT 'a' COLLATE utf8_danish_ci)); -- '-1' different

Peter
(not a MySQL person)

So you may create the table like this

CREATE TABLE test (
  id INT AUTO_INCREMENT,
  vers TEXT,
  PRIMARY KEY (id)
) CHARACTER SET utf8 COLLATE utf8_danish_ci;

or this

CREATE TABLE `test` (
  `id` INT (11) NOT NULL AUTO_INCREMENT,
  `vers` TEXT CHARACTER SET utf8 COLLATE utf8_danish_ci,
  PRIMARY KEY (`id`)
) ENGINE = INNODB DEFAULT CHARSET = utf8;

or 

write queries with a COLLATE clause(s) as documented here:
http://dev.mysql.com/doc/refman/5.0/en/charset-collate.html

Note that LIKE operator does not support COLLATE. It executes but is ignored.
SELECT 'a' LIKE 'å' COLLATE utf8_danish_ci; -- '0'
(what I think I will report as a bug myself or doc request at least)

Just FYI I have created http://bugs.mysql.com/bug.php?id=57884
(let us see what they say about this!).

one more comment: the reason whh you experience that latin1 solves this for you probably is that the default latin1 collation is 'latin1_swedish_ci'.  'å' is a native Swedish character and is handled by 'latin1_swedish_ci'.  But the default utf8 collation is not 'swedish', it is 'general'.

Can you tell me why a = å in utf8_general_ci? Shouldn't they be different anyway? There is no language where å = a, so why do you get a match?

except for Nordic languages (and maybe a few more).  'å' is a completely *unknown* letter.  It will have to be set equal to something *known* by the actual language.

This is the same in oracle for instance (though Oracle has both more and more well-designed collations than MySQL).  SQL standards define collation like that as far as I understand.

Also in Danish collation 'ü' is set equal to to 'y' the same way because 'ü' is not part of the Danish alphabet - it is a character *unknown by danish*.  But the letter may occasionally occur in foreign names etc, so some 'rule is required'. Another example is 'i' in Turkish. There are two variants: 'i with a dot' and 'i without a dot' ("ı").  Only Turkish collations distinguish those. 

Now 'general' does not really denote any specific language - and I (and others) have complained about incompabilities between default latin1 and utf8 collations before.  

I think MySQL should create a utf8_western_ci collation where every character in ANSI/western codepage is distint (and maybe make it default utf8)

If you wnat every character distinct you may use a binary collation (utf8_bin).  But then also LOWERCASES and UPPERCASES are distinct.

It does appear that the bug reporter
changed from latin1_swedish_ci to
utf8_general_ci collation.
If so, "Not a bug" will be appropriate.

I will still call this a bug, but I'm closing the case soon. I solved my problem with creating a lowercase column where I can do a utf8_bin collation search.

I find it strange that you can't have a single collation for which you can search utf8 columns. What is the point of having utf8 when you can't really populate your column with all the characters in utf8 - if you want to do a search? Someone missed the obvious benefit of creating columns to put text from all languages in.

Thank you for taking the time to write to us, but this is not a bug. Please double-check the documentation available at http://dev.mysql.com/doc/ and the instructions on
how to report a bug at http://bugs.mysql.com/how-to-report.php

See http://www.collation-charts.org/mysql60/mysql604.utf8_general_ci.european.html for details about how comparison works.

You can of course call it a feature and point to the documentation saying a=å, but it's still wrong. a is a different character than å. Which then force me to use a "danish" collation to do correct comparison for norwegian characters. So how are you supposed to compare a table with both norwegian and japanese characters, which is what utf8 was design for? Even norwegian and german text in the same table is a problem.

I guess what you're saying is that by design mysql doesn't support utf8 comparison, which pretty much removes the point of having utf8 columns. Go ahead and point me to the documentation saying that's how it's supposed to work. It's still a big problem for the users. Call it a feature, I call it a bug. utf8 is useless in mysql as it's now. I can store the characters, but I can't rely on user input to a search and expect a correct answer.

After thinking about this for a while, I can't agree with this as not a bug. It's a bug, and the bug is in the general collation. The letter A is never Å. In norwegian we sometimes write AA = Å, but never A=Å. And I can't find another language where A=Å. The ring above A isn't an accent or similar. As you write e è ë etc, you write the same letter e. But that's not the case for the ring above a. è is still an e, but å is not an a.

Normally you transcribe å=aa, æ=ae, ø=oe - all with two letters. And you need both letters to transcribe. 

The general collation should change. Give me one language where å=a and I'll shut up, but if you can't - it's a bug. Either å should be removed from the collation as a character not supported in general_ci, or it should be moved to it's own character. Letting it be where it is, is wrong.

Thank you for taking the time to write to us, but this is not a bug. Please double-check the documentation available at http://dev.mysql.com/doc/ and the instructions on
how to report a bug at http://bugs.mysql.com/how-to-report.php

Please carefully read our manual starting from http://dev.mysql.com/doc/refman/5.1/en/charset.html You also can look for Unicode specifications at http://unicode.org/Public/UCA/latest/allkeys.txt, but take in mind utf8_general_ci is not completely implements it.

I'm sorry to say it Sveta, but you are wrong. There is no place in the unicode definition where Å=A. And pointing to the mysql documentation is just plain stupid in this case. It's the documentation/implementation which is the bug. Your argument is circular.

Setting Å=A for us is the same as setting E=I. You have not pointed to a language, or a documentation (without mysql's own wrong one), saying that Å=A. And it's not. If you do a search in norwegian you will get words like blå=bla => blue=turn the page.

Tell me a language, or even bother pointing to why on earth mysql means Å=A - except it's own flawed one, and I'll stop changing this bug. But I guess you can't. Because it's a bug. Remove the A=Å in utf8_general_ci and you probably solve the problem. If you don't want to remove it, you should try to set E=I and see how many starts complaining.

I guess you just want me to stop using mysql since you apparently don't want to fix bugs. "Hey, we have documented in our program that 2=3, so our calculator is perfectly correct in setting 2*2=6..."

> There is no place in the unicode definition where Å=A.

Here it is:

01CE  ; [.15A3.0020.0002.0061][.0000.0041.0002.030C] # LATIN SMALL LETTER A WITH CARON; QQCM
01CD  ; [.15A3.0020.0008.0041][.0000.0041.0002.030C] # LATIN CAPITAL LETTER A WITH CARON; QQCM

Eh, it say the character _looks_ like an A with a Caron, not that it _is_ the letter A. And of course it does look like an A. But it's _not an A_.

Sorry, wrong quote:

00E5  ; [.15A3.0020.0002.0061][.0000.0043.0002.030A] # LATIN SMALL LETTER A WITH RING ABOVE; QQCM
00C5  ; [.15A3.0020.0008.0041][.0000.0043.0002.030A] # LATIN CAPITAL LETTER A WITH RING ABOVE; QQCM

Yeah, I know - but I didn't bother, since it's the same argument. It still _looks_ like the letter A. It's not the letter A.

The comments in the unicode definition isn't ment to define letters, it ment to be used to understand how the letter looks. Surely you should know this closing this bug all the time?

If the unicode defined the letter B as "LATIN CAPITAL LETTER P WITH A BULB DOWN TO THE RIGHT", you would set B as P utf8_general_ci? Probably sound like a stupid example, but it's not. The letter B and P is pronounced the same in different countries. But the letter A and Å are as B and P different letters, and isn't pronounced the same either.

Thank you for the feedback.

What you offer is implement change in generic collation which will break millions of applications which rely on current behavior while you can correctly use collation designed for particular language. We can not consider this as a bug.

Ah, oh, so now you admit it's a bug, but you can't fix it since it will break other applications. Well, I can agree on that.

But I can't correctly use another collation. I want to store several language in the same row, hm, guess utf8 was designed for that - but I can't. The only solution to my problem is to create a second row/tuple where I lowercase the first one and do a utf8_bin collation on that one. Sounds like a cool solution? Oh, did I mention I actually lost the point of collation in that solution?

So, here we sit with a flawed mysql implementation - and we will never get it fixed, because more and more will "rely" on the wrong output, and the circular argument with breaking many applications will emerge again.

Atleast have the decency to set it to unsupported instead of "not a bug".

The 'allkeys.txt' document which Ms Smirnova
referred to is the Unicode DUCET (Default Unicode
Collation Element Table), which is
(http://www.unicode.org/reports/tr10/) for
"specifying the default collation order for all
Unicode characters". It's not about looks, it's
about collation. The '/tr10' document explains
how to read the table. Ms Smirnova has pointed
to the line which shows the weight for A WITH
RING ABOVE, which is the same as the primary
weight for A.

Sure, utf8_general_ci departs from the Unicode
specification in other respects, but not here.

Ok, let's look at that document. And see how they always say "aa" => "å", and pointing out that sorting å after a would be wrong, since it's actually sorted a..z,æ,ø,å in norwegian.

And let me particular point you to this "Thus for matching it does not matter that "å" would sort after "z" in a Danish tailoring—the only relevant information is that they do not match."

So what they are saying is that å is another character than z, so "THE ONLY RELEVANT INFORMATION IS THAT THEY DO NOT MATCH". Guess what. Å sort after Z, how can you say Å should match to A?

-------------

S2.6 Loop until the end of the string is reached.

    Note: The extra non-starter C needs to be considered in Step 2.1.1 because otherwise irrelevant characters could interfere with matches in the table. For example, suppose that the contraction <a, combining_ring> (= å) is ordered after z. If a string consists of the three characters <a, combining_ring, combining_cedilla>, then the normalized form is <a, combining_cedilla, combining_ring>, which separates the a from the combining_ring. Without considering the extra non-starter, this string would compare incorrectly as after a and not after z.

Language-sensitive searching and matching are closely related to collation. Strings that compare as equal at some strength level should be matched when doing language-sensitive matching. For example, at a primary strength, "ß" would match against "ss" according to the UCA, and "aa" would match "å" in a Danish tailoring of the UCA. The main difference from the collation comparison operation is that the ordering is not important. Thus for matching it does not matter that "å" would sort after "z" in a Danish tailoring—the only relevant information is that they do not match.

The lengths of matching strings may differ: "aa" and "å" would match in Danish.

å  	→  	aa  	Map contractions (a + ring above) to equivalent values

This is not a bug.

utf8_general_ci is not suitable for Norwegian.

Please use utf8_danish_ci instead.

This is the chart for utf8_danish_ci:

http://www.collation-charts.org/mysql60/mysql604.utf8_danish_ci.html

This screenshot demonstrates desired behaviour when using utf8_danish_ci:

mysql> set names utf8;
Query OK, 0 rows affected (0.00 sec)

mysql> create table test(id int auto_increment, vers text, primary key(id)) character set
    -> utf8 collate utf8_danish_ci;
Query OK, 0 rows affected (0.01 sec)

mysql> insert into test (vers) values ('små');
Query OK, 1 row affected (0.00 sec)

mysql> insert into test (vers) values ('sma');
Query OK, 1 row affected (0.00 sec)

mysql> select * from test where vers like '%små%';
+----+------+
| id | vers |
+----+------+
|  1 | små | 
+----+------+
1 row in set (0.00 sec)

mysql> select * from test where vers like '%sma%';
+----+------+
| id | vers |
+----+------+
|  2 | sma  | 
+----+------+
1 row in set (0.00 sec)

@Alexander .. it has confused thousands of users that the default UTF8 and LATIN1 collations behave so much different with accented characters. I believe that utf8_general-ci should never have been the default uft8 collation.  It should have been another (non-existing) collation treating a, á, â, å and aa etc. (as well as u, y and ü and as well as o, oe, ö and ø etc. etc.) the same way as latin1_swedish_ci does.

You asked for this yourself by poor collation design and defaults-selection in my opinion. I think lack of multilingualism in MySQL organization is to blame.

suitable? Can you, since no one else here can, direct me to where a=å is defined anywhere in utf8? I can agree that danish collation is better to use, but you haven't said why a=å is correct in utf8_general_ci. So if a isn't å, it should be removed from utf8_general_ci -> this is actually a bug in utf8_general_ci collation.

I wonder if I should start reading more of the cases in mysql. I find it kinda funny how things are answered here with postulations, not explanations.

@Peter. I agree with you, but you shouldn't call å an accented character. å isn't an accented character to a. å is the same to a as i is to e. A different character. That's why I can't understand why å=a in a collation at all. It shouldn't.

Peter:

utf8_general_ci is good for many word languages.
For example, Spanish, French, Portuguese, Italian, to mention a few.

A collation which would make all languages happy does not exists.
If we make 'a' != 'á' in the default collation, it will fix 
the collation to suite better some languages, but it will break 
the collation for other languages.

utf8_general_ci is a simplified version of utf8_unicode_ci which
is based on the Unicode Collation Algorithm with its default
weight table (so called DUCET). It was obvious an choice
to use something DUCET based for a default Unicode collation.

I agree, it can be confusing that the default collations 
for latin1 and utf8 are different. But I would say it the other
way around: it's not good that latin1_swedich_is is the default
collation for latin1. It could have been something closer to
utf8_general_ci. But that was done for backward compatibility purposes
with pre-4.1 version.

@Vegard.  What I am trying to convey is that for non-Nordic people "a" and "å" are 'accented variants' (as they do not know what "å" is).  Just like Turkish "ı" and "i" are are 'accented variants' for non-Turkish people (as the do not know what "ı" is - for Turkish people they are *definitely not* 'accented variants'!). Continue with western-slaw haczek accents: "ř" and "r" are *definitely not* the same for a Czech/Slovak person but in Danish I will ahve to consider them identical as my language and alphabet does not recognize "ř" at all.

But for me as Danish a =  á,  å = aa, å != a, a != ä, ä = æ = ae. utf8_general_ci fails completely with mapping umlauts to Nordic characters and Nordic's æ, ø and å as well. utf8_general_ci is not even a 'western collation' (it is 'less distinctive' than ANSI/western codepage). utf8_general_ci is completely 'accent-ignorant' (and can only be used by English people and people using non-latin script (at best). 

Collations utf8_western_ci (should match ANSI/western), utf8_central_ci (should match ANSI/centraleuropean), ut8_allaredistinct_ci (names I invented myself) etc. are completely missing in MySQL

@Alexander .. you cannot seriously mean that utf8_general_ci is good for French ?????

>  [9 Nov 22:38] Vegard Hanssen
> 
> suitable? Can you, since no one else here can, direct me to where a=å is defined anywhere
> in utf8? I can agree that danish collation is better to use, but you haven't said why a=å
> is correct in utf8_general_ci. 

utf8_general_ci is a simplified version of utf8_unicode_ci,
which is based on Unicode's default weight table (DUCET version 4.0.0).
Sveta Smirnova earlier posted excerpts from DUCET demonstrating
that 'a' and 'å' are equal in the primary level.

You can find the full DUCET table here:
ftp://ftp.unicode.org/Public/UCA/4.0.0/allkeys-4.0.0.txt

> So if a isn't å, it should be removed from utf8_general_ci
> -> this is actually a bug in utf8_general_ci collation.

This is an intentional design, according to Unicode's DUCET table.

>  [9 Nov 22:44] Vegard Hanssen
> 
> @Peter. I agree with you, but you shouldn't call å an accented character. å isn't an
> accented character to a. å is the same to a as i is to e. A different character. That's
> why I can't understand why å=a in a collation at all. It shouldn't.

This is true, 'å' is a different character in Nordic languages, like Sweidish,
Norwegian, Danish.

However, in other languages 'å' is a just a variant of 'a'.
This is why we have different collations, to address different rules 
in different languages.

>  [9 Nov 23:05] Peter Laursen
> 
> @Alexander .. you cannot seriously mean that utf8_general_ci is good for French ?????

Well, almost it is. French is a kind of special case.
To make French people absolutely happy, we're planning
to add concept of secondary and tertiary weight levels:
http://forge.mysql.com/worklog/task.php?id=896

Peter,

> But for me as Danish a =  á,  å = aa, å != a, a != ä, ä = æ = ae. utf8_general_ci
> fails completely with mapping umlauts to Nordic characters and Nordic's æ, ø and å as
> well. 

This is why we have utf8_danish_ci.

> utf8_general_ci is not even a 'western collation' (it is 'less distinctive' than
> ANSI/western codepage). utf8_general_ci is completely 'accent-ignorant' (and can only be
> used by English people and people using non-latin script (at best). 
> 
> Collations utf8_western_ci (should match ANSI/western), utf8_central_ci (should match
> ANSI/centraleuropean), ut8_allaredistinct_ci (names I invented myself) etc. are
> completely missing in MySQL

It's not always possible to satisfy many languages by a single collation.
For example, how should utf8_western_ci sort 'Ü'?
Near 'Y', to make Sweds happy?
Or near 'U', to make Germans happy?
If you favour one language, you make it bad for another language,
because the rules are very different.

This is why we tend to have separate collations for languages,
like utf8_danish_ci, which should be suitable for Danish.

It's correct that some collations are completely missing in MySQL.
With adding WL#896 (http://forge.mysql.com/worklog/task.php?id=896)
we will be able to provide more collations.

For example, utf8_allaredistinct_ci
(I guess you mean a kind of 'total accent sensitive' here) will be possible.

Please be sure we're working on it.

@Alexander, can you point me to a language which has the character å and see it as a variant of a? With this I mean not a language where they see a character å and would think it's a. You guys have a lot of examples now on other characters. Could be smart to stick to å=a in examples since that's the character behind this case.

Vegard,

>  [9 Nov 23:39] Vegard Hanssen
> 
> @Alexander, can you point me to a language which has the character å and see it as a
> variant of a? With this I mean not a language where they see a character å and would
> think it's a. You guys have a lot of examples now on other characters. Could be smart to
> stick to å=a in examples since that's the character behind this case.

We cannot give examples for all accented letters.
We follow what the Unicode standard says here:
ftp://ftp.unicode.org/Public/UCA/4.0.0/allkeys-4.0.0.txt

It says: 'a' is primarily equal to 'å'

which we perfectly realize is bad for Swedish or Danish or Norwegian,
so we supply utf8_swedish_ci and utf8_danish_ci.

"how should utf8_western_ci sort 'Ü'" 

It should sort ANSI/western characters as per 'extended ASCII'/ANSI character mapping 0-255 simply. It will not be possible to *sort* in a way that satisfies everybody (read 'anybody' if you like) - but it will *distinguish* them all as distinct letters (with "=" , "<>" , "LIKE", "strcmp()" etc.). If people want to sort some specific and other way in some context they may use ORDER BY .. COLLATE .. in their queries. (but there will be non-western accents to consider so maybe not feasible exactly as proposed!).

But (a) 'case insensitive binary collation(s)' (I know the term is a contradiction in itself!) - whether with codepage-scope or global - is not possible to achieve with MySQL.  How will I distinguish 'a' and 'á' unless using a binary collation (what has unwanted LETTERCASE (and more) side-effects)?

I do not think I will post here again.  This is far from a bug report now. I agree with Veland that collation-instrumentation in MySQL utf8 is worse than latin1 (for non-English people) - but I do not agree with his premises that "a" and "å" are different for all people in the world. Those that do not speak Nordic languages and/or do not use a Nordic locale will have to consider them identical (refer also my Turkish and Czech examples).  

BTW: further discussion will need to consider greek, arabic and tibetan (!) letter for the 'a' phonetic variants! Is Greek 'alpha' = latin 'a' in utf8_general_ci .. I don't know really.  Collations with and without this would both make sense in different contexts. But then 'non-alphabetic scripts' would pose a severe problem!

Peter,

WL#896 will make many new useful things possible, including case insensitive 
accent sensitive collations, which will distinguish all accents, and will also
provide much better order than 'case insensitive binary' would do.
We're working in it.

Looking forward to that!