Bug #37571 Inconsistence in Danish collations
Submitted: 22 Jun 2008 17:26 Modified: 23 Jun 2008 20:36
Reporter: Peter Laursen (Basic Quality Contributor) Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: Charsets Severity:S4 (Feature request)
Version:4.1,5.0,5.1,6.0 OS:Any
Assigned to: Assigned Account CPU Architecture:Any
Tags: qc

[22 Jun 2008 17:26] Peter Laursen
Description:
We come from here: http://bugs.mysql.com/37555

The latin1 Danish collation and the unicode Danish collations treat the double character sequence 'aa' differently.

How to repeat:
set names utf8;
select 'aa' > 'bb' collate utf8_general_ci; -- returns 0
select 'aa' > 'bb' collate utf8_danish_ci; -- returns 1

set names latin1;
select 'aa' > 'bb' collate latin1_swedish_ci; -- returns 0
select 'aa' > 'bb' collate latin1_danish_ci; -- returns 0

Suggested fix:
I have written a few times before here that with the Danish language reform of 1953 'aa' does no mean 'å' anymore in *official Danish language*.

In *practical Danish language* this is hard to handle, though.  Because the old form is still common in names (person names in particular - but also geographcial names).  But not always!

Examples: 
1: 'Aage' = 'Åge', 'Claas' != 'Clås'
2: 'Aalborg' = 'Ålborg' but 'Aachen' != 'Åchen'

There probably is no other perfect solution than implemeting a collation that uses a dictionary lookup (from some resource using regular expressions for instance).  And if such problems occur with Danish this is likely 'overkill' (at least in the eyes of non-Danish people!).

But current latin1_danish_ci is actually 'latin1_modern_danish_ci' and current utf8_danish_ci is 'utf8_traditional_danish_ci'  This is inconsistent and will cause issues when users are 'upgrading' their applciations from latin1/ANSI to Unicode.

At least they should be consistent and 'traditional' vs. 'modern' collations should be available for Danish so that this can be managed.
[22 Jun 2008 17:47] Peter Laursen
I meant

And if such problems *ONLY* occur with Danish this is ...
[22 Jun 2008 18:51] Peter Laursen
This is an official Danish resource:
http://www.dsn.dk/nfs/2002-3.htm

So - my mistake - it was actually already in 1947 that 'å' officially replaced 'aa'.

But 1st edition after that of the official Danish language dictionary came in 1953 I believe!
[22 Jun 2008 19:07] Peter Laursen
more inconsistencies!

'ä' and 'ö' are in Danish alphabetized as 'æ' and 'ø'.
(æ and ø are barely the Danis glyphs for the same phonems that in German and Swedish are written ä and ö).

but 

set names utf8;
select 'æ' = 'ä' collate utf8_danish_ci; -- 1
select 'ø' = 'ö' collate utf8_danish_ci; -- 1

set names latin1;
select 'æ' = 'ä' collate latin1_danish_ci; -- 0
select 'ø' = 'ö' collate latin1_danish_ci; -- 0
[22 Jun 2008 22:20] Trudy Pelzer
It is true that sometimes two collations with
similar names have slightly different rules.
We are aware of this long-standing situation.

I have changed the Severity to S4 (Feature Request).

See also MySQL Character Set Forums thread
"Danish/Norwegian collation and 'aa' = 'å' when sorting"
http://forums.mysql.com/read.php?103,62869,62869#msg-62869
and
"Sorting norwegian names problem aa, a, b, æ, ø, å... aa is
sorted before a, should be like å"
http://forums.mysql.com/read.php?103,183102,183102#msg-183102
[23 Jun 2008 12:37] Susanne Ebrecht
Peter,

many thanks for writing a bug report.

I will analyse this.

One of topic comment here:
2: 'Aalborg' = 'Ålborg' but 'Aachen' != 'Åchen'

å is pronounced as a kind of "o" and Aachen in lower German (and in slang) is "Oche". 
But of course you are right, the town Aachen never would accept that it's name is written with 'Å'.
[23 Jun 2008 13:36] Susanne Ebrecht
Verified as described on bzr tree of 5.0, 5.1 and 6.0.

I also tested utf8_swedish_ci, latin1_swedish_ci, latin1_german1_ci, latin1_german2_ci and utf8_unicode_ci.

This non consequent behaviour is only in latin1_danish_ci and utf8_danish_ci.

Latin1 and UTF8 collation with name: language[X]_ci should behave the same way for the same letters.

In my eyes here is a bug either in latin1_danish_ci or in utf8_danish_ci.

Danish sorting rule is: 'å' should be after the alphabet ... means: z, æ, ø, å.
Also 'å' <=> aa.
In special names like Aachen, Saar, Maas, Aaltje (a Frisian female pre name) or Aaron the 'aa' won't be replaced with 'å'.

For sorting 'aa' will be sorted together with 'å' after 'å' means i.e.:
Åge, Aage, Åmund, Aamund, ...

So this means the bug here is in latin1_danish_ci.
[23 Jun 2008 13:38] Peter Laursen
I think a *fully satisfactory* solution for the 'aa'/'å' problems in Danish (and Norwegian?) is hard to achive.  Only there should be collations so that users have the choice!

But I think it is a plain bug that ('ä' != 'æ') and and ('ö' != 'ø') in one of the collations!
[23 Jun 2008 13:39] Peter Laursen
@Suzanne

you should check with Norwegian collations too, I think!
[23 Jun 2008 13:47] Susanne Ebrecht
Peter,

I couldn't find a separate Norwegian collation in MySQL 5.0.

And yes, this is already on our todo list together with German2 UTF8 collation and lots of others.
[23 Jun 2008 20:11] Peter Laursen
I found some more fun stuff by experimenting with some characters that are more 'exotic' to Danish!

-- latin1 examples
set names utf8;
select 'ü' = 'y' collate utf8_danish_ci; -- 1
select 'ñ' = 'n' collate utf8_danish_ci; -- 1

-- but now the inconsequence
set names latin1;
select 'ü' = 'y' collate latin1_danish_ci; -- 0
select 'ñ' = 'n' collate latin1_danish_ci; -- 0

-- some latin2 characters
-- hazek (Czech/Slovak etc.) and polish 'ł' are treated inconsequently
select 'ž' = 'z' collate utf8_danish_ci; -- 1 ;
select 'ř' = 'r' collate utf8_danish_ci; -- 1
select 'ł' = 'l' collate utf8_danish_ci; -- 0

(I know that Polish 'ł' is basically comparable to latin1 'w' (why not let 'ł' = 'w' then, if it cannot be = 'l' ?) - and I also know the Czech letters. I believe that 'ř' is so much different from 'r' that equalisation makes no sense.  

If the argument that 'ř' = 'r' is for caligraphical/typograchical reasons then also 'ł' = 'l'!  And if phonetics matters then 'ł' = 'w' and 'ř' = nothing!!
But I realize that this is inherited from utf8_general_ci and utf8_unicode_ci like that. So this is a more general problem!

And btw: what about select 'ï' = 'i' collate utf8_turkish_ci; -- returns 1
What do Turkish people say about that? :-)
[23 Jun 2008 20:36] Peter Laursen
Issue with 'ï' in Turkish reported here: 
http://bugs.mysql.com/bug.php?id=37595
[24 Jun 2008 10:21] Susanne Ebrecht
Peter,

many thanks for your comment here.

Because of all this different country rules for sorting, we have to treat every collation separate.

So, it was ok, that you opened a second bug for Turkish.

This bug here only will look for Danish.

I already gave my colleagues links where they can read more about Danish rules.

Unfortunately, I only found description in Danish and in German not in English. 
That's why I didn't add them here public.

The most necessary stuff for us I translated and pasted into one of my above comments.
[8 Jan 2010 21:12] Peter Gulutzan
WL#5213 Danish collation
http://forge.mysql.com/worklog/task.php?id=5213