MySQL Bugs: #37571: Inconsistence in Danish collations

Bug #37571	Inconsistence in Danish collations
Submitted:	22 Jun 2008 17:26	Modified:	23 Jun 2008 20:36
Reporter:	Peter Laursen (Basic Quality Contributor)	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: Charsets	Severity:	S4 (Feature request)
Version:	4.1,5.0,5.1,6.0	OS:	Any
Assigned to:	Assigned Account	CPU Architecture:	Any
Tags:	qc

Description:
We come from here: http://bugs.mysql.com/37555

The latin1 Danish collation and the unicode Danish collations treat the double character sequence 'aa' differently.

How to repeat:
set names utf8;
select 'aa' > 'bb' collate utf8_general_ci; -- returns 0
select 'aa' > 'bb' collate utf8_danish_ci; -- returns 1

set names latin1;
select 'aa' > 'bb' collate latin1_swedish_ci; -- returns 0
select 'aa' > 'bb' collate latin1_danish_ci; -- returns 0

Suggested fix:
I have written a few times before here that with the Danish language reform of 1953 'aa' does no mean 'å' anymore in *official Danish language*.

In *practical Danish language* this is hard to handle, though. Because the old form is still common in names (person names in particular - but also geographcial names). But not always!

Examples:
1: 'Aage' = 'Åge', 'Claas' != 'Clås'
2: 'Aalborg' = 'Ålborg' but 'Aachen' != 'Åchen'

There probably is no other perfect solution than implemeting a collation that uses a dictionary lookup (from some resource using regular expressions for instance). And if such problems occur with Danish this is likely 'overkill' (at least in the eyes of non-Danish people!).

But current latin1_danish_ci is actually 'latin1_modern_danish_ci' and current utf8_danish_ci is 'utf8_traditional_danish_ci' This is inconsistent and will cause issues when users are 'upgrading' their applciations from latin1/ANSI to Unicode.

At least they should be consistent and 'traditional' vs. 'modern' collations should be available for Danish so that this can be managed.

I meant

And if such problems *ONLY* occur with Danish this is ...

This is an official Danish resource:
http://www.dsn.dk/nfs/2002-3.htm

So - my mistake - it was actually already in 1947 that 'å' officially replaced 'aa'.

But 1st edition after that of the official Danish language dictionary came in 1953 I believe!

more inconsistencies!

'ä' and 'ö' are in Danish alphabetized as 'æ' and 'ø'.
(æ and ø are barely the Danis glyphs for the same phonems that in German and Swedish are written ä and ö).

but 

set names utf8;
select 'æ' = 'ä' collate utf8_danish_ci; -- 1
select 'ø' = 'ö' collate utf8_danish_ci; -- 1

set names latin1;
select 'æ' = 'ä' collate latin1_danish_ci; -- 0
select 'ø' = 'ö' collate latin1_danish_ci; -- 0

It is true that sometimes two collations with
similar names have slightly different rules.
We are aware of this long-standing situation.

I have changed the Severity to S4 (Feature Request).

See also MySQL Character Set Forums thread
"Danish/Norwegian collation and 'aa' = 'å' when sorting"
http://forums.mysql.com/read.php?103,62869,62869#msg-62869
and
"Sorting norwegian names problem aa, a, b, æ, ø, å... aa is
sorted before a, should be like å"
http://forums.mysql.com/read.php?103,183102,183102#msg-183102

Peter,

many thanks for writing a bug report.

I will analyse this.

One of topic comment here:
2: 'Aalborg' = 'Ålborg' but 'Aachen' != 'Åchen'

å is pronounced as a kind of "o" and Aachen in lower German (and in slang) is "Oche". 
But of course you are right, the town Aachen never would accept that it's name is written with 'Å'.

Verified as described on bzr tree of 5.0, 5.1 and 6.0.

I also tested utf8_swedish_ci, latin1_swedish_ci, latin1_german1_ci, latin1_german2_ci and utf8_unicode_ci.

This non consequent behaviour is only in latin1_danish_ci and utf8_danish_ci.

Latin1 and UTF8 collation with name: language[X]_ci should behave the same way for the same letters.

In my eyes here is a bug either in latin1_danish_ci or in utf8_danish_ci.

Danish sorting rule is: 'å' should be after the alphabet ... means: z, æ, ø, å.
Also 'å' <=> aa.
In special names like Aachen, Saar, Maas, Aaltje (a Frisian female pre name) or Aaron the 'aa' won't be replaced with 'å'.

For sorting 'aa' will be sorted together with 'å' after 'å' means i.e.:
Åge, Aage, Åmund, Aamund, ...

So this means the bug here is in latin1_danish_ci.

I think a *fully satisfactory* solution for the 'aa'/'å' problems in Danish (and Norwegian?) is hard to achive.  Only there should be collations so that users have the choice!

But I think it is a plain bug that ('ä' != 'æ') and and ('ö' != 'ø') in one of the collations!

@Suzanne

you should check with Norwegian collations too, I think!

Peter,

I couldn't find a separate Norwegian collation in MySQL 5.0.

And yes, this is already on our todo list together with German2 UTF8 collation and lots of others.

I found some more fun stuff by experimenting with some characters that are more 'exotic' to Danish!

-- latin1 examples
set names utf8;
select 'ü' = 'y' collate utf8_danish_ci; -- 1
select 'ñ' = 'n' collate utf8_danish_ci; -- 1

-- but now the inconsequence
set names latin1;
select 'ü' = 'y' collate latin1_danish_ci; -- 0
select 'ñ' = 'n' collate latin1_danish_ci; -- 0

-- some latin2 characters
-- hazek (Czech/Slovak etc.) and polish 'ł' are treated inconsequently
select 'ž' = 'z' collate utf8_danish_ci; -- 1 ;
select 'ř' = 'r' collate utf8_danish_ci; -- 1
select 'ł' = 'l' collate utf8_danish_ci; -- 0

(I know that Polish 'ł' is basically comparable to latin1 'w' (why not let 'ł' = 'w' then, if it cannot be = 'l' ?) - and I also know the Czech letters. I believe that 'ř' is so much different from 'r' that equalisation makes no sense.  

If the argument that 'ř' = 'r' is for caligraphical/typograchical reasons then also 'ł' = 'l'!  And if phonetics matters then 'ł' = 'w' and 'ř' = nothing!!
But I realize that this is inherited from utf8_general_ci and utf8_unicode_ci like that. So this is a more general problem!

And btw: what about select 'ï' = 'i' collate utf8_turkish_ci; -- returns 1
What do Turkish people say about that? :-)

Issue with 'ï' in Turkish reported here: 
http://bugs.mysql.com/bug.php?id=37595

Peter,

many thanks for your comment here.

Because of all this different country rules for sorting, we have to treat every collation separate.

So, it was ok, that you opened a second bug for Turkish.

This bug here only will look for Danish.

I already gave my colleagues links where they can read more about Danish rules.

Unfortunately, I only found description in Danish and in German not in English. 
That's why I didn't add them here public.

The most necessary stuff for us I translated and pasted into one of my above comments.

WL#5213 Danish collation
http://forge.mysql.com/worklog/task.php?id=5213