Bug #37571 | Inconsistence in Danish collations | ||
---|---|---|---|
Submitted: | 22 Jun 2008 17:26 | Modified: | 23 Jun 2008 20:36 |
Reporter: | Peter Laursen (Basic Quality Contributor) | Email Updates: | |
Status: | Verified | Impact on me: | |
Category: | MySQL Server: Charsets | Severity: | S4 (Feature request) |
Version: | 4.1,5.0,5.1,6.0 | OS: | Any |
Assigned to: | Assigned Account | CPU Architecture: | Any |
Tags: | qc |
[22 Jun 2008 17:26]
Peter Laursen
[22 Jun 2008 17:47]
Peter Laursen
I meant And if such problems *ONLY* occur with Danish this is ...
[22 Jun 2008 18:51]
Peter Laursen
This is an official Danish resource: http://www.dsn.dk/nfs/2002-3.htm So - my mistake - it was actually already in 1947 that 'å' officially replaced 'aa'. But 1st edition after that of the official Danish language dictionary came in 1953 I believe!
[22 Jun 2008 19:07]
Peter Laursen
more inconsistencies! 'ä' and 'ö' are in Danish alphabetized as 'æ' and 'ø'. (æ and ø are barely the Danis glyphs for the same phonems that in German and Swedish are written ä and ö). but set names utf8; select 'æ' = 'ä' collate utf8_danish_ci; -- 1 select 'ø' = 'ö' collate utf8_danish_ci; -- 1 set names latin1; select 'æ' = 'ä' collate latin1_danish_ci; -- 0 select 'ø' = 'ö' collate latin1_danish_ci; -- 0
[22 Jun 2008 22:20]
Trudy Pelzer
It is true that sometimes two collations with similar names have slightly different rules. We are aware of this long-standing situation. I have changed the Severity to S4 (Feature Request). See also MySQL Character Set Forums thread "Danish/Norwegian collation and 'aa' = 'å' when sorting" http://forums.mysql.com/read.php?103,62869,62869#msg-62869 and "Sorting norwegian names problem aa, a, b, æ, ø, å... aa is sorted before a, should be like å" http://forums.mysql.com/read.php?103,183102,183102#msg-183102
[23 Jun 2008 12:37]
Susanne Ebrecht
Peter, many thanks for writing a bug report. I will analyse this. One of topic comment here: 2: 'Aalborg' = 'Ålborg' but 'Aachen' != 'Åchen' å is pronounced as a kind of "o" and Aachen in lower German (and in slang) is "Oche". But of course you are right, the town Aachen never would accept that it's name is written with 'Å'.
[23 Jun 2008 13:36]
Susanne Ebrecht
Verified as described on bzr tree of 5.0, 5.1 and 6.0. I also tested utf8_swedish_ci, latin1_swedish_ci, latin1_german1_ci, latin1_german2_ci and utf8_unicode_ci. This non consequent behaviour is only in latin1_danish_ci and utf8_danish_ci. Latin1 and UTF8 collation with name: language[X]_ci should behave the same way for the same letters. In my eyes here is a bug either in latin1_danish_ci or in utf8_danish_ci. Danish sorting rule is: 'å' should be after the alphabet ... means: z, æ, ø, å. Also 'å' <=> aa. In special names like Aachen, Saar, Maas, Aaltje (a Frisian female pre name) or Aaron the 'aa' won't be replaced with 'å'. For sorting 'aa' will be sorted together with 'å' after 'å' means i.e.: Åge, Aage, Åmund, Aamund, ... So this means the bug here is in latin1_danish_ci.
[23 Jun 2008 13:38]
Peter Laursen
I think a *fully satisfactory* solution for the 'aa'/'å' problems in Danish (and Norwegian?) is hard to achive. Only there should be collations so that users have the choice! But I think it is a plain bug that ('ä' != 'æ') and and ('ö' != 'ø') in one of the collations!
[23 Jun 2008 13:39]
Peter Laursen
@Suzanne you should check with Norwegian collations too, I think!
[23 Jun 2008 13:47]
Susanne Ebrecht
Peter, I couldn't find a separate Norwegian collation in MySQL 5.0. And yes, this is already on our todo list together with German2 UTF8 collation and lots of others.
[23 Jun 2008 20:11]
Peter Laursen
I found some more fun stuff by experimenting with some characters that are more 'exotic' to Danish! -- latin1 examples set names utf8; select 'ü' = 'y' collate utf8_danish_ci; -- 1 select 'ñ' = 'n' collate utf8_danish_ci; -- 1 -- but now the inconsequence set names latin1; select 'ü' = 'y' collate latin1_danish_ci; -- 0 select 'ñ' = 'n' collate latin1_danish_ci; -- 0 -- some latin2 characters -- hazek (Czech/Slovak etc.) and polish 'ł' are treated inconsequently select 'ž' = 'z' collate utf8_danish_ci; -- 1 ; select 'ř' = 'r' collate utf8_danish_ci; -- 1 select 'ł' = 'l' collate utf8_danish_ci; -- 0 (I know that Polish 'ł' is basically comparable to latin1 'w' (why not let 'ł' = 'w' then, if it cannot be = 'l' ?) - and I also know the Czech letters. I believe that 'ř' is so much different from 'r' that equalisation makes no sense. If the argument that 'ř' = 'r' is for caligraphical/typograchical reasons then also 'ł' = 'l'! And if phonetics matters then 'ł' = 'w' and 'ř' = nothing!! But I realize that this is inherited from utf8_general_ci and utf8_unicode_ci like that. So this is a more general problem! And btw: what about select 'ï' = 'i' collate utf8_turkish_ci; -- returns 1 What do Turkish people say about that? :-)
[23 Jun 2008 20:36]
Peter Laursen
Issue with 'ï' in Turkish reported here: http://bugs.mysql.com/bug.php?id=37595
[24 Jun 2008 10:21]
Susanne Ebrecht
Peter, many thanks for your comment here. Because of all this different country rules for sorting, we have to treat every collation separate. So, it was ok, that you opened a second bug for Turkish. This bug here only will look for Danish. I already gave my colleagues links where they can read more about Danish rules. Unfortunately, I only found description in Danish and in German not in English. That's why I didn't add them here public. The most necessary stuff for us I translated and pasted into one of my above comments.
[8 Jan 2010 21:12]
Peter Gulutzan
WL#5213 Danish collation http://forge.mysql.com/worklog/task.php?id=5213