Bug #39816 German collation under utf8_unicode_ci is incorrect
Submitted: 2 Oct 2008 16:06 Modified: 6 Oct 2008 11:40
Reporter: Jay Pipes Email Updates:
Status: Duplicate Impact on me:
None 
Category:MySQL Server: Charsets Severity:S3 (Non-critical)
Version:5.0.51 OS:Linux
Assigned to: CPU Architecture:Any
Tags: character sets, collation

[2 Oct 2008 16:06] Jay Pipes
Description:
While doing some research I noticed a possible collation bug for German text.  According to Wikipedia (http://en.wikipedia.org/wiki/Collating_sequence) the following German words:

Arg
Ärgerlich
Arm
Assistant
Aßlar
Assoziation

Should be sorted in that order.  However, MySQL 5.0.51 does not sort correctly:

mysql> CREATE TABLE utf8_tests (
    ->   my_text VARCHAR(100) NOT NULL
    -> ) ENGINE=MyISAM DEFAULT CHARSET=utf8;
Query OK, 0 rows affected (0.01 sec)

mysql> INSERT INTO utf8_tests VALUES
    -> ('Arg')
    -> ,('Ärgerlich')
    -> ,('Arm')
    -> ,('Assistant')
    -> ,('Aßlar')
    -> ,('Assoziation');
Query OK, 6 rows affected (0.00 sec)
Records: 6  Duplicates: 0  Warnings: 0

mysql> SELECT * FROM utf8_tests ORDER BY my_text;
+-------------+
| my_text     |
+-------------+
| Aßlar      | 
| Arg         | 
| Arm         | 
| Assistant   | 
| Assoziation | 
| Ärgerlich  | 
+-------------+
6 rows in set (0.00 sec)

mysql> SELECT * FROM utf8_tests ORDER BY my_text COLLATE utf8_unicode_ci;
+-------------+
| my_text     |
+-------------+
| Ärgerlich  | 
| Aßlar      | 
| Arg         | 
| Arm         | 
| Assistant   | 
| Assoziation | 
+-------------+
6 rows in set (0.01 sec)

mysql> SELECT * FROM utf8_tests ORDER BY my_text COLLATE utf8_general_ci;
+-------------+
| my_text     |
+-------------+
| Aßlar      | 
| Arg         | 
| Arm         | 
| Assistant   | 
| Assoziation | 
| Ärgerlich  | 
+-------------+
6 rows in set (0.00 sec)

How to repeat:
USE test;

CREATE TABLE utf8_tests (
   my_text VARCHAR(100) NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

INSERT INTO utf8_tests VALUES
 ('Arg')
,('Ärgerlich')
,('Arm')
,('Assistant')
,('Aßlar')
,('Assoziation');

SELECT * FROM utf8_tests ORDER BY my_text;
SELECT * FROM utf8_tests ORDER BY my_text COLLATE utf8_general_ci;
SELECT * FROM utf8_tests ORDER BY my_text COLLATE utf8_unicode_ci;

Suggested fix:
Order the collation correctly.  

According to the manual:

"utf8_general_ci also is satisfactory for both German and French, except that ‘ß’ is equal to ‘s’, and not to ‘ss’. If this is acceptable for your application, then you should use utf8_general_ci because it is faster. Otherwise, use utf8_unicode_ci because it is more accurate."

But, as shown above, utf8_unicode_ci is actually not accurate.

The ICU testing site shows the correct order and code points for collation:

http://demo.icu-project.org/icu-bin/locexp?_=de_DE&d_=en&x=col&collation=phonebook

02: Ärgerlich
2c 34 4e 38 34 4e 42 3c 30 3a 01 88 0d 01 85 8f 0c 00
01: Arg
2c 4e 38 01 07 01 8f 06 00
03: Arm
2c 4e 44 01 07 01 8f 06 00
04: Assistant
2c 50 50 3c 50 52 2c 46 52 01 0d 01 8f 0c 00
05: Aßlar
2c 50 50 42 2c 4e 01 85 e5 b1 08 01 8f 89 89 bd 07 00
06: Assoziation
2c 50 50 48 5e 3c 2c 52 3c 48 46 01 0f 01 8f 0e 00
[6 Oct 2008 11:28] Susanne Ebrecht
Not a bug in 5.1:

Explaination: 
German 1: ä = a, ß = s
German 2: ä = ae, ß = ss

utf8_unicode_ci and latin1_german1_ci are sorting the same way for this example
utf8_general_ci and latin1_german2_ci are sorting the same way for this example

Here my test:

Terminal: UTF8

SET NAMES UTF8;

create table t_utf8(v varchar(100))default charset=utf8;
create table t_latin1(v varchar(100))default charset=latin1;

insert into t_utf8 VALUES('Arg'),('Ärgerlich'),('Arm'),('Assistant'),('Aßlar'),('Assoziation');

insert into t_latin1 VALUES('Arg'),('Ärgerlich'),('Arm'),('Assistant'),('Aßlar'),('Assoziation');

Expected German1: Arg, Ärgerlich, Arm, Aßlar, Assistant, Assoziation
Expected German2: Ärgerlich, Arg, Arm, Assistant, Aßlar, Assoziation

select * from t_utf8 order by v COLLATE utf8_unicode_ci;
+-------------+
| v           |
+-------------+
| Arg         | 
| Ärgerlich  | 
| Arm         | 
| Assistant   | 
| Aßlar      | 
| Assoziation | 
+-------------+

select * from t_utf8 order by v COLLATE utf8_general_ci;
+-------------+
| v           |
+-------------+
| Arg         | 
| Ärgerlich  | 
| Arm         | 
| Aßlar      | 
| Assistant   | 
| Assoziation | 
+-------------+

select * from t_latin1 order by v COLLATE latin1_german1_ci;
+-------------+
| v           |
+-------------+
| Arg         | 
| Ärgerlich  | 
| Arm         | 
| Aßlar      | 
| Assistant   | 
| Assoziation | 
+-------------+

select * from t_latin1 order by v COLLATE latin1_german2_ci;
+-------------+
| v           |
+-------------+
| Ärgerlich  | 
| Arg         | 
| Arm         | 
| Assistant   | 
| Aßlar      | 
| Assoziation | 
+-------------+

So nothing to fix for MySQL 5.1. I have to test 5.0.
[6 Oct 2008 11:38] Susanne Ebrecht
Sorry, I was blind.

Collation utf8_general_ci is sorting the ß like German2 but Ä like German1. Anyway it is not recommended in Germany to use utf8_general_ci. It is recommended to use utf8_unicode_ci.

This behaviour already is described here:

http://dev.mysql.com/doc/refman/5.1/en/charset-unicode-sets.html

Curious but true. The collation utf8_unicode_ci is using ß = ss from German 2 rule (DIN 5007-2) and ä=a, ö=o, ü=u from German 1 rule (DIN 5007-1).

So the feature request is to change ä,ö,ü in utf8_unicode_ci to German 2 rules too or make an own utf8_german2_ci.

I think there is already such a feature request.
[6 Oct 2008 11:40] Susanne Ebrecht
This is a duplicate of bug #38758
[6 Oct 2008 12:18] Susanne Ebrecht
This mad mix of German rules also is given at the Unicode description when you don't have own collations for the language.