Bug #57765 Sorting utf8 with Norwegian characters is in wrong order
Submitted: 27 Oct 2010 10:15 Modified: 28 Oct 2010 7:55
Reporter: Christoffer Hafsahl Email Updates:
Status: Not a Bug Impact on me:
None 
Category:MySQL Server: Charsets Severity:S3 (Non-critical)
Version:5.1.41, 5.1.52 OS:Linux (From package repository)
Assigned to: CPU Architecture:Any
Triage: Needs Triage: D2 (Serious)

[27 Oct 2010 10:15] Christoffer Hafsahl
Description:
Sorting in tables with Norwegian / Danish characters doesn't order the results properly for words containing æ, ø, å or ö. By Norwegian standards ö is treated the same as ø, aa is the same as å (for Norwegian words), and the order is after z: æøå. 

I'm not sure if ö is listed after o or after ø, as this is a letter never used in Norway except for foreign placenames (where it is usually substituted with ø).

Å replaced Aa in 1917, and is no longer in use for anything other than names. 

How to repeat:
All done from a linux terminal over ssh from another linux box: 

CREATE DATABASE test CHARACTER SET utf8 COLLATE utf8_danish_ci;
CREATE TABLE TestList( id int auto_increment primary key, text varchar(255)) ENGINE = MyISAM;

INSERT INTO TestList (text) VALUES ('Anders'), ('Eivind'), ('Ole'), ('Martin'), ('Aase'), ('Åsmund'), ('Øystein'), ('Ægir');

mysql> select * from TestList ORDER BY text;
+----+----------+
| id | text     |
+----+----------+
|  7 | Øystein  |
|  6 | Åsmund   |
|  8 | Ægir     |
|  1 | Anders   |
|  2 | Eivind   |
|  4 | Martin   |
|  3 | Ole      |
|  5 | Aase     |
+----+----------+
8 rows in set (0.00 sec)

Suggested fix:
Not sure. This is more or less a duplicate of http://bugs.mysql.com/bug.php?id=37571, although that deals more specifically with Danish. 

"But current latin1_danish_ci is actually 'latin1_modern_danish_ci' and current
utf8_danish_ci is 'utf8_traditional_danish_ci'" (see linked bug)

I'm in no way an expert on the languages, but AFAIK we don't have a modern/traditional way of sorting/spelling words. The norwegian language has evolved over the past ~100 years, but I don't think we have a traditional/modern language, other than different spellings. If Danish differs too much from Norwegian, one option might be to create a utf8_norwegian collation.
[27 Oct 2010 11:04] Peter Laursen
I am the reporter of the other bug.  But here I think you forgot a USE statement and probably created the table in another database than where you thought.

With MySQL 5.1.51 I get: 

CREATE DATABASE test2 CHARACTER SET utf8 COLLATE utf8_danish_ci;
USE test2;  -- <-- note this
CREATE TABLE TestList( id INT AUTO_INCREMENT PRIMARY KEY, TEXT VARCHAR(255)) ENGINE = MYISAM;

INSERT INTO TestList (TEXT) VALUES ('Anders'), ('Eivind'), ('Ole'), ('Martin'), ('Aase'),('Åsmund'), ('Øystein'), ('Ægir');

SELECT * FROM TestList ORDER BY TEXT;
/* returns

    id  text    
------  --------
     1  Anders  
     2  Eivind  
     4  Martin  
     3  Ole     
     8  Ægir   
     7  Øystein
     5  Aase    
     6  Åsmund 
 */

This is OK.  But not OK if 'Aase' was 'Aaron'.  That was my point!
[27 Oct 2010 11:08] Peter Laursen
Besides I have commented a few other places that the only solution with the ambiguity of 'aa' in modern (practical) Danish would be a (new type of) collation based on a dictionary lookup.
[27 Oct 2010 11:13] Christoffer Hafsahl
The databases were changed correctly. Anything else that might be wrong? Any changes between 5.1.41 (mine) and 5.1.51 (yours) that might be connected to this in any way?

mysql> use test;
Database changed
mysql> select * from TestList order by text;
+----+----------+
| id | text     |
+----+----------+
|  7 | Øystein |
|  9 | Öyvind  |
|  6 | Åsmund  |
|  8 | Ægir    |
|  1 | Anders   |
|  2 | Eivind   |
|  4 | Martin   |
|  3 | Ole      |
|  5 | Aase     |
+----+----------+
9 rows in set (0.00 sec)

As for the 'aa' issue; this will be difficult to solve in a way that will be correct in all situations, but if nothing else æøå should be in the right order.
[27 Oct 2010 11:52] Peter Laursen
well .. it can be a bug in 5.1.41 or a packaging issue .. no clue. 

From where exactly did you get your package?
Anyway I'll let MySQL people comment on this
[27 Oct 2010 11:57] Valeriy Kravchuk
This is what I've got on Mac OS X:

macbook-pro:5.1 openxs$ bin/mysql -uroot test
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 2
Server version: 5.1.52-debug Source distribution

Copyright (c) 2000, 2010, Oracle and/or its affiliates. All rights reserved.
This software comes with ABSOLUTELY NO WARRANTY. This is free software,
and you are welcome to modify and redistribute it under the GPL v2 license

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> CREATE DATABASE test2 CHARACTER SET utf8 COLLATE utf8_danish_ci;
Query OK, 1 row affected (0.00 sec)

mysql> use test2;
Database changed
mysql> CREATE TABLE TestList( id INT AUTO_INCREMENT PRIMARY KEY, TEXT VARCHAR(255)) ENGINE =
    -> MYISAM;
Query OK, 0 rows affected (0.07 sec)

mysql> INSERT INTO TestList (TEXT) VALUES ('Anders'), ('Eivind'), ('Ole'), ('Martin'),
    -> ('Aase'),('Åsmund'), ('Øystein'), ('Ægir');
Query OK, 8 rows affected (0.00 sec)
Records: 8  Duplicates: 0  Warnings: 0

mysql>  select * from TestList ORDER BY text;
+----+----------+
| id | TEXT     |
+----+----------+
|  7 | Øystein |
|  6 | Åsmund  |
|  8 | Ægir    |
|  1 | Anders   |
|  2 | Eivind   |
|  4 | Martin   |
|  3 | Ole      |
|  5 | Aase     |
+----+----------+
8 rows in set (0.00 sec)
[27 Oct 2010 11:58] Christoffer Hafsahl
The package came from the repository for Ubuntu Lucid. 

Package: mysql-server
Status: installert
Version: 5.1.41-3ubuntu12.6
[27 Oct 2010 12:57] Peter Gulutzan
Please say
SELECT id, HEX(TEXT) FROM TestList;

I expect you will find from the encoding
that the contents of text are not the
UTF8 values for Øystein etc.

It has to be UTF8 in the client,
in the server, and all the way between.
[27 Oct 2010 13:03] Valeriy Kravchuk
This is what I have:

mysql> use test2;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> show variables like 'char%';
+--------------------------+---------------------------------------------+
| Variable_name            | Value                                       |
+--------------------------+---------------------------------------------+
| character_set_client     | utf8                                        |
| character_set_connection | utf8                                        |
| character_set_database   | utf8                                        |
| character_set_filesystem | binary                                      |
| character_set_results    | utf8                                        |
| character_set_server     | latin1                                      |
| character_set_system     | utf8                                        |
| character_sets_dir       | /Users/openxs/dbs/5.1/share/mysql/charsets/ |
+--------------------------+---------------------------------------------+
8 rows in set (0.01 sec)

mysql> exit;
Bye
macbook-pro:5.1 openxs$ echo $LANG
uk_UA.UTF-8
macbook-pro:5.1 openxs$ bin/mysql -uroot test2
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 4
Server version: 5.1.52-debug Source distribution

Copyright (c) 2000, 2010, Oracle and/or its affiliates. All rights reserved.
This software comes with ABSOLUTELY NO WARRANTY. This is free software,
and you are welcome to modify and redistribute it under the GPL v2 license

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> SELECT id, HEX(TEXT) FROM TestList;
+----+----------------------+
| id | HEX(TEXT)            |
+----+----------------------+
|  1 | 416E64657273         |
|  2 | 456976696E64         |
|  3 | 4F6C65               |
|  4 | 4D617274696E         |
|  5 | 41617365             |
|  6 | C383E280A6736D756E64 |
|  7 | C383CB9C79737465696E |
|  8 | C383E280A0676972     |
+----+----------------------+
8 rows in set (0.00 sec)

mysql>  select *, hex(text) from TestList ORDER BY text;+----+----------+----------------------+| id | TEXT     | hex(text)            |
+----+----------+----------------------+
|  7 | Øystein | C383CB9C79737465696E |
|  6 | Åsmund  | C383E280A6736D756E64 |
|  8 | Ægir    | C383E280A0676972     |
|  1 | Anders   | 416E64657273         |
|  2 | Eivind   | 456976696E64         |
|  4 | Martin   | 4D617274696E         |
|  3 | Ole      | 4F6C65               |
|  5 | Aase     | 41617365             |
+----+----------+----------------------+
8 rows in set (0.00 sec)
[27 Oct 2010 13:20] Peter Laursen
I doubt that this is the issue.  Characters are right - only sorting is wrong.
[27 Oct 2010 13:27] Peter Laursen
I tried a 5.1.36 server on RHEL5 linux.  Also here I get

SELECT * FROM testsort ORDER BY txt;
/* returns

    id  text    
------  --------
     1  Anders  
     2  Eivind  
     4  Martin  
     3  Ole     
     8  Ægir   
     7  Øystein
     5  Aase    
     6  Åsmund 

(but I also do SET NAMES UTF8 automatically when connecting with SQLyog)
[27 Oct 2010 13:35] Christoffer Hafsahl
mysql> select *, hex(text) from TestList ORDER BY text;
+----+----------+----------------------+
| id | TEXT     | hex(text)            |
+----+----------+----------------------+
|  7 | Øystein | C383CB9C79737465696E |
|  6 | Åsmund  | C383E280A6736D756E64 |
|  8 | Ægir    | C383E280A0676972     |
|  1 | Anders   | 416E64657273         |
|  2 | Eivind   | 456976696E64         |
|  4 | Martin   | 4D617274696E         |
|  3 | Ole      | 4F6C65               |
|  5 | Aase     | 41617365             |
+----+----------+----------------------+
8 rows in set (0.00 sec)

mysql> show variables like 'char%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | latin1                     |
| character_set_connection | latin1                     |
| character_set_database   | utf8                       |
| character_set_filesystem | binary                     |
| character_set_results    | latin1                     |
| character_set_server     | latin1                     |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+

The hexcodes are identical to what Valeriy posted.
However, doing the following seemingly resolves the issue:

mysql> set character set 'utf8';
Query OK, 0 rows affected (0.00 sec)

mysql> show variables like 'char%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8                       |
| character_set_connection | utf8                       |
| character_set_database   | utf8                       |
| character_set_filesystem | binary                     |
| character_set_results    | utf8                       |
| character_set_server     | latin1                     |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.00 sec)

If I do a select on the table now, all the scandinavian characters are garbled. 
Deleting and reinserting appears to fix the problem, but now my hex values are different 
from the ones posted by Valeriy:

		 hex(text) from Valeriy	 hex(text) from me
+----+----------+----------------------+
|  7 | Øystein | C383CB9C79737465696E  | C39879737465696E
|  6 | Åsmund  | C383E280A6736D756E64  | C385736D756E64
|  8 | Ægir    | C383E280A0676972      | C386676972
|  1 | Anders   | 416E64657273         | 416E64657273
|  2 | Eivind   | 456976696E64         | 456976696E64
|  4 | Martin   | 4D617274696E         | 4D617274696E
|  3 | Ole      | 4F6C65               | 4F6C65
|  5 | Aase     | 41617365             | 41617365
+----+----------+----------------------+

I'd say that it's obvious that I'm doing something wrong here, but what? Except for the wrong sort order, I haven't had any encoding related issues with this setup.
[27 Oct 2010 13:41] Valeriy Kravchuk
Somehow I wrongly assumed I had utf8 everywhere initially. I was wrong. This is how it should be done:

macbook-pro:5.1 openxs$ bin/mysql -uroot test2
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 5
Server version: 5.1.52-debug Source distribution

Copyright (c) 2000, 2010, Oracle and/or its affiliates. All rights reserved.
This software comes with ABSOLUTELY NO WARRANTY. This is free software,
and you are welcome to modify and redistribute it under the GPL v2 license

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> show variables like 'char%';
+--------------------------+---------------------------------------------+
| Variable_name            | Value                                       |
+--------------------------+---------------------------------------------+
| character_set_client     | latin1                                      |
| character_set_connection | latin1                                      |
| character_set_database   | utf8                                        |
| character_set_filesystem | binary                                      |
| character_set_results    | latin1                                      |
| character_set_server     | latin1                                      |
| character_set_system     | utf8                                        |
| character_sets_dir       | /Users/openxs/dbs/5.1/share/mysql/charsets/ |
+--------------------------+---------------------------------------------+
8 rows in set (0.00 sec)

mysql> set names utf8;
Query OK, 0 rows affected (0.00 sec)

mysql> show variables like 'char%';
+--------------------------+---------------------------------------------+
| Variable_name            | Value                                       |
+--------------------------+---------------------------------------------+
| character_set_client     | utf8                                        |
| character_set_connection | utf8                                        |
| character_set_database   | utf8                                        |
| character_set_filesystem | binary                                      |
| character_set_results    | utf8                                        |
| character_set_server     | latin1                                      |
| character_set_system     | utf8                                        |
| character_sets_dir       | /Users/openxs/dbs/5.1/share/mysql/charsets/ |
+--------------------------+---------------------------------------------+
8 rows in set (0.00 sec)

mysql> drop table TestList;
Query OK, 0 rows affected (0.00 sec)

mysql> CREATE TABLE TestList( id int auto_increment primary key, text varchar(255)) ENGINE =
    -> MyISAM;
Query OK, 0 rows affected (0.05 sec)

mysql> 
mysql> INSERT INTO TestList (text) VALUES ('Anders'), ('Eivind'), ('Ole'), ('Martin'), ('Aase'),
    -> ('Åsmund'), ('Øystein'), ('Ægir');
Query OK, 8 rows affected (0.00 sec)
Records: 8  Duplicates: 0  Warnings: 0

mysql> 
mysql> select * from TestList ORDER BY text;
+----+----------+
| id | text     |
+----+----------+
|  1 | Anders   |
|  2 | Eivind   |
|  4 | Martin   |
|  3 | Ole      |
|  8 | Ægir    |
|  7 | Øystein |
|  5 | Aase     |
|  6 | Åsmund  |
+----+----------+
8 rows in set (0.00 sec)
[28 Oct 2010 7:55] Christoffer Hafsahl
Using SET NAMES utf8 also works; everything is sorted as it should although data had to be reinserted. 

Is there any way to simplify this for others using precompiled packages? I'd imagine I'm not the only one using Ubuntu, Debian, or any other OS with a package system where this might be a problem. 

Could it be possible to force the character sets for the client, connection and results to be the same charset as the database, either by default (probably not a good idea) or by a config switch?
[28 Oct 2010 11:34] Alexander Barkov
Client side character set auto-detection was implemented in 5.5:

This is am excerpt from:
http://dev.mysql.com/doc/refman/5.5/en/charset-connection.html

The MySQL client programs mysql, mysqladmin, mysqlcheck, mysqlimport, and mysqlshow  determine the default character set to use as follows:
...
      The programs can autodetect which character set to use based on the operating system setting, such as the value of the LANG or LC_ALL locale environment variable on Unix systems or the code page setting on Windows systems. For systems on which the locale is available from the OS, the client uses it to set the default character set rather than using the compiled-in default. For example, setting LANG to ru_RU.KOI8-R causes the koi8r character set to be used. Thus, users can configure the locale in their environment for use by MySQL clients.

In the versions before 5.5 please use either of these ways:

1. SET NAMES utf8;

2. mysql --default-character-set=utf8

3. mysql.ini:
[client]
default-character-set=utf8