Bug #57765 | Sorting utf8 with Norwegian characters is in wrong order | ||
---|---|---|---|
Submitted: | 27 Oct 2010 10:15 | Modified: | 28 Oct 2010 7:55 |
Reporter: | Christoffer Hafsahl | Email Updates: | |
Status: | Not a Bug | Impact on me: | |
Category: | MySQL Server: Charsets | Severity: | S3 (Non-critical) |
Version: | 5.1.41, 5.1.52 | OS: | Linux (From package repository) |
Assigned to: | CPU Architecture: | Any |
[27 Oct 2010 10:15]
Christoffer Hafsahl
[27 Oct 2010 11:04]
Peter Laursen
I am the reporter of the other bug. But here I think you forgot a USE statement and probably created the table in another database than where you thought. With MySQL 5.1.51 I get: CREATE DATABASE test2 CHARACTER SET utf8 COLLATE utf8_danish_ci; USE test2; -- <-- note this CREATE TABLE TestList( id INT AUTO_INCREMENT PRIMARY KEY, TEXT VARCHAR(255)) ENGINE = MYISAM; INSERT INTO TestList (TEXT) VALUES ('Anders'), ('Eivind'), ('Ole'), ('Martin'), ('Aase'),('Åsmund'), ('Øystein'), ('Ægir'); SELECT * FROM TestList ORDER BY TEXT; /* returns id text ------ -------- 1 Anders 2 Eivind 4 Martin 3 Ole 8 Ægir 7 Øystein 5 Aase 6 Åsmund */ This is OK. But not OK if 'Aase' was 'Aaron'. That was my point!
[27 Oct 2010 11:08]
Peter Laursen
Besides I have commented a few other places that the only solution with the ambiguity of 'aa' in modern (practical) Danish would be a (new type of) collation based on a dictionary lookup.
[27 Oct 2010 11:13]
Christoffer Hafsahl
The databases were changed correctly. Anything else that might be wrong? Any changes between 5.1.41 (mine) and 5.1.51 (yours) that might be connected to this in any way? mysql> use test; Database changed mysql> select * from TestList order by text; +----+----------+ | id | text | +----+----------+ | 7 | Øystein | | 9 | Öyvind | | 6 | Åsmund | | 8 | Ægir | | 1 | Anders | | 2 | Eivind | | 4 | Martin | | 3 | Ole | | 5 | Aase | +----+----------+ 9 rows in set (0.00 sec) As for the 'aa' issue; this will be difficult to solve in a way that will be correct in all situations, but if nothing else æøå should be in the right order.
[27 Oct 2010 11:52]
Peter Laursen
well .. it can be a bug in 5.1.41 or a packaging issue .. no clue. From where exactly did you get your package? Anyway I'll let MySQL people comment on this
[27 Oct 2010 11:57]
Valeriy Kravchuk
This is what I've got on Mac OS X: macbook-pro:5.1 openxs$ bin/mysql -uroot test Reading table information for completion of table and column names You can turn off this feature to get a quicker startup with -A Welcome to the MySQL monitor. Commands end with ; or \g. Your MySQL connection id is 2 Server version: 5.1.52-debug Source distribution Copyright (c) 2000, 2010, Oracle and/or its affiliates. All rights reserved. This software comes with ABSOLUTELY NO WARRANTY. This is free software, and you are welcome to modify and redistribute it under the GPL v2 license Type 'help;' or '\h' for help. Type '\c' to clear the current input statement. mysql> CREATE DATABASE test2 CHARACTER SET utf8 COLLATE utf8_danish_ci; Query OK, 1 row affected (0.00 sec) mysql> use test2; Database changed mysql> CREATE TABLE TestList( id INT AUTO_INCREMENT PRIMARY KEY, TEXT VARCHAR(255)) ENGINE = -> MYISAM; Query OK, 0 rows affected (0.07 sec) mysql> INSERT INTO TestList (TEXT) VALUES ('Anders'), ('Eivind'), ('Ole'), ('Martin'), -> ('Aase'),('Åsmund'), ('Øystein'), ('Ægir'); Query OK, 8 rows affected (0.00 sec) Records: 8 Duplicates: 0 Warnings: 0 mysql> select * from TestList ORDER BY text; +----+----------+ | id | TEXT | +----+----------+ | 7 | Øystein | | 6 | Åsmund | | 8 | Ægir | | 1 | Anders | | 2 | Eivind | | 4 | Martin | | 3 | Ole | | 5 | Aase | +----+----------+ 8 rows in set (0.00 sec)
[27 Oct 2010 11:58]
Christoffer Hafsahl
The package came from the repository for Ubuntu Lucid. Package: mysql-server Status: installert Version: 5.1.41-3ubuntu12.6
[27 Oct 2010 12:57]
Peter Gulutzan
Please say SELECT id, HEX(TEXT) FROM TestList; I expect you will find from the encoding that the contents of text are not the UTF8 values for Øystein etc. It has to be UTF8 in the client, in the server, and all the way between.
[27 Oct 2010 13:03]
Valeriy Kravchuk
This is what I have: mysql> use test2; Reading table information for completion of table and column names You can turn off this feature to get a quicker startup with -A Database changed mysql> show variables like 'char%'; +--------------------------+---------------------------------------------+ | Variable_name | Value | +--------------------------+---------------------------------------------+ | character_set_client | utf8 | | character_set_connection | utf8 | | character_set_database | utf8 | | character_set_filesystem | binary | | character_set_results | utf8 | | character_set_server | latin1 | | character_set_system | utf8 | | character_sets_dir | /Users/openxs/dbs/5.1/share/mysql/charsets/ | +--------------------------+---------------------------------------------+ 8 rows in set (0.01 sec) mysql> exit; Bye macbook-pro:5.1 openxs$ echo $LANG uk_UA.UTF-8 macbook-pro:5.1 openxs$ bin/mysql -uroot test2 Reading table information for completion of table and column names You can turn off this feature to get a quicker startup with -A Welcome to the MySQL monitor. Commands end with ; or \g. Your MySQL connection id is 4 Server version: 5.1.52-debug Source distribution Copyright (c) 2000, 2010, Oracle and/or its affiliates. All rights reserved. This software comes with ABSOLUTELY NO WARRANTY. This is free software, and you are welcome to modify and redistribute it under the GPL v2 license Type 'help;' or '\h' for help. Type '\c' to clear the current input statement. mysql> SELECT id, HEX(TEXT) FROM TestList; +----+----------------------+ | id | HEX(TEXT) | +----+----------------------+ | 1 | 416E64657273 | | 2 | 456976696E64 | | 3 | 4F6C65 | | 4 | 4D617274696E | | 5 | 41617365 | | 6 | C383E280A6736D756E64 | | 7 | C383CB9C79737465696E | | 8 | C383E280A0676972 | +----+----------------------+ 8 rows in set (0.00 sec) mysql> select *, hex(text) from TestList ORDER BY text;+----+----------+----------------------+| id | TEXT | hex(text) | +----+----------+----------------------+ | 7 | Øystein | C383CB9C79737465696E | | 6 | Åsmund | C383E280A6736D756E64 | | 8 | Ægir | C383E280A0676972 | | 1 | Anders | 416E64657273 | | 2 | Eivind | 456976696E64 | | 4 | Martin | 4D617274696E | | 3 | Ole | 4F6C65 | | 5 | Aase | 41617365 | +----+----------+----------------------+ 8 rows in set (0.00 sec)
[27 Oct 2010 13:20]
Peter Laursen
I doubt that this is the issue. Characters are right - only sorting is wrong.
[27 Oct 2010 13:27]
Peter Laursen
I tried a 5.1.36 server on RHEL5 linux. Also here I get SELECT * FROM testsort ORDER BY txt; /* returns id text ------ -------- 1 Anders 2 Eivind 4 Martin 3 Ole 8 Ægir 7 Øystein 5 Aase 6 Åsmund (but I also do SET NAMES UTF8 automatically when connecting with SQLyog)
[27 Oct 2010 13:35]
Christoffer Hafsahl
mysql> select *, hex(text) from TestList ORDER BY text; +----+----------+----------------------+ | id | TEXT | hex(text) | +----+----------+----------------------+ | 7 | Øystein | C383CB9C79737465696E | | 6 | Åsmund | C383E280A6736D756E64 | | 8 | Ægir | C383E280A0676972 | | 1 | Anders | 416E64657273 | | 2 | Eivind | 456976696E64 | | 4 | Martin | 4D617274696E | | 3 | Ole | 4F6C65 | | 5 | Aase | 41617365 | +----+----------+----------------------+ 8 rows in set (0.00 sec) mysql> show variables like 'char%'; +--------------------------+----------------------------+ | Variable_name | Value | +--------------------------+----------------------------+ | character_set_client | latin1 | | character_set_connection | latin1 | | character_set_database | utf8 | | character_set_filesystem | binary | | character_set_results | latin1 | | character_set_server | latin1 | | character_set_system | utf8 | | character_sets_dir | /usr/share/mysql/charsets/ | +--------------------------+----------------------------+ The hexcodes are identical to what Valeriy posted. However, doing the following seemingly resolves the issue: mysql> set character set 'utf8'; Query OK, 0 rows affected (0.00 sec) mysql> show variables like 'char%'; +--------------------------+----------------------------+ | Variable_name | Value | +--------------------------+----------------------------+ | character_set_client | utf8 | | character_set_connection | utf8 | | character_set_database | utf8 | | character_set_filesystem | binary | | character_set_results | utf8 | | character_set_server | latin1 | | character_set_system | utf8 | | character_sets_dir | /usr/share/mysql/charsets/ | +--------------------------+----------------------------+ 8 rows in set (0.00 sec) If I do a select on the table now, all the scandinavian characters are garbled. Deleting and reinserting appears to fix the problem, but now my hex values are different from the ones posted by Valeriy: hex(text) from Valeriy hex(text) from me +----+----------+----------------------+ | 7 | Øystein | C383CB9C79737465696E | C39879737465696E | 6 | Åsmund | C383E280A6736D756E64 | C385736D756E64 | 8 | Ægir | C383E280A0676972 | C386676972 | 1 | Anders | 416E64657273 | 416E64657273 | 2 | Eivind | 456976696E64 | 456976696E64 | 4 | Martin | 4D617274696E | 4D617274696E | 3 | Ole | 4F6C65 | 4F6C65 | 5 | Aase | 41617365 | 41617365 +----+----------+----------------------+ I'd say that it's obvious that I'm doing something wrong here, but what? Except for the wrong sort order, I haven't had any encoding related issues with this setup.
[27 Oct 2010 13:41]
Valeriy Kravchuk
Somehow I wrongly assumed I had utf8 everywhere initially. I was wrong. This is how it should be done: macbook-pro:5.1 openxs$ bin/mysql -uroot test2 Reading table information for completion of table and column names You can turn off this feature to get a quicker startup with -A Welcome to the MySQL monitor. Commands end with ; or \g. Your MySQL connection id is 5 Server version: 5.1.52-debug Source distribution Copyright (c) 2000, 2010, Oracle and/or its affiliates. All rights reserved. This software comes with ABSOLUTELY NO WARRANTY. This is free software, and you are welcome to modify and redistribute it under the GPL v2 license Type 'help;' or '\h' for help. Type '\c' to clear the current input statement. mysql> show variables like 'char%'; +--------------------------+---------------------------------------------+ | Variable_name | Value | +--------------------------+---------------------------------------------+ | character_set_client | latin1 | | character_set_connection | latin1 | | character_set_database | utf8 | | character_set_filesystem | binary | | character_set_results | latin1 | | character_set_server | latin1 | | character_set_system | utf8 | | character_sets_dir | /Users/openxs/dbs/5.1/share/mysql/charsets/ | +--------------------------+---------------------------------------------+ 8 rows in set (0.00 sec) mysql> set names utf8; Query OK, 0 rows affected (0.00 sec) mysql> show variables like 'char%'; +--------------------------+---------------------------------------------+ | Variable_name | Value | +--------------------------+---------------------------------------------+ | character_set_client | utf8 | | character_set_connection | utf8 | | character_set_database | utf8 | | character_set_filesystem | binary | | character_set_results | utf8 | | character_set_server | latin1 | | character_set_system | utf8 | | character_sets_dir | /Users/openxs/dbs/5.1/share/mysql/charsets/ | +--------------------------+---------------------------------------------+ 8 rows in set (0.00 sec) mysql> drop table TestList; Query OK, 0 rows affected (0.00 sec) mysql> CREATE TABLE TestList( id int auto_increment primary key, text varchar(255)) ENGINE = -> MyISAM; Query OK, 0 rows affected (0.05 sec) mysql> mysql> INSERT INTO TestList (text) VALUES ('Anders'), ('Eivind'), ('Ole'), ('Martin'), ('Aase'), -> ('Åsmund'), ('Øystein'), ('Ægir'); Query OK, 8 rows affected (0.00 sec) Records: 8 Duplicates: 0 Warnings: 0 mysql> mysql> select * from TestList ORDER BY text; +----+----------+ | id | text | +----+----------+ | 1 | Anders | | 2 | Eivind | | 4 | Martin | | 3 | Ole | | 8 | Ægir | | 7 | Øystein | | 5 | Aase | | 6 | Åsmund | +----+----------+ 8 rows in set (0.00 sec)
[28 Oct 2010 7:55]
Christoffer Hafsahl
Using SET NAMES utf8 also works; everything is sorted as it should although data had to be reinserted. Is there any way to simplify this for others using precompiled packages? I'd imagine I'm not the only one using Ubuntu, Debian, or any other OS with a package system where this might be a problem. Could it be possible to force the character sets for the client, connection and results to be the same charset as the database, either by default (probably not a good idea) or by a config switch?
[28 Oct 2010 11:34]
Alexander Barkov
Client side character set auto-detection was implemented in 5.5: This is am excerpt from: http://dev.mysql.com/doc/refman/5.5/en/charset-connection.html The MySQL client programs mysql, mysqladmin, mysqlcheck, mysqlimport, and mysqlshow determine the default character set to use as follows: ... The programs can autodetect which character set to use based on the operating system setting, such as the value of the LANG or LC_ALL locale environment variable on Unix systems or the code page setting on Windows systems. For systems on which the locale is available from the OS, the client uses it to set the default character set rather than using the compiled-in default. For example, setting LANG to ru_RU.KOI8-R causes the koi8r character set to be used. Thus, users can configure the locale in their environment for use by MySQL clients. In the versions before 5.5 please use either of these ways: 1. SET NAMES utf8; 2. mysql --default-character-set=utf8 3. mysql.ini: [client] default-character-set=utf8