MySQL Bugs: #27877: incorrect german order in utf8_general

Bug #27877	incorrect german order in utf8_general_ci
Submitted:	17 Apr 2007 8:43	Modified:	26 Mar 2008 17:58
Reporter:	Domas Mituzas	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Charsets	Severity:	S3 (Non-critical)
Version:	5.1-bk, 5.0-bk	OS:	Any
Assigned to:	Alexander Barkov	CPU Architecture:	Any

Description:
The manual (http://dev.mysql.com/doc/refman/5.0/en/charset-unicode-sets.html) tells:

"
A difference between the collations is that this is true for utf8_general_ci:

ß = s
Whereas this is true for utf8_unicode_ci:

ß = ss
"

How to repeat:
mysql> set names utf8;
Query OK, 0 rows affected (0.03 sec)

mysql> insert into ge values ('a'),('b'),('s'),('u'),('ß');
Query OK, 5 rows affected (0.05 sec)
Records: 5  Duplicates: 0  Warnings: 0

mysql> select * from ge order by a collate utf8_general_ci;
+------+
| a    |
+------+
| a    | 
| b    | 
| s    | 
| u    | 
| ß   | 
+------+
5 rows in set (0.01 sec)

mysql> select * from ge order by a collate utf8_unicode_ci;
+------+
| a    |
+------+
| a    | 
| b    | 
| s    | 
| ß   | 
| u    | 
+------+
5 rows in set (0.02 sec)

Suggested fix:
correct the manual (specify, that German language ordering is not done by utf8_general_ci), or fix the collation.

Domas, 
please add "SHOW CREATE TABLE" output,
and also results of these queries:

mysql> select a, hex(a) from ge order by a collate utf8_general_ci;
mysql> select a, hex(a) from ge order by a collate utf8_unicode_ci;

Domas,

Sorry, there's no need for additional info.
I manage to repeat this problem. This is really a bug.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/42026

ChangeSet@1.2546, 2008-02-11 16:28:33+04:00, bar@mysql.com +10 -0
  Bug#27877 incorrect german order in utf8_general_ci
  Problem: incorrect sort order for "U+00DF SHARP S".
  Fix: changing sort order for U+00DF to be equal to 's',
  like the manual says.

Pushed into 5.1.24-rpl

I guess this fix not only apply to sortorder, but also to selection with "=" and "like"?

set names utf8 COLLATE utf8_general_ci;
select 'a' = 'Ä'; # ok
select 'ß' = 's'; # not ok

Pushed into 5.1.24-rc

Noted in 5.1.24 changelog.

The utf8_general_ci collation incorrectly did not sort "U+00DF SHARP S" equal to 's'.

Pushed into 6.0.5-alpha

Fix also noted in the changelogs for 5.1.23-ndb-6.3.11 and 6.0.5.

Addition to changelog entry:

As a result of this fix, any indexes on columns that use the
utf8_general_ci or ucs2_general_ci collation (especially columns that
use German SHARP S) must be rebuilt when upgrading to 5.1.24/6.0.5 or
higher. To do this, use ALTER TABLE to drop and re-add the indexes,
or mysqldump to dump the affected tables and mysql to reload the dump
file.

(Coming from a related Drupal issue http://drupal.org/node/772678)

Although this bug report and overall issue talks about sort order only, it's the only issue in a series of issues related to this topic, which actually led to a commit in functional code.  If this is not the proper issue, it would be great to know where I should follow-up instead.

Not only the sort order for utf8_general_ci was changed, but also equality, leading to MySQL reporting duplicate keys upon trying to import existing data from previous versions or other data sources.

The initially stated comparison rules were only meant to be applied for sort order, and totally not for comparison/equality.

"
ß = s
"
is unacceptable to be equal (for German, that is).  Sorting 'ß' (SHARP-S) directly after 's' is acceptable, but that is where the equality ends.
The original bug description rightfully stated the sort order, only.

utf8_general_ci was correct in 5.0.x, but is broken with 5.1.x.  'ß' is equal to 's' now.  That's plain wrong; if at all, 'ß' may be transliterated to 'ss'.

Example code to replicate this issue, using two entirely different German words "weiß" and "weis":

mysql> SET NAMES utf8;
Query OK, 0 rows affected (0.00 sec)

mysql> SELECT version();
+------------+
| version()  |
+------------+
| 5.1.36-log |
+------------+
1 row in set (0.00 sec)

mysql> CREATE DATABASE test DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;
Query OK, 1 row affected (0.00 sec)

mysql> USE test;
Database changed

mysql> CREATE TABLE t1 ( fieldname VARCHAR(255) NOT NULL DEFAULT '', PRIMARY KEY(fieldname) ) DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;
Query OK, 0 rows affected (0.11 sec)

mysql> INSERT INTO t1 VALUES ('weiß'), ('weis');
ERROR 1062 (23000): Duplicate entry 'weis' for key 'PRIMARY'

mysql> ALTER TABLE t1 DROP PRIMARY KEY;
Query OK, 1 row affected (0.03 sec)
Records: 1  Duplicates: 0  Warnings: 0

mysql> ALTER TABLE t1 ADD UNIQUE KEY(fieldname);
Query OK, 1 row affected (0.09 sec)
Records: 1  Duplicates: 0  Warnings: 0

mysql> INSERT INTO t1 VALUES ('weiß'), ('weis');
ERROR 1062 (23000): Duplicate entry 'weiß' for key 'fieldname'

mysql> CREATE TABLE t2 ( fieldname VARCHAR(255) NOT NULL DEFAULT '', PRIMARY KEY(fieldname) ) DEFAULT CHARACTER SET utf8 COLLATE utf8_unicode_ci;
Query OK, 0 rows affected (0.11 sec)

mysql> INSERT INTO t2 VALUES ('weiß'), ('weis');
Query OK, 2 rows affected (0.00 sec)
Records: 2  Duplicates: 0  Warnings: 0

mysql> SELECT 's' = 'ß';
+------------+
| 's' = 'ß'  |
+------------+
|          1 |
+------------+
1 row in set (0.00 sec)

mysql> SELECT 'ss' = 'ß';
+-------------+
| 'ss' = 'ß'  |
+-------------+
|           0 |
+-------------+
1 row in set (0.00 sec)

mysql> SELECT 's' = 'ß' COLLATE utf8_unicode_ci;
+------------------------------------+
| 's' = 'ß' COLLATE utf8_unicode_ci  |
+------------------------------------+
|                                  0 |
+------------------------------------+
1 row in set (0.00 sec)

mysql> SELECT 'ss' = 'ß' COLLATE utf8_unicode_ci;
+-------------------------------------+
| 'ss' = 'ß' COLLATE utf8_unicode_ci  |
+-------------------------------------+
|                                   1 |
+-------------------------------------+
1 row in set (0.00 sec)

The comparison rules, i.e., equality in utf8_unicode_ci seems to be correct.  Equality in utf8_general_ci needs to be restored.

utf8_general_ci won't be changed anymore.

Please use utf8_unicode_ci instead, it provides correct German order.
You might also want to try utf8_german2_ci, which was added into 5.5.
It provides German Phonebook order.

Please understand that this is NOT about sort order.  It's about a bogus equality.

utf8_general_ci can no longer be used for any application that stores strings in German, since all of a sudden the collation considers two characters being equal, which are not equal.  They never were equal, and they will never be equal.

This is like considering the words 'more' and 'move' to be the same.  But they are not, because 'r' != 'v'

As in MySQL 5.0.x, utf8_general_ci is supposed to do simple character comparisons only.  'ß' (SHARP-S) is only equal to itself, 'ß':

mysql> SELECT version();
+----------------------+
| version()            |
+----------------------+
| 5.0.51b-community-nt |
+----------------------+
1 row in set (0.00 sec)

mysql> SET NAMES utf8;
Query OK, 0 rows affected (0.01 sec)

mysql> SELECT 's' = 'ß';
+-----------+
| 's' = 'ß' |
+-----------+
|         0 |
+-----------+
1 row in set (0.00 sec)

mysql> SELECT 'ss' = 'ß';
+------------+
| 'ss' = 'ß' |
+------------+
|          0 |
+------------+
1 row in set (0.00 sec)

mysql> SELECT 'ß' = 'ß';
+-----------+
| 'ß' = 'ß' |
+-----------+
|         1 |
+-----------+
1 row in set (0.00 sec)

Daniel,

This is our new policy - we don't change any collations anymore. Never.
They introduce a lot of upgrade problems. So the current implementation
of utf8_general_ci will stay forever.

The plan is to add a new collation, which will be compatible with the
old utf8_general_ci. The exact name is not known yet.

Meanwhile, please use utf8_unicode_ci or utf8_german2_ci.

> They introduce a lot of upgrade problems. So the current implementation
> of utf8_general_ci will stay forever.

That's how I encountered this bug: trying to upgrade different databases from 5.0 to 5.1.  I understand and actually support your position, but there is a logical flaw in it:

If utf8_general_ci was correct in 5.0 and a change in 5.1 broke it, then you introduce and manifest a lot of upgrade problems by not reverting the bogus change and not going back to the correct implementation.

Hence, to make any sense of the policy, critical changes (like this) should be reverted to match the implementation in 5.0, so users do not experience such upgrade problems.  Especially, as it seems that hosting/application providers are starting to switch to 5.1 just recently.

A strict execution of the policy would implement the changes as new collations, not affecting existing collations in-use.

> The plan is to add a new collation, which will be compatible with the
> old utf8_general_ci. The exact name is not known yet.

If I take it correctly that there is no collation that is fully identical to utf8_general_ci under 5.0, then this only hardens aforementioned logical flaw.

People are losing valid data due to the collation change.

> Meanwhile, please use utf8_unicode_ci or utf8_german2_ci.

I'd love to, but

- utf8_unicode_ci is not an option, due to http://bugs.mysql.com/bug.php?id=39808.  But also, since 'ß' = 'ss', i.e., still losing data that was valid before.

- On 5.1.36, there is no utf8_german_ci, utf8_german2_ci, or utf8_german3_ci.

That's why we're slightly running in circles here.  Lastly, if the collation change would only affect sort order, but not equality, then I guess almost no one would care.  However, it affects string comparisons, primary and unique keys, and is therefore invalidating existing but totally valid data.

I'd like to know in which language 'ß' = 's' actually holds true.  I was under the assumption that, as also mentioned on http://dev.mysql.com/doc/refman/5.1/en/charset-unicode-sets.html, 'ß' is the German letter sharp-s, and mainly/primarily used in the German language.  However, assuming equality in 'ß' = 's' is completely wrong for the German language.

If 'ß' is actually also used in other languages and rightfully compares to 's' in them, then I happily shut up given some evidence.  Although such a language-specific rule really does not fit into a generic collation named utf8_general_ci, and the change still breaks upgrades.

I would also add that, even if it's true that changing collations should never taken lightly, reverting to 's' != 'ß' cannot potentially break anything. This is a perfectly compatible change.

Damien, it would force everyone who has those characters in an index to rebuild their indexes when upgrading. That's something we are not prepared to do, because it was too painful when we made the first change and we have effectively said that we will never to do that again, unless we have really extreme reasons for doing so. We certainly can't do it as part of a monthly update to 5.1, so it will not change in 5.1.

What you can do is create appropriate bugs for any changes that you think are needed. We do want to be perfect, but also need to be careful in how we get there. Same invitation for everyone involved with Drupal who thinks there's a problem in a collation - new bug and we'll take a look.

For Drupal, work on supporting the use of different collations well will be good. Then as we introduce improved collations based on feedback it'll be possible for Drupal users to use those collations.

I think Damien is right. You should revert the patch.

It is painful to upgrade from 5.0 to 5.1 if you have an unique index. 

You have two options:

1) Rebuilding the index
Then a duplicate key error can occur. You have to manipulate the data.
And this is a big problem and is in fact no solution.

2) Use the old tables
Then the index is not consistent. Things get worse as time goes by. You can have exact the same entries in an unique column and the server finds not the rows it should find.

The patch was broken by design: 
 - It is a threat for the data integrity.
 - It is a problem to ALTER larhe tables

Damien's suggestions is the best you can do. The implications are small compared to the status now.

I think Damien is right. You should revert the patch.

It is painful to upgrade from 5.0 to 5.1 if you have an unique index. 

You have two options:

1) Rebuilding the index
Then a duplicate key error can occur. You have to manipulate the data.
And this is a big problem and is in fact no solution.

2) Use the old tables
Then the index is not consistent. Things get worse as time goes by. You can have exact the same entries in an unique column and the server finds not the rows it should find.

The patch was broken by design: 
 - It is a threat for the data integrity.
 - It is a problem to ALTER large tables

Damien's suggestions is the best you can do. The implications are small compared to the status now.

Holger, the implication of making the change you suggest is to force every existing MySQL 5.1 user who is using that collation to rebuild the indexes on their tables if they get a minor upgrade to their server version. Just five months before the active support for 5.1 is scheduled to end. The change is annoying to some users but that's a tiny fraction of the users who would be upset if we made most of them rebuild all of their indexes after a minor update to their already in production 5.1 server.

Hello James,

okay - but the implications would be lower. 

A fix would be another collation e.g. "utf8_general_ci2" in 5.1 with the same behaviour as utf8_general_ci on 5.0. 

Is there a compatible collation for utf8_general_ci/5.0 for MySQL 5.1 or a later version (5.5?, 5.6?). The main problem is "ß" is equal to "s", not the sorting itself. 

Regards,
Holger

Bar, is there anything more you can write today about name and possible versions for "The plan is to add a new collation, which will be compatible with the old utf8_general_ci. The exact name is not known yet."

See Bug#43593 for a workaround to the incompatibility introduced by this bug fix.

For reference, this was the workaround that was previously documented in the MySQL 5.5 manual. The content was removed 2017/04/12:

For incompatibilities introduced in MySQL 5.1.24 by the fix for Bug
#27877 that corrected the utf8_general_ci and ucs2_general_ci collations,
a workaround is implemented as of MySQL 5.1.62, 5.5.21, and 5.6.5. Upgrade
to one of those versions, then convert each affected table using one of
the following methods. In each case, the workaround altering affected
columns to use the utf8_general_mysql500_ci and ucs2_general_mysql500_ci
collations, which preserve the original pre-5.1.24 ordering of
utf8_general_ci and ucs2_general_ci. 

To convert an affected
table after a binary upgrade that leaves the table files in place, alter
the table to use the new collation. Suppose that the table t1 contains one
or more problematic utf8 columns. To convert the table at the table level,
use a statement like this: 

ALTER TABLE t1 CONVERT TO
CHARACTER SET utf8 COLLATE utf8_general_mysql500_ci; 

To apply the change
on a column-specific basis, use a statement like this (be sure to repeat
the column definition as originally specified except for the COLLATE
clause): 

ALTER TABLE t1 MODIFY c1 CHAR(N) CHARACTER SET
utf8 COLLATE utf8_general_mysql500_ci; 

To upgrade the table using a dump
and reload procedure, dump the table using mysqldump, modify the CREATE
TABLE statement in the dump file to use the new collation, and reload the
table. 

After making the appropriate changes, CHECK TABLE
should report no error.