Bug #43306 Wrong utf8 collation default behaviour
Submitted: 2 Mar 2009 13:16 Modified: 12 Mar 2009 12:54
Reporter: Susanne Ebrecht Email Updates:
Status: Duplicate Impact on me:
None 
Category:MySQL Server: Charsets Severity:S1 (Critical)
Version:5.1, 6.0 OS:Any
Assigned to: CPU Architecture:Any
Tags: regression

[2 Mar 2009 13:16] Susanne Ebrecht
Description:
Test:

DROP DATABASE IF EXISTS bug;
CREATE DATABASE bug CHARSET UTF8;
SET NAMES UTF8;
USE bug;
CREATE TABLE t (a integer, b char(5) CHARSET UTF8, PRIMARY KEY(a,b))ENGINE=MYISAM;

-- Now try to insert two rows:

INSERT INTO t VALUES (23, 0xC39654C5A13261);
INSERT INTO t VALUES (23, 0xC39674C39F32C385);

SELECT * FROM t;

+----+----------+
| a  | b        |
+----+----------+
| 23 | ÖTš2a  | 
| 23 | Ötß2Å | 
+----+----------+

(23 and ÖTš2a) != (23 and Ötß2Å)

MySQL 5.0.79

All works fine and as expected.

MySQL 5.1.33
ERROR 1062 (23000): Duplicate entry '23-Ötß2Å' for key 'PRIMARY'

MySQL 6.0.11
ERROR 1062 (23000): Duplicate entry '23-Ötß2Å' for key 'PRIMARY'

How to repeat:
See above

Suggested fix:
Something went wrong with collation
[2 Mar 2009 13:18] Susanne Ebrecht
This is shooting every database that will be updated from 5.0 to 5.1.
[2 Mar 2009 13:18] Susanne Ebrecht
For keys binary collation should be used.
[3 Mar 2009 5:11] Alexander Barkov
After we fixed "Bug#27877 incorrect german order in utf8_general_ci",
this is true:

s= š = ß

and therefore:

ÖTš2a = Ötß2Å 

So primary key violation is expected.

I can't see bugs here.

Some more info about Bug#27877 
> Noted in 5.1.24 changelog:
> The utf8_general_ci collation incorrectly did not sort "U+00DF SHARP S" equal > to 's'.
> Pushed into 6.0.5-alpha
[3 Mar 2009 6:36] Susanne Ebrecht
Workaround is to add collation utf8_bin to the columns before upgrading from 5.0 to 5.1.
[5 Mar 2009 16:14] Peter Gulutzan
Alexander Barkov wrote: "I can't see bugs here."

I agree, this particular behaviour is expected.
In fact some data loss or corruption is expected. 
It's too late to change the decision.
There is no simple solution involving mysql_upgrade.
Making 'ß' = 'ss' now would make things far worse.

Not a Bug.
[5 Mar 2009 16:33] Susanne Ebrecht
The bug here is that you are not able to upgrade a 5.0 database to 5.1.

mysql_upgrade is failing.
mysqldump and restore also fails.

So, bug is incompatibility.
[5 Mar 2009 16:36] Susanne Ebrecht
Additionally the bug is that ß is treated as 's'. There is none official rule outside that is saying that 'ß' should be treated as 's'. Neither rules for German language nor unicode rules. There are just rules outside that say ß should be treated as 'ss'.
[6 Mar 2009 6:31] Alexander Barkov
Susanna,

utf8_general_ci and latin1_german1_ci are made using a simple
collation driver which supports only one-to-one mapping between
a character and its sorting weight. These collation give the best
performance. We won't change neither of them.

What can be done - we can add new collations,
which will work as follows:

Auml     = A
Ouml     = O
Uuml     = U
SHARP S  = SS

They will work slower, especially on unindexed ORDER BY -
because the file_sort method will need twice more disk space,
comparing to the "one-to-one" collations.

Note, there is no need to add a Unicode collation.
It already exists - utf8_unicode_ci.

We only need to add a latin1 collation with these rules.

In the end of 2008 I proposed to add "WL#1287 REAL DIN-1 German collation".
Susanne replied: not familiar in Germany at at, very low priority.
So the decision we made altogether was: not for 6.1.

Susanne, if you think we made this decision in a mistake, we can
discuss again adding WL#1287 for 6.1. However, there are very little chances
to start coding this task before July. Our Q1 and Q2 plans are already tight 
enough. Unless we find a volunteer :)
[10 Mar 2009 13:08] Alexander Barkov
"SHARP S" is equal to "S" due to the fix
"Bug#27877 incorrect german order in utf8_general_ci".
That was *intentional* change. There are no bugs here.
[11 Mar 2009 9:10] Susanne Ebrecht
Alexander,

I still have the opinion that German1 is not so familiar then German2.

But this is another topic.

In any case, neither dump/restore nor mysql_update should crash here and this needs fixing.
[11 Mar 2009 14:44] Peter Gulutzan
I read the main description and it is not about a crash in mysql_upgrade,
perhaps if there is a reproducible case of that then a separate report
might be verifiable.

The described behaviour in the main description is not a bug.
[12 Mar 2009 8:24] Susanne Ebrecht
Peter,

"This is shooting every database that will be updated from 5.0 to 5.1."

Also there was a question again, what is the bug here and the answer was:

"The bug here is that you are not able to upgrade a 5.0 database to 5.1.

mysql_upgrade is failing.
mysqldump and restore also fails."

The original intention of this bug report is that all our upgrading possibilities will fail.

Means: Dump/Restore, mysql_upgrade and MA Backup/Restore.

The result of deeper analysis during verifying was that it will fail because of changes in utf8_general_ci.
[12 Mar 2009 12:54] Susanne Ebrecht
Because of this chaos here I opened a fresh bug report:

Bug #43593 explicit point out that dump/backup/restore/upgrade is failing.

I will close this bug report now as duplicate of #43593