Bug #43593 dump/backup/restore/upgrade tools fails because of utf8_general_ci
Submitted: 12 Mar 2009 12:48 Modified: 2 Apr 2012 18:56
Reporter: Susanne Ebrecht Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Charsets Severity:S1 (Critical)
Version:5.1 OS:Any
Assigned to: Alexander Barkov CPU Architecture:Any

[12 Mar 2009 12:48] Susanne Ebrecht
Description:
dump/backup/restore/upgrade tools fails because of duplicate key errors.

This is the main intention for bug #43306.

Because the discussion run out of topic in bug #43306 I opened this bug report here.

We need a solution for utf8_general_ci and all dump/backup/restore/upgrade tools, that these won't fail during upgrading from 5.0 to 5.1.

Reason for it are changes in utf8_general_´ci.

How to repeat:
MySQL 5.0:

Create a table:

CREATE TABLE t (a integer, b char(5) CHARSET UTF8, PRIMARY KEY(a,b))ENGINE=MYISAM;

INSERT INTO t VALUES (23, 0xC39654C5A13261);
INSERT INTO t VALUES (23, 0xC39674C39F32C385);

Use mysqlupgrade to upgrade to 5.1.

Or dump the table and try to restore in 5.1.

Or use MA Backup for 5.0 and MA Restore for 5.1.

Suggested fix:
Already discussed in bug #43306
[1 Apr 2009 6:28] Susanne Ebrecht
Bug #43810 is set as duplicate of this bug here
[10 May 2009 16:42] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/73714

2892 Alexander Barkov	2009-05-10
      Bug#43593 dump/backup/restore/upgrade tools fails
      
      This is a preliminary patch intended to help to those users:
      - Having indexes on columns with collations
        utf8_general_ci or ucs2_general_ci
      - Having German letter SHARP S (SZLIG) in these columns
      - Upgrading from MySQL from version
        5.1.23 (and earlier) to version 5.1.24 (and higher).
      
      This patch introduces new collations utf8_german3_ci
      and ucs2_german2_ci which reproduce the "old"
      sorting order provided by pre-5.1.24 versions of xxx_general_ci.
      
      
      In order to start using new MySQL-5.1.24+ please do the following:
      
      - Start new version of mysqld
      - Convert all affected tables using this query (in case of utf8):
      
        ALTER TABLE t1 CONVERT TO CHARACTER SET utf8 COLLATE utf8_german3_ci;
      
        Or if you need to apply changes per-column level, using this example:
      
        ALTER TABLE t1 MODIFY c1 CHAR(N) CHARACTER SET utf8 COLLATE utf8_german3_ci;
      
        (Make sure you're using the old data type and size,
         NULL/NOT NULL constraints, etc).
[10 Apr 2010 20:34] gonfi den tschal
it may be obvious: also happens with unique constraints. 

i don't see why i should change the collation of a db field that is not language-specific to german. this sounds like it's only a matter of time until i get errors on other characters. so i'll have to re-implement uniqueness in my application code.
[29 Apr 2010 23:05] Arjen Lentz
Bar and/or Susanne, can you please urgently clarify exactly what the original collation did and what the fix does?
For both before/after, we need to know whether it relates to
 - transcription (like u umlaut can be transcribed to ue and be sorted after)
 - sort order (and in what way, position in alphabet please)
 - equivalence (two different chars regarded as the same)
 * added question: does collation ever introduce equivalence, or always merely order?

Drupal has an open bug ()http://drupal.org/node/772678) which was initiated by the fix to this MySQL bug and we need the above details to be able to assess what is actually going on.

The assertion in the drupal bug is that the bugfix actually makes rows disappear as duplicates on rebuilding of the table. this would indicate equivalence. Now, as far as I know a ringel-S is either separate next to S, at end of alphabet, or transcribed to SS, so I don't see how it could ever trigger a duplicate.
Thus I need clarification on whether equivalence is at work, and in fact certainty that the bugfix is in fact correct. If it is correct, then the Drupal datasets are wrong to have regarded them to be different in the first place - however, they are dealing with distinct names, and thus I don't think it's quite that simple.
Your help is appreciated.
[2 May 2010 12:25] Luca Accomazzi
I don't know if the behaviour that's driving me nuts today is caused by this bug, but it really does seem the same. I'm running 5.1.37  under OS X and my character_set_EVERYTHING is utf8.

CREATE TABLE `stopwords` (
  `language` int(11) NOT NULL,
  `word` varchar(63) NOT NULL,
  PRIMARY KEY (`language`,`word`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

mysql> INSERT INTO stopwords values (1001, 'e');
Query OK, 1 rows affected (0.00 sec)
mysql>  insert into stopwords values (1001, 'è');
ERROR 1062 (23000): Duplicate entry '1001-è' for key 'PRIMARY'

Now I happen to be Italian (not my fault, it just happened to me) and I must tell you that those are two very different words in here. "E" means "and" while "è" means "is".
[2 May 2010 21:03] Peter Gulutzan
To Arjen Lentz:
The decision was that re-indexing and/or
import/export plus data updating would occasionally
be necessary. If there was a magic fix, you
would have seen it long ago.
[2 May 2010 21:04] Peter Gulutzan
To Luca Accomazzi:
The behaviour that you describe has
nothing to do with this bug report.
[3 May 2010 14:07] Luca Accomazzi
I fail to see where the difference is. I came here following the discussion for bug 43306 http://bugs.mysql.com/bug.php?id=43306
Under v5.0, my problem's not there. Look:
mysql> select version();
+-----------+
| version() |
+-----------+
| 5.0.26    | 
+-----------+
1 row in set (0.00 sec)

mysql> select * from stopwords where (word='e' or word='è') and language=1001;
+----------+------+
| language | word |
+----------+------+
|     1001 | e    | 
|     1001 | è   | 
+----------+------+
2 rows in set (0.00 sec)

I may be mistaken. If so, please give me a tiny hint or a link to an explanation, a reason why two very different words can't be inserted in two different records under v5.1 when there's an active key. Thank you in advance.
[3 May 2010 16:09] Peter Gulutzan
With utf8_general_ci, e = e + grave accent in both 5.0 and 5.1.
There was no change. For description of what was changed, see
"Checking Whether Tables or Indexes Must Be Rebuilt" in the manual.
[4 May 2010 0:50] Arjen Lentz
Peter Gulutzan: sorry, I don't how your response addresses the issue or my questions about it. A change was made, and I am simply asking for clarification as to what the original situation was and what the situation is now.

I conclude that a change was made, because people report seeing column values that were previously distinct to now be regarded as equal. Simply rebuilding an index even though there was no change would not do that, would it.
[1 Jun 2010 22:33] Peter Gulutzan
To Arjen Lentz:

You asked:
> For both before/after, we need to know whether it relates to
>  - transcription (like u umlaut can be transcribed to ue and be sorted after)
>  - sort order (and in what way, position in alphabet please)
>  - equivalence (two different chars regarded as the same)

The thing that changed is "collation", which affects whether
something is "sorted after", affects "position in alphabet",
and affects whether characters are "regard as the same".
The main example is that SHARP S moved due to a "fix"
for Bug #27877 incorrect german order in utf8_general_ci.
The new SHARP S position in the alphabet is: equal to 's'.
The recommended procedure is in the MySQL Reference Manual,
"2.4.3. Checking Whether Tables or Indexes Must Be Rebuilt"
http://dev.mysql.com/doc/refman/5.1/en/checking-table-incompatibilities.html
... which is the page I mentioned before.

> * added question: does collation ever introduce
> equivalence, or always merely order?
Collation affects decisions about "equality".
And it's possible that keys which were not duplicate before,
will be duplicate now.

> Drupal has an open bug ... initiated by the fix to this MySQL bug
But there is no magic fix to this MySQL "bug",
so the recommended procedure is as I said before.
[21 Jul 2010 22:44] Daniel Kudwien
Closely related to this issue, containing further details, and perhaps the cause (not verified) for these (critical) problems may be http://bugs.mysql.com/bug.php?id=27877
[28 Mar 2012 7:00] Stewart Smith
This bug is no longer 'In Progress'. It's actually fixed in 5.1.62 and is missing from the release notes.

The revision that fixed it is:

revno: 3560.16.1
revision-id: alexander.barkov@oracle.com-20120123090710-ax8yuq9z5l2efwyj
parent: nirbhay.choubey@oracle.com-20120117034058-cb9ivmns2z3wkfrw
committer: Alexander Barkov <alexander.barkov@oracle.com>
branch nick: mysql-5.1.b11752408
timestamp: Mon 2012-01-23 13:07:10 +0400
message:
  Bug#11752408 - 43593: DUMP/BACKUP/RESTORE/UPGRADE TOOLS FAILS BECAUSE OF UTF8_GENERAL_CI
  
  Introducing new collations:
  utf8_general_mysql500_ci and ucs2_general_mysql500_ci,
  to reproduce behaviour of utf8_general_ci and ucs2_general_ci
  from mysql-5.1.23 (and earlier).
  
  The collations are added to simplify upgrade from mysql-5.1.23 and earlier.
  
  Note: The patch does not make new server start over old data automatically.
  Some manual upgrade procedures are assumed.
  
  Paul: please get in touch with me to discuss upgrade procedures
  when documenting this bug.
  
  modified:
    include/m_ctype.h
    mysql-test/r/ctype_utf8.result
    mysql-test/t/ctype_utf8.test
    mysys/charset-def.c
    strings/ctype-ucs2.c
    strings/ctype-utf8.c
[2 Apr 2012 18:56] Paul DuBois
Noted in 5.1.62, 5.5.21, 5.6.5 changelogs.

New utf8_general_mysql500_ci ucs2_general_mysql500_ci collations have
been added that preserve the behavior of utf8_general_ci and 
ucs2_general_ci from versions of MySQL previous to 5.1.24. Bug #27877
corrected an error in the original collations but introduced an
incompatibility for for columns that contain German 'ß' LATIN SMALL
LETTER SHARP S. (As a result of the fix, that character compares
equal to characters with which it previously compared different.) A
symptom of the problem after upgrading from a version older than
5.1.24 to version 5.1.24 or newer is that CHECK TABLE produces this
error: 

Table upgrade required. Please do "REPAIR TABLE `t`" or dump/reload 
to fix it! 

Unfortunately, REPAIR TABLE could not fix the problem. The new
collations permit older tables created before MySQL 5.1.24 to be
upgraded to current versions of MySQL.

To convert an affected table after a binary upgrade that leaves the
table files in place, you can alter the table to use the new
collation. Suppose that the table t1 contains one or more problematic
utf8 columns. To convert the table at the table level, use a
statement like this:

ALTER TABLE t1
CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_mysql500_ci;

To apply the change on a column-specific basis, use a statement like
this (be sure to repeat the column definition as originally specified
except for the COLLATE clause):

ALTER TABLE t1
MODIFY c1 CHAR(N) CHARACTER SET utf8 COLLATE utf8_general_mysql500_ci;

After making the appropriate changes, CHECK TABLE should report no
error. 

To upgrade the table using a dump and reload procedure, dump the 
table using mysqldump, modify the CREATE TABLE statement in the dump
file to use the new collation, and reload the table.