Bug #24985 UTF8 ENUM primary key on MEMORY using BTREE causes incorrect duplicate entries
Submitted: 12 Dec 2006 0:21 Modified: 5 Apr 2007 19:42
Reporter: Ryan Korczykowski Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Memory storage engine Severity:S3 (Non-critical)
Version:5.0.27/4.1BK/5.0BK/5.1BK OS:Windows (Windows 2K/Linux)
Assigned to: Ingo Strüwing CPU Architecture:Any
Tags: BTREE, duplicate, enum, Memory, primary key, utf8

[12 Dec 2006 0:21] Ryan Korczykowski
Description:
A primary key on an ENUM type using BTREE for a MEMORY table with charset of UTF8 and collation of utf8_general_ci incorrectly recognizes duplicate entries.

How to repeat:
*make sure that BusPtid is utf8/utf8_general_ci*

CREATE TABLE `espi_lp_hash` (
  `BusPtid` enum('301362','401362','302362','403362','303362','304362','404362','305362','405362','306362','406362','407362','408362','Error')  NOT NULL default '301362',
  `Date` date NOT NULL default '0000-00-00',
  `Hour` tinyint(4) NOT NULL default '0',
  `Fraction` double NOT NULL default '0',
  PRIMARY KEY  USING BTREE (`BusPtid`,`Date`,`Hour`)
) ENGINE=MEMORY DEFAULT CHARSET=utf8;

INSERT INTO `espi_lp_hash` (`BusPtid`, `Date`, `Hour`, `Fraction`) VALUES 
('301362', '2004-04-02', 0, 0);

INSERT INTO `espi_lp_hash` (`BusPtid`, `Date`, `Hour`, `Fraction`) VALUES 
('302362', '2004-04-02', 0, 0);

#1062 - Duplicate entry '302362-2004-04-02-0' for key 1 

Suggested fix:
Workaround is to set the column charset to latin1 and collate latin1_general_ci or change to HASH index.
[12 Dec 2006 0:45] MySQL Verification Team
Thank you for the bug report. Verified as described on Suse Linux:

miguel@hegel:~/dbs/5.0> bin/mysql -uroot test
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 1
Server version: 5.0.32-debug Source distribution

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

mysql> CREATE TABLE `espi_lp_hash` (
    ->   `BusPtid`
    -> enum('301362','401362','302362','403362','303362','304362','404362',
    -> '305362','405362','306362','406362','407362',
    -> '408362','Error')  NOT NULL default '301362',
    ->   `Date` date NOT NULL default '0000-00-00',
    ->   `Hour` tinyint(4) NOT NULL default '0',
    ->   `Fraction` double NOT NULL default '0',
    ->   PRIMARY KEY  USING BTREE (`BusPtid`,`Date`,`Hour`)
    -> ) ENGINE=MEMORY DEFAULT CHARSET=utf8;
Query OK, 0 rows affected (0.10 sec)

mysql> INSERT INTO `espi_lp_hash` (`BusPtid`, `Date`, `Hour`, `Fraction`) VALUES 
    -> ('301362', '2004-04-02', 0, 0);
Query OK, 1 row affected (0.01 sec)

mysql> 
mysql> INSERT INTO `espi_lp_hash` (`BusPtid`, `Date`, `Hour`, `Fraction`) VALUES 
    -> ('302362', '2004-04-02', 0, 0);
ERROR 1062 (23000): Duplicate entry '302362-2004-04-02-0' for key 1
mysql> exit
Bye
[23 Mar 2007 8:55] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/22734

ChangeSet@1.2609, 2007-03-23 09:54:49+01:00, istruewing@chilla.local +5 -0
  Bug#24985 - UTF8 ENUM primary key on MEMORY using BTREE
              causes incorrect duplicate entries
  
  Keys for a BTREE index on an ENUM column of a MEMORY table
  with character set UTF8 were computed incorrectly. Many
  different column values got the same key value.
  
  Apart of possible performance problems it made unique indexes
  of this type unusable because it rejected many different
  values as duplicates.
  
  The problem was that multibyte character detection was tried
  on the internal numeric column value. Many values were not
  identified as characters. Their key value became blank filled.
  
  The solution is modelled after the equivalent behavior of
  MyISAM. A character set is defined for key segments of text
  type only. Binary segments get a pseudo binary charset attached.
  All other types do not get a charset attached.
[26 Mar 2007 5:32] Alexander Barkov
Hello Ingo, the problem is in this piece code in ha_heap.cc:

   for (; key_part != key_part_end; key_part++, seg++)
    {
      Field *field= key_part->field;

      if (pos->algorithm == HA_KEY_ALG_BTREE)
        seg->type= field->key_type();
      else
      {
        if ((seg->type = field->key_type()) != (int) HA_KEYTYPE_TEXT &&
            seg->type != HA_KEYTYPE_VARTEXT1 &&
            seg->type != HA_KEYTYPE_VARTEXT2 &&
            seg->type != HA_KEYTYPE_VARBINARY1 &&
            seg->type != HA_KEYTYPE_VARBINARY2)
          seg->type= HA_KEYTYPE_BINARY;
      }
      seg->start=   (uint) key_part->offset;
      seg->length=  (uint) key_part->length;
      seg->flag=    key_part->key_part_flag;

      seg->charset= field->charset();

In the case of ENUM (and SET - see below), seg->charset should be set to
&my_charset_bin instead of field->charset().
(most likely for both HA_KEY_ALG_BTREE and HA_KEY_ALG_HASH)

If you change the last line to this:

  if (field->real_type() == MYSQL_TYPE_ENUM)
    set->charset= &my_charset_bin;
  else
    seg->charset= field->charset();

it should fix the problem. I think no extra changes in /heap required.

Also, after fix, make sure SET is covered in the tests against the same problem:

mysql> create table t1 (c1 set('1','2'), unique using btree(c1)) engine=memory default character set utf8;
Query OK, 0 rows affected (1.03 sec)

mysql> insert into t1 values ('1');
Query OK, 1 row affected (0.00 sec)

mysql> insert into t1 values ('2');
ERROR 1062 (23000): Duplicate entry '2' for key 1
[26 Mar 2007 15:02] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/22941

ChangeSet@1.2609, 2007-03-26 17:02:15+02:00, istruewing@chilla.local +3 -0
  Bug#24985 - UTF8 ENUM primary key on MEMORY using BTREE
              causes incorrect duplicate entries
  
  Keys for BTREE indexes on ENUM and SET columns of MEMORY tables
  with character set UTF8 were computed incorrectly. Many
  different column values got the same key value.
  
  Apart of possible performance problems, it made unique indexes
  of this type unusable because it rejected many different
  values as duplicates.
  
  The problem was that multibyte character detection was tried
  on the internal numeric column value. Many values were not
  identified as characters. Their key value became blank filled.
  
  Thanks to Alexander Barkov for the patch, which sets the
  character set of ENUM and SET key segments to the pseudo
  binary character set.
[26 Mar 2007 17:40] Alexander Barkov
The patch http://lists.mysql.com/commits/22941 is ok to push.
[27 Mar 2007 8:50] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/22985

ChangeSet@1.2609, 2007-03-27 10:49:48+02:00, istruewing@chilla.local +3 -0
  Bug#24985 - UTF8 ENUM primary key on MEMORY using BTREE
              causes incorrect duplicate entries
  
  Keys for BTREE indexes on ENUM and SET columns of MEMORY tables
  with character set UTF8 were computed incorrectly. Many
  different column values got the same key value.
  
  Apart of possible performance problems, it made unique indexes
  of this type unusable because it rejected many different
  values as duplicates.
  
  The problem was that multibyte character detection was tried
  on the internal numeric column value. Many values were not
  identified as characters. Their key value became blank filled.
  
  Thanks to Alexander Barkov and Ramil Kalimullin for the patch,
  which sets the character set of ENUM and SET key segments to
  the pseudo binary character set.
[27 Mar 2007 10:41] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/22994

ChangeSet@1.2483, 2007-03-27 12:39:31+02:00, istruewing@chilla.local +1 -0
  Bug#24985 - UTF8 ENUM primary key on MEMORY using BTREE
              causes incorrect duplicate entries
  After merge fix
[30 Mar 2007 9:06] Ingo Strüwing
Queued to 5.1-engines, 5.0-engines, 4.1-engines.
[30 Mar 2007 17:28] Bugs System
Pushed into 5.1.18-beta
[30 Mar 2007 17:30] Bugs System
Pushed into 5.0.40
[30 Mar 2007 17:42] Ingo Strüwing
Pushed to 4.1.23
[30 Mar 2007 18:17] Bugs System
Pushed into 5.0.40
[5 Apr 2007 19:42] Paul DuBois
Noted in 4.1.23, 5.0.40, 5.1.18 changelogs.

Duplicate entries were not assessed correctly in a MEMORY table with
a BTREE primary key on a utf8 ENUM column.