Bug #4521 unique key prefix interacts poorly with utf8
Submitted: 12 Jul 2004 17:46 Modified: 30 Oct 2004 9:17
Reporter: Linus Upson Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server Severity:S2 (Serious)
Version:4.1 OS:Linux (RedHat ES 3)
Assigned to: Heikki Tuuri CPU Architecture:Any

[12 Jul 2004 17:46] Linus Upson
Description:
If you have a unique key which indexes only a prefix of a utf8 varchar column, a couple of problems arise:

1. The unique key with a prefix of N appears index 3*N bytes of the column not N characters. I reported this previously to the bugs@lists.mysql.com mailing list.

2. If you have two records which have the same first N characters but differ in the first 3*N bytes, then you can't select the records using an equality test.

These problems do not exist if the entire column is indexed or if the character set is latin1.

How to repeat:
The following SQL demonstrates the problems:

    DROP TABLE IF EXISTS `t`;
    CREATE TABLE `t` (
      `c` varchar(100) NOT NULL default '',
      UNIQUE KEY `c` (`c`(1))
    ) ENGINE=MyISAM DEFAULT CHARSET=utf8;

    insert into t values ('ab');
    select * from t where c='ab';
    +----+
    | c  |
    +----+
    | ab |
    +----+
    1 row in set (0.00 sec)

So far, so good. Now this insert should fail, but it doesn't:

    insert into t values ('abc');
    Query OK, 1 row affected (0.00 sec)

Now the first row disappears:

    select * from t where c='ab';
    Empty set (0.00 sec)

But we can find it with a like where clause:

    select * from t where c like 'ab';
    +----+
    | c  |
    +----+
    | ab |
    +----+
    1 row in set (0.00 sec)

Suggested fix:
1. The unique key prefix should specify N characters, not 3*N bytes for utf8.
2. The statement "select * from t where c='ab';" from above should select the one correct row.
[14 Jul 2004 23:18] Dean Ellis
Verified against 4.1.4 with supplied test case.

EXPLAIN reports impossible WHERE if the index is present.

Thank you for the report.
[12 Aug 2004 15:19] Alexander Barkov
It should be fixed for all handlers:
- MyISAM
- HEAP, HASH INDEX
- HEAP, BTREE INDEX
- INNODB
- BDB
[12 Aug 2004 17:18] Alexander Barkov
This should be fixed for:
CHAR
VARCHAR
TEXT

As internal implementations are different for each type
and for each engine, the test suit should be extended
to prove that every (type,engine) pair works fine.
[12 Aug 2004 17:28] Alexander Barkov
Also, UCS2 should be checked, it is a special case.
[19 Aug 2004 13:15] Alexander Barkov
MyISAM and HEAP tables are now fixed.
[23 Aug 2004 16:44] Alexander Barkov
There is a second part of the same problem with prefix keys
on a multibyte string column:

DROP TABLE IF EXISTS `test`;
CREATE TABLE `test` (
  `str` varchar(255) NOT NULL default '',
  KEY `str` (`str`(2))
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

INSERT INTO `test` VALUES ('str');
INSERT INTO `test` VALUES ('str2'); // at least 2 rows
select * from test where str='str'; // 0 results - wrong
select * from test where str LIKE 'str'; // 1 results
[14 Sep 2004 13:01] Heikki Tuuri
Jan Lindström has a patch for this for InnoDB tables. I still need to review his code.

--Heikki
[30 Oct 2004 9:17] Heikki Tuuri
Fixed in InnoDB in 4.0.6.