Bug #20447 Problem with prefix keys with contractions and expansions
Submitted: 14 Jun 2006 5:06 Modified: 24 Nov 2008 15:31
Reporter: Alexander Barkov Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: Charsets Severity:S2 (Serious)
Version:4.0, 4.1, 5.0, 5.1 OS:Any (any)
Assigned to: Assigned Account CPU Architecture:Any
Tags: disabled

[14 Jun 2006 5:06] Alexander Barkov
Description:
When using a collation having contractions, prefix keys
may not work correctly. For example, in Czech 'ch' is
sorted between 'h' and 'i'.

All UCA based collations with contractions,
as well as cp1251_czech_ci and latin2_czech_ci are affected. 

How to repeat:
in 4.0 start server with --default-character-set=czech
in 4.1 use CHARACTER SET latin2 COLLATE latin2_czech_ci
in the table definition:

DROP TABLE IF EXISTS t1;

CREATE TABLE t1 (
  a char(128) NOT NULL default '',
  KEY a (a(1))
) TYPE=MYISAM;
SHOW CREATE TABLE t1;

INSERT INTO t1 VALUES ('h');
INSERT INTO t1 VALUES ('ch');
INSERT INTO t1 VALUES ('i');

explain select a as prefix_key from t1 where a between 'h' and 'i';
select a as prefix_key from t1 where a between 'h' and 'i';
alter table t1 drop key a;
select a as no_key from t1 where a between 'h' and 'i';
alter table t1 add key (a);
select a as full_key from t1 where a between 'h' and 'i';

The results are:
+-------+-------+---------------+------+---------+------+------+-------------+
| table | type  | possible_keys | key  | key_len | ref  | rows | Extra       |
+-------+-------+---------------+------+---------+------+------+-------------+
| t1    | range | a             | a    |       1 | NULL |    1 | Using where |
+-------+-------+---------------+------+---------+------+------+-------------+
+------------+
| prefix_key |
+------------+
| h          |
| i          |
+------------+
+--------+
| no_key |
+--------+
| h      |
| ch     |
| i      |
+--------+
+----------+
| full_key |
+----------+
| h        |
| ch       |
| i        |
+----------+

I.e. the first select with prefix key does not return 'ch', which is wrong.
The second and the third selects correctly return all three records.

Suggested fix:
Some modifications are required in prefix key handling.
Probably we should only put both characters 'c' and 'h' when
storing 'ch' in a prefix index, and put 'c' alone only when
it is not followed by 'c'.

That means the number in parentheses in a prefix key length
may change its meaning from "N characters" to "N collation elements".
[14 Jun 2006 5:08] Alexander Barkov
Sorry, there was a typo in the previous post. It should be:

when 'c' is not followed by 'h'.
[15 Jun 2006 6:41] MySQL Verification Team
Thank you for the bug report. Verified as described just changed
CHARACTER SET latin2 COLLATE latin2_czech_ci to
CHARACTER SET latin2 COLLATE latin2_czech_cs
[8 Dec 2006 5:36] Alexander Barkov
The same problem happens with expansions:

mysql> create table t7 (s1 char(5) character set latin1 collate
latin1_german2_ci) engine=myisam;
Query OK, 0 rows affected (0.01 sec)

mysql> insert into t7 values ('UE'), ('Ü');
Query OK, 2 rows affected (0.00 sec)
Records: 2  Duplicates: 0  Warnings: 0

mysql> select * from t7 where s1 = 'UE';
+------+
| s1   |
+------+
| UE   |
| Ü   |
+------+
2 rows in set (0.00 sec)

mysql> create index it7 on t7 (s1);
Query OK, 2 rows affected (0.01 sec)
Records: 2  Duplicates: 0  Warnings: 0

mysql> select * from t7 where s1 = 'UE';
+------+
| s1   |
+------+
| UE   |
| Ü   |
+------+
2 rows in set (0.00 sec)

mysql> drop index it7 on t7;
Query OK, 2 rows affected (0.01 sec)
Records: 2  Duplicates: 0  Warnings: 0

mysql> create index it7 on t7 (s1(1));
Query OK, 2 rows affected (0.00 sec)
Records: 2  Duplicates: 0  Warnings: 0

mysql> select * from t7 where s1 = 'UE';
+------+
| s1   |
+------+
| UE   |
+------+
1 row in set (0.01 sec)
[16 Oct 2007 22:07] Lars Thalmann
See also BUG#20353.
[6 Dec 2007 15:09] Lars Thalmann
Mikael thinks that the only solution to this bug is 
to disable prefix indexes on multi-byte charsets.

He says that anyone with a different opinion is welcome
to come up with a solution. :)
[26 Jan 2008 6:48] Alexander Barkov
We have discussed this problem with Peter and Serg.
The easiest solution to fix this problem for Unicode character
sets is to use so called "Private Use Area" Unicode code points
to enumerate contractions. Contractions will be represented
as a single code in indexes. This approach helps to avoid
changing of prefix key lengths, and should be easy to implement.
[2 Jul 2008 20:59] Sveta Smirnova
There is similar bug #37820 occured only with InnoDB storage engine.