MySQL Bugs: #20447: Problem with prefix keys with contractions and expansions

Bug #20447	Problem with prefix keys with contractions and expansions
Submitted:	14 Jun 2006 5:06	Modified:	24 Nov 2008 15:31
Reporter:	Alexander Barkov	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: Charsets	Severity:	S2 (Serious)
Version:	4.0, 4.1, 5.0, 5.1	OS:	Any (any)
Assigned to:	Assigned Account	CPU Architecture:	Any
Tags:	disabled

Description:
When using a collation having contractions, prefix keys
may not work correctly. For example, in Czech 'ch' is
sorted between 'h' and 'i'.

All UCA based collations with contractions,
as well as cp1251_czech_ci and latin2_czech_ci are affected. 

How to repeat:
in 4.0 start server with --default-character-set=czech
in 4.1 use CHARACTER SET latin2 COLLATE latin2_czech_ci
in the table definition:

DROP TABLE IF EXISTS t1;

CREATE TABLE t1 (
  a char(128) NOT NULL default '',
  KEY a (a(1))
) TYPE=MYISAM;
SHOW CREATE TABLE t1;

INSERT INTO t1 VALUES ('h');
INSERT INTO t1 VALUES ('ch');
INSERT INTO t1 VALUES ('i');

explain select a as prefix_key from t1 where a between 'h' and 'i';
select a as prefix_key from t1 where a between 'h' and 'i';
alter table t1 drop key a;
select a as no_key from t1 where a between 'h' and 'i';
alter table t1 add key (a);
select a as full_key from t1 where a between 'h' and 'i';

The results are:
+-------+-------+---------------+------+---------+------+------+-------------+
| table | type  | possible_keys | key  | key_len | ref  | rows | Extra       |
+-------+-------+---------------+------+---------+------+------+-------------+
| t1    | range | a             | a    |       1 | NULL |    1 | Using where |
+-------+-------+---------------+------+---------+------+------+-------------+
+------------+
| prefix_key |
+------------+
| h          |
| i          |
+------------+
+--------+
| no_key |
+--------+
| h      |
| ch     |
| i      |
+--------+
+----------+
| full_key |
+----------+
| h        |
| ch       |
| i        |
+----------+

I.e. the first select with prefix key does not return 'ch', which is wrong.
The second and the third selects correctly return all three records.

Suggested fix:
Some modifications are required in prefix key handling.
Probably we should only put both characters 'c' and 'h' when
storing 'ch' in a prefix index, and put 'c' alone only when
it is not followed by 'c'.

That means the number in parentheses in a prefix key length
may change its meaning from "N characters" to "N collation elements".

Sorry, there was a typo in the previous post. It should be:

when 'c' is not followed by 'h'.

Thank you for the bug report. Verified as described just changed
CHARACTER SET latin2 COLLATE latin2_czech_ci to
CHARACTER SET latin2 COLLATE latin2_czech_cs

The same problem happens with expansions:

mysql> create table t7 (s1 char(5) character set latin1 collate
latin1_german2_ci) engine=myisam;
Query OK, 0 rows affected (0.01 sec)

mysql> insert into t7 values ('UE'), ('Ü');
Query OK, 2 rows affected (0.00 sec)
Records: 2  Duplicates: 0  Warnings: 0

mysql> select * from t7 where s1 = 'UE';
+------+
| s1   |
+------+
| UE   |
| Ü   |
+------+
2 rows in set (0.00 sec)

mysql> create index it7 on t7 (s1);
Query OK, 2 rows affected (0.01 sec)
Records: 2  Duplicates: 0  Warnings: 0

mysql> select * from t7 where s1 = 'UE';
+------+
| s1   |
+------+
| UE   |
| Ü   |
+------+
2 rows in set (0.00 sec)

mysql> drop index it7 on t7;
Query OK, 2 rows affected (0.01 sec)
Records: 2  Duplicates: 0  Warnings: 0

mysql> create index it7 on t7 (s1(1));
Query OK, 2 rows affected (0.00 sec)
Records: 2  Duplicates: 0  Warnings: 0

mysql> select * from t7 where s1 = 'UE';
+------+
| s1   |
+------+
| UE   |
+------+
1 row in set (0.01 sec)

See also BUG#20353.

Mikael thinks that the only solution to this bug is 
to disable prefix indexes on multi-byte charsets.

He says that anyone with a different opinion is welcome
to come up with a solution. :)

We have discussed this problem with Peter and Serg.
The easiest solution to fix this problem for Unicode character
sets is to use so called "Private Use Area" Unicode code points
to enumerate contractions. Contractions will be represented
as a single code in indexes. This approach helps to avoid
changing of prefix key lengths, and should be easy to implement.

There is similar bug #37820 occured only with InnoDB storage engine.