Bug #39298 Prefix key search returns wrong result for multi-level collations
Submitted: 8 Sep 2008 6:42 Modified: 2 Mar 2009 12:50
Reporter: Alexander Barkov Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: Optimizer Severity:S3 (Non-critical)
Version:4.0, 4.1, 5.0, 5.1, 6.0 OS:Any
Assigned to: CPU Architecture:Any

[8 Sep 2008 6:42] Alexander Barkov
Description:
Optimizer filters out good values when using prefix keys
for WHERE optimization with multi-level collations.

How to repeat:
DROP TABLE IF EXISTS t1;
CREATE TABLE t1 (a varchar(10) character set cp1250 collate cp1250_czech_cs default NULL, KEY a (a(2)));
INSERT INTO t1 VALUES ('aaa'),('aab'),('aaa'),('aaa'),('aaa'),('aaa'),('aaa'),('aaa'),('aaa'),('aaa');
INSERT INTO t1 VALUES ('aaa'),('aab'),('aaa'),('aaa'),('aaa'),('aaa'),('aaa'),('aaa'),('aaa'),('aaa');
INSERT INTO t1 VALUES ('aaa'),('aaa'),('aaa'),('aaa'),('aaa'),('aaa'),('aaa'),('aaa'),('aaa'),('aaa');
INSERT INTO t1 VALUES ('nnn'),('nnN'),('nna'),('nmn'),('nMl'),('nMn'),('nmL');
SELECT * FROM t1 IGNORE KEY(a) WHERE a>'nmL';
SELECT * FROM t1 WHERE a>'nmL';

The two SELECT statements return different row count:

mysql> SELECT * FROM t1 IGNORE KEY(a) WHERE a>'nmL'; // correct result
+------+
| a    |
+------+
| nnn  | 
| nnN  | 
| nna  | 
| nmn  | 
| nMn  | 
+------+
5 rows in set (0.00 sec)

mysql> SELECT * FROM t1 WHERE a>'nmL';  // wrong result
+------+
| a    |
+------+
| nmn  | 
| nnn  | 
| nnN  | 
| nna  | 
+------+
4 rows in set (0.00 sec)

Suggested fix:
Explanation:
===========

The above happens because in multi-level case sensitive
collations the kind of the third letter ('n' vs 'l')
has more importance that the case of the second letter
('M' vs 'm').

When we create a prefix index a(2), the order of key values
in the index is different comparing to the order of the 
actual values:

-- Compare three letters - actual values
mysql> select strcmp(_cp1250 'nMn' collate cp1250_czech_cs, 'nmL');
+------------------------------------------------------+
| strcmp(_cp1250 'nMn' collate cp1250_czech_cs, 'nmL') |
+------------------------------------------------------+
|                                                    1 | 
+------------------------------------------------------+
1 row in set (0.00 sec)

-- Compare two letters - prefix key values
mysql> select strcmp(_cp1250 'nM' collate cp1250_czech_cs, 'nm');
+----------------------------------------------------+
| strcmp(_cp1250 'nM' collate cp1250_czech_cs, 'nm') |
+----------------------------------------------------+
|                                                 -1 | 
+----------------------------------------------------+
1 row in set (0.00 sec)

When processing a>'nmL', optimizer
cuts the searched string to the length of the prefix index (2)
and so it searches all keys greater than 'nm'.
Therefore it skips the key 'nM' which is smaller than 'nm',
and as a result it doesn't find the value 'nMn'.

Suggested fix:
==============
Optimizer should search keys starting from the smallest
possible key.  In the above example the smallest possible
key is 'NM' instead of 'nm'.

The following query simulates this approach and proves the idea:

SELECT * FROM t1  WHERE a>'NM' AND LEFT(a,3) >'nmL';

It does use the index (see EXPLAIN SELECT output).
It correctly returns all five good values.
[8 Sep 2008 6:55] Alexander Barkov
The same problem happens with another multi-level collation latin2_czech_cs.

DROP TABLE IF EXISTS t1;
CREATE TABLE t1 (a varchar(10) character set latin2 collate latin2_czech_cs default NULL, KEY a (a(2)));
INSERT INTO t1 VALUES ('aaa'),('aab'),('aaa'),('aaa'),('aaa'),('aaa'),('aaa'),('aaa'),('aaa'),('aaa');
INSERT INTO t1 VALUES ('aaa'),('aab'),('aaa'),('aaa'),('aaa'),('aaa'),('aaa'),('aaa'),('aaa'),('aaa');
INSERT INTO t1 VALUES ('aaa'),('aaa'),('aaa'),('aaa'),('aaa'),('aaa'),('aaa'),('aaa'),('aaa'),('aaa');
INSERT INTO t1 VALUES ('nnn'),('nnN'),('nna'),('nmn'),('nMl'),('nMn'),('nmL');
SELECT * FROM t1 WHERE a>'nMl'; -- returns 4 rows
SELECT * FROM t1 IGNORE KEY(a) WHERE a>'nMl'; -- returns 5 rows

Note, the value in WHERE condition to reproduce the failure is
slightly different from the value in cp1250_czech_cs test:
'nMl' vs 'nmL'.

This is because:
cp1250_czech_cs returns upper case first.
latin1_czech_cs returns lower case first.

See here for collation details:
http://www.collation-charts.org/mysql60/mysql604.latin2_czech_cs.html
http://www.collation-charts.org/mysql60/mysql604.cp1250_czech_cs.html
[8 Sep 2008 7:58] Valeriy Kravchuk
Verified with 6.0.7 from bzr:

openxs@suse:/home2/openxs/dbs/6.0> bin/mysql -uroot test
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 1
Server version: 6.0.7-alpha-debug Source distribution

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

mysql> DROP TABLE IF EXISTS t1;
Query OK, 0 rows affected (0.06 sec)

mysql> CREATE TABLE t1 (a varchar(10) character set cp1250 collate cp1250_czech_cs default NULL,
    -> KEY a (a(2)));
Query OK, 0 rows affected (0.02 sec)

mysql> INSERT INTO t1 VALUES
    -> ('aaa'),('aab'),('aaa'),('aaa'),('aaa'),('aaa'),('aaa'),('aaa'),('aaa'),('aaa');
Query OK, 10 rows affected (0.00 sec)
Records: 10  Duplicates: 0  Warnings: 0

mysql> INSERT INTO t1 VALUES
    -> ('aaa'),('aab'),('aaa'),('aaa'),('aaa'),('aaa'),('aaa'),('aaa'),('aaa'),('aaa');
Query OK, 10 rows affected (0.00 sec)
Records: 10  Duplicates: 0  Warnings: 0

mysql> INSERT INTO t1 VALUES
    -> ('aaa'),('aaa'),('aaa'),('aaa'),('aaa'),('aaa'),('aaa'),('aaa'),('aaa'),('aaa');
Query OK, 10 rows affected (0.00 sec)
Records: 10  Duplicates: 0  Warnings: 0

mysql> INSERT INTO t1 VALUES ('nnn'),('nnN'),('nna'),('nmn'),('nMl'),('nMn'),('nmL');
Query OK, 7 rows affected (0.00 sec)
Records: 7  Duplicates: 0  Warnings: 0

mysql> SELECT * FROM t1 IGNORE KEY(a) WHERE a>'nmL';
+------+
| a    |
+------+
| nnn  |
| nnN  |
| nna  |
| nmn  |
| nMn  |
+------+
5 rows in set (0.01 sec)

mysql> SELECT * FROM t1 WHERE a>'nmL';
+------+
| a    |
+------+
| nmn  |
| nnn  |
| nnN  |
| nna  |
+------+
4 rows in set (0.00 sec)
[8 Sep 2008 8:04] Valeriy Kravchuk
5.1.30 and 5.0.70 from bzr are also affected.