MySQL Bugs: #70341: Key lengths lead to incomplete result sets; unhinted scans prefer such indexes

Bug #70341	Key lengths lead to incomplete result sets; unhinted scans prefer such indexes
Submitted:	13 Sep 2013 19:00	Modified:	1 Nov 2013 9:45
Reporter:	Arthur O'Dwyer	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Optimizer	Severity:	S2 (Serious)
Version:	5.5.31, 5.6	OS:	Any
Assigned to:	Jørgen Løland	CPU Architecture:	Any

Description:
"Key length" just seems like a terrible idea in general.

If you create an index on a TEXT or BLOB (or LONGTEXT or LONGBLOB) column, you must specify how many characters of the field are going to be indexed. Table lookups using that index will do the equivalent of strncmp() instead of strcmp().

Using strncmp means that range lookups with excluded endpoints ("const1" < mycol and mycol < "const2") will often spuriously return too few results.

Worse --- and this is the real "bug", not just "misfeature" --- if your SELECT query isn't explicitly hinted, MySQL will often choose such an index in preference to a full table scan, yielding wrong results.

Worser, MySQL will choose such an index even in preference to another *identical* index with a longer key length!

How to repeat:
create table t (id int, data text, index k3(data(3)), index k10(data(10)));
insert into t values (1,"banana1");
insert into t values (2,"banana2");
insert into t values (3,"banana3");
insert into t values (4,"banana4");
insert into t values (5,"banana5");
insert into t values (6,"banana6");
insert into t values (7,"banana7");
insert into t values (8,"banana8");
insert into t values (9,"banana9");

select * from t where "banana3" < data and data < "banana7";             -- incorrectly returns the empty set
select * from t use index() where "banana3" < data and data < "banana7"; -- table scan produces correct results
select * from t use index(k10) where "banana3" < data and data < "banana7"; -- Even choosing the other index would have produced correct results, in this case!

Suggested fix:
Range scans using indexes with a "key length" should correctly strcmp the whole string near the endpoints of the scan (i.e., where strncmp would return 0).

I don't plan to create a patch for this; the ball's in your court.

Thank you for the bug report.

c:\dbs>net start mysqld56
The MySQLD56 service is starting..
The MySQLD56 service was started successfully.

c:\dbs>56

c:\dbs>c:\dbs\5.6\bin\mysql -uroot --port=3560 --prompt="mysql 5.6 > "
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 1
Server version: 5.6.14 Source distribution

Copyright (c) 2000, 2013, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql 5.6 > create database yy;
Query OK, 1 row affected (0.02 sec)

mysql 5.6 > use yy
Database changed
mysql 5.6 > create table t (id int, data text, index k3(data(3)), index k10(data(10)));
Query OK, 0 rows affected (0.64 sec)

mysql 5.6 > insert into t values (1,"banana1");
Query OK, 1 row affected (0.03 sec)

mysql 5.6 > insert into t values (2,"banana2");
Query OK, 1 row affected (0.03 sec)

mysql 5.6 > insert into t values (3,"banana3");
Query OK, 1 row affected (0.39 sec)

mysql 5.6 > insert into t values (4,"banana4");
Query OK, 1 row affected (0.03 sec)

mysql 5.6 > insert into t values (5,"banana5");
Query OK, 1 row affected (0.03 sec)

mysql 5.6 > insert into t values (6,"banana6");
Query OK, 1 row affected (0.03 sec)

mysql 5.6 > insert into t values (7,"banana7");
Query OK, 1 row affected (0.02 sec)

mysql 5.6 > insert into t values (8,"banana8");
Query OK, 1 row affected (0.03 sec)

mysql 5.6 > insert into t values (9,"banana9");
Query OK, 1 row affected (0.05 sec)

mysql 5.6 >
mysql 5.6 > select * from t where "banana3" < data and data < "banana7";             -- incorrectly returns the empty set
Empty set (0.00 sec)

mysql 5.6 > select * from t use index() where "banana3" < data and data < "banana7"; -- table scan produces correct results
+------+---------+
| id   | data    |
+------+---------+
|    4 | banana4 |
|    5 | banana5 |
|    6 | banana6 |
+------+---------+
3 rows in set (0.00 sec)

mysql 5.6 > select * from t use index(k10) where "banana3" < data and data < "banana7"; -- Even choosing the other index would have produced correct results, in this case!
+------+---------+
| id   | data    |
+------+---------+
|    4 | banana4 |
|    5 | banana5 |
|    6 | banana6 |
+------+---------+
3 rows in set (0.00 sec)

mysql 5.6 >

Noted in 5.6.15, 5.7.3 changelogs.

In some cases, range conditions over indexes defined on column
prefixes returned incomplete result sets. (For example, SELECT ...
WHERE 'abcdef1' < col_name AND col_name < 'abcdef9', where the index
on col_name indexed only the first 6 characters.)

@Paul: Does "Noted in x.y.z changelogs" mean that the bug has actually been fixed in version x.y.z?
Same question in re bug number 69841.
Lastly, does the absence of "5.5.z" in this bug's case mean that you're not fixing it for 5.5 at all, or just that the fix is still in progress?

(Hoped-for answers: yes, yes, and in-progress.)

Arthur: Yes, "Noted in x.y.z changelogs" means that the bug has been fixed in these versions. The absence of 5.5.z means that it is not fixed there. There are no plans to fix this in MySQL 5.5.z as far as I know.

5.6$ bzr log -r 5479
------------------------------------------------------------
revno: 5479
committer: Jorgen Loland <jorgen.loland@oracle.com>
branch nick: mysql-5.6
timestamp: Mon 2013-09-30 12:27:07 +0200
message:
  BUG#17458273: KEY LENGTHS LEAD TO INCOMPLETE RESULT SETS; 
                UNHINTED SCANS PREFER SUCH INDEXES
  
  Consider a table with an index over a prefix of column 'x' and
  a query with a range condition over 'x':
  
  CREATE TABLE (... INDEX (x(4)))
  SELECT ... WHERE 'foobar2' < x and x < 'foobar5';
  
  The index will not be able to distinguish between rows with 
  x-values like "foobar", "foobaz" or "foobar3" since only the 
  first four characters ("foob") will be stored in the index.
  
  Since all rows with an x-value prefix "foob" look equal, the
  range optimizer has to translate '<' and '>' to '<=' and '>=',
  respectively, for such indexes.
  
  In this bug, only "'foobar2' < x" is correctly translated. 
  Later, when key_and() is executed on the the two ranges
  "'foob' <= x" and "NULL < x < 'foob'", key_and() correctly
  finds that these ranges together form an impossible condition.
  The fix is to also translate '<' to '<=' for prefix indexes
  so that key_and() can ultimately create the range 
  "'foob' <= x <= 'foob'".