Bug #18359 erratic results with LIKE and composite index on utf8 text column
Submitted: 20 Mar 2006 16:30 Modified: 13 Jul 2006 20:15
Reporter: Pav Lucistnik Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Optimizer Severity:S3 (Non-critical)
Version:4.1.18/5.0BK/5.1.BK OS:FreeBSD (FreeBSD/Linux Suse)
Assigned to: Igor Babaev

[20 Mar 2006 16:30] Pav Lucistnik
Description:
Combination of

- UTF8 charset
- text column type
- index being a partial index of text column (ie. text(10) )
- LIKE keyword and
- 2-byte character (in UTF8 notation)

gives unexpected results on queries. Take a 10-characters long text "VOLNÝ ADSL" (Ý = Y with acute), which is represented by 11 bytes in UTF8. Insert it into text type field. Create index from first 10 bytes of the column.

Query with WHERE clause: text = "VOLNÝ ADSL" returns expected results
Query with WHERE clause: text LIKE "VOLNÝ ADSL" returns no results
Query with WHERE clause: text LIKE "%VOLNÝ ADSL" returns expected results
Query with WHERE clause: text LIKE "VOLNY ADSL" returns expected results (in collation Ý equals Y)

Now change the index to first 11 bytes and insert longer strings with same prefix, eg. "VOLNÝ ADSL Office". Observe erratic behaviour.

Real world test case attached.

How to repeat:
See http://www.oook.cz/bsd/mysqlbug.txt

(Was too big for bug tracker.)
[20 Mar 2006 16:34] Miguel Solorzano
Test case file

Attachment: mysqlbug.txt (text/plain), 13.37 KiB.

[20 Mar 2006 18:25] Miguel Solorzano
Thank you for the bug report.
[23 Jun 2006 3:17] Igor Babaev
This is another manifestation of the bug #16674.

Note that here we have a wrong result set for a query in 4.1 as well.

The problem can be demonstrated with a simpler example:

+--------------+
| VERSION()    |
+--------------+
| 4.1.21-debug |
+--------------+
1 row in set (0.26 sec)

mysql> SET NAMES latin2;
Query OK, 0 rows affected (0.00 sec)

mysql> CREATE TABLE t1 (
    ->   id int(11) NOT NULL default '0',
    ->   tid int(11) NOT NULL default '0',
    ->   val text NOT NULL,
    ->   INDEX idx(tid, val(10))
    -> ) ENGINE=MyISAM DEFAULT CHARSET=utf8;
Query OK, 0 rows affected (0.00 sec)

mysql> INSERT INTO t1 VALUES
    ->   (40988,72,'VOLNÝ ADSL'),(41009,72,'VOLNÝ ADSL'),
    ->   (41032,72,'VOLNÝ ADSL'),(41038,72,'VOLNÝ ADSL'),
    ->   (41063,72,'VOLNÝ ADSL'),(41537,72,'VOLNÝ ADSL Office'),
    ->   (42141,72,'VOLNÝ ADSL'),(42565,72,'VOLNÝ ADSL Combi'),
    ->   (42749,72,'VOLNÝ ADSL'),(44205,72,'VOLNÝ ADSL');
Query OK, 10 rows affected (0.01 sec)
Records: 10  Duplicates: 0  Warnings: 0

mysql> SELECT * FROM t1 WHERE tid=72 and val LIKE 'VOLNY ADSL';
+-------+-----+------------+
| id    | tid | val        |
+-------+-----+------------+
| 40988 |  72 | VOLNÝ ADSL |
| 41009 |  72 | VOLNÝ ADSL |
| 41032 |  72 | VOLNÝ ADSL |
| 41038 |  72 | VOLNÝ ADSL |
| 41063 |  72 | VOLNÝ ADSL |
| 42141 |  72 | VOLNÝ ADSL |
| 42749 |  72 | VOLNÝ ADSL |
| 44205 |  72 | VOLNÝ ADSL |
+-------+-----+------------+
8 rows in set (0.00 sec)

mysql> SELECT * FROM t1 WHERE tid=72 and val LIKE 'VOLNÝ ADSL';
Empty set (0.00 sec)
[23 Jun 2006 3:40] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/8117
[5 Jul 2006 18:10] Evgeny Potemkin
This was another manifestation of the problems fixed in the
patch for bug 16674.
Wrong calculation of length of the search prefix in the pattern
string led here to a wrong result set for a query in 4.1. 
The bug could be demonstrated for any multi-byte character set. 

Fixed in 4.1.21, 5.0.24, 5.1.12
[13 Jul 2006 20:15] Paul Dubois
Noted in 4.1.21, 5.0.25 (not 5.0.24), 5.1.12 changelogs.