Bug #70767 LOCATE() and INSTR() claim to be multibyte aware, but are not
Submitted: 29 Oct 2013 19:09 Modified: 19 May 2014 23:29
Reporter: Arthur O'Dwyer Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: Charsets Severity:S2 (Serious)
Version:5.5.31, 5.5.34 OS:Any
Assigned to: CPU Architecture:Any

[29 Oct 2013 19:09] Arthur O'Dwyer
Description:
The docs for LOCATE() and INSTR() claim:

http://dev.mysql.com/doc/refman/5.5/en/string-functions.html#function_locate

> This function is multi-byte safe, and is case-sensitive only
> if at least one argument is a binary string.

However, you can easily observe that they do not (completely) respect collations when looking for one string inside another one. It seems that what's happening is that MySQL looks for a substring which is collation-equal to the target **which has exactly the same length in bytes** as the target. This is only rarely true.

In the example below, we're using a collation in which "Å"="A", but LOCATE("Å","á") returns 1 and LOCATE("A","á") returns 0. Notice that neither of the needles is *bitwise* present in the haystack, so we're definitely doing *something* involving collations, but what we're doing is clearly wrong.

How to repeat:
CREATE TABLE t (needle VARCHAR(10), haystack VARCHAR(10)) COLLATE=utf8_general_ci;
INSERT INTO t VALUES ("A", "a"), ("A", "XaX");
INSERT INTO t VALUES ("A", "á"), ("A", "XáX");
INSERT INTO t VALUES ("Á", "a"), ("Á", "XaX");
INSERT INTO t VALUES ("Å", "á"), ("Å", "XáX");

SELECT needle, haystack,
       needle=haystack as `=`,
       LOCATE(needle,haystack) as `locate`,
       (haystack LIKE CONCAT('%',needle,'%')) as `like`
    FROM t;

+--------+----------+------+--------+------+
| needle | haystack | =    | locate | like |
+--------+----------+------+--------+------+
| A      | a        |    1 |      1 |    1 |
| A      | XaX      |    0 |      2 |    1 |
| A      | á        |    1 |      0 |    1 |
| A      | XáX      |    0 |      0 |    1 |
| Á      | a        |    1 |      0 |    1 |
| Á      | XaX      |    0 |      0 |    1 |
| Å      | á        |    1 |      1 |    1 |
| Å      | XáX      |    0 |      2 |    1 |
+--------+----------+------+--------+------+

Suggested fix:
The immediately obvious bug is in `my_instr_mb`. It repeatedly calls

    if (!cs->coll->strnncoll(cs, (uchar*) b, s_length, 
                                 (uchar*) s, s_length, 0))

i.e. it's incorrectly truncating the haystack `b` to `s_length`. Also, right at the top of the function, it bails out if `b_length` is less than `s_length`; that's not a legitimate optimization in the utf8_general_ci collation.

All these problems *AND MORE* are present if you switch to the even wackier utf8_unicode_ci collation, or to a UCA collation that includes contractions, but I figure utf8_general_ci suffices to demonstrate the issue.
[30 Oct 2013 8:27] MySQL Verification Team
Hello Arthur,

Thank you for the bug report.
Verified as described.

Thanks,
Umesh
[6 Jan 2014 11:09] Matthijs Langenberg
Here is another test case for this bug: http://sqlfiddle.com/#!2/9bf21/6
It is also listed as a question on Stack Overflow: http://stackoverflow.com/questions/20923186
[19 May 2014 23:29] Arthur O'Dwyer
http://bugs.mysql.com/bug.php?id=72685 is related but not identical.
(Bug 70767 is about what consitutes a "match"; bug 72685 is about what integer value is returned after a match is found.)
[16 Jun 2015 21:36] Rick James
Furthermore, "latin1" seems to be hard-coded:

mysql> select instr('abc sakā def', 'saka') collate utf8_unicode_ci;
ERROR 1253 (42000): COLLATION 'utf8_unicode_ci' is not valid for CHARACTER SET 'latin1'

That error occurs in spite of
mysql> show variables like 'char%';
+--------------------------+-----------------------------------------------+
| Variable_name            | Value                                         |
+--------------------------+-----------------------------------------------+
| character_set_client     | utf8                                          |
| character_set_connection | utf8                                          |
| character_set_database   | utf8                                          |
| character_set_filesystem | binary                                        |
| character_set_results    | utf8                                          |
| character_set_server     | utf8                                          |
| character_set_system     | utf8                                          |
| character_sets_dir       | ...
+--------------------------+-----------------------------------------------+
and everything else I can think of being utf8.
[22 Feb 2018 18:52] Mark Guinness
This issue still occurs when using MySQL 5.6 as demonstrated by the following SQL Fiddle.

http://www.sqlfiddle.com/#!9/b4fbf/2

Also tried MySQL 8.0.4 with same results but MariaDB gave different results.  Regardless the documentation below should be updated to explain how diacritical marks are handled to avoid any confusion.

https://dev.mysql.com/doc/refman/5.7/en/string-functions.html#function_instr
[14 Feb 2020 21:55] Mark Guinness
This issue still occurs when using MySQL 8.0 and was raised over 6 years ago.  If you aren't going to fix then at least the documentation below should be updated to explain collations aren't respected to avoid any confusion.

https://dev.mysql.com/doc/refman/8.0/en/string-functions.html#function_instr