MySQL Bugs: #70767: LOCATE() and INSTR() claim to be multibyte aware, but are not

Bug #70767	LOCATE() and INSTR() claim to be multibyte aware, but are not
Submitted:	29 Oct 2013 19:09	Modified:	19 May 2014 23:29
Reporter:	Arthur O'Dwyer	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: Charsets	Severity:	S2 (Serious)
Version:	5.5.31, 5.5.34	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
The docs for LOCATE() and INSTR() claim:

http://dev.mysql.com/doc/refman/5.5/en/string-functions.html#function_locate

> This function is multi-byte safe, and is case-sensitive only
> if at least one argument is a binary string.

However, you can easily observe that they do not (completely) respect collations when looking for one string inside another one. It seems that what's happening is that MySQL looks for a substring which is collation-equal to the target **which has exactly the same length in bytes** as the target. This is only rarely true.

In the example below, we're using a collation in which "Å"="A", but LOCATE("Å","á") returns 1 and LOCATE("A","á") returns 0. Notice that neither of the needles is *bitwise* present in the haystack, so we're definitely doing *something* involving collations, but what we're doing is clearly wrong.

How to repeat:
CREATE TABLE t (needle VARCHAR(10), haystack VARCHAR(10)) COLLATE=utf8_general_ci;
INSERT INTO t VALUES ("A", "a"), ("A", "XaX");
INSERT INTO t VALUES ("A", "á"), ("A", "XáX");
INSERT INTO t VALUES ("Á", "a"), ("Á", "XaX");
INSERT INTO t VALUES ("Å", "á"), ("Å", "XáX");

SELECT needle, haystack,
       needle=haystack as `=`,
       LOCATE(needle,haystack) as `locate`,
       (haystack LIKE CONCAT('%',needle,'%')) as `like`
    FROM t;

+--------+----------+------+--------+------+
| needle | haystack | =    | locate | like |
+--------+----------+------+--------+------+
| A      | a        |    1 |      1 |    1 |
| A      | XaX      |    0 |      2 |    1 |
| A      | á        |    1 |      0 |    1 |
| A      | XáX      |    0 |      0 |    1 |
| Á      | a        |    1 |      0 |    1 |
| Á      | XaX      |    0 |      0 |    1 |
| Å      | á        |    1 |      1 |    1 |
| Å      | XáX      |    0 |      2 |    1 |
+--------+----------+------+--------+------+

Suggested fix:
The immediately obvious bug is in `my_instr_mb`. It repeatedly calls

    if (!cs->coll->strnncoll(cs, (uchar*) b, s_length, 
                                 (uchar*) s, s_length, 0))

i.e. it's incorrectly truncating the haystack `b` to `s_length`. Also, right at the top of the function, it bails out if `b_length` is less than `s_length`; that's not a legitimate optimization in the utf8_general_ci collation.

All these problems *AND MORE* are present if you switch to the even wackier utf8_unicode_ci collation, or to a UCA collation that includes contractions, but I figure utf8_general_ci suffices to demonstrate the issue.

Hello Arthur,

Thank you for the bug report.
Verified as described.

Thanks,
Umesh

Here is another test case for this bug: http://sqlfiddle.com/#!2/9bf21/6
It is also listed as a question on Stack Overflow: http://stackoverflow.com/questions/20923186

http://bugs.mysql.com/bug.php?id=72685 is related but not identical.
(Bug 70767 is about what consitutes a "match"; bug 72685 is about what integer value is returned after a match is found.)

Furthermore, "latin1" seems to be hard-coded:

mysql> select instr('abc sakā def', 'saka') collate utf8_unicode_ci;
ERROR 1253 (42000): COLLATION 'utf8_unicode_ci' is not valid for CHARACTER SET 'latin1'

That error occurs in spite of
mysql> show variables like 'char%';
+--------------------------+-----------------------------------------------+
| Variable_name            | Value                                         |
+--------------------------+-----------------------------------------------+
| character_set_client     | utf8                                          |
| character_set_connection | utf8                                          |
| character_set_database   | utf8                                          |
| character_set_filesystem | binary                                        |
| character_set_results    | utf8                                          |
| character_set_server     | utf8                                          |
| character_set_system     | utf8                                          |
| character_sets_dir       | ...
+--------------------------+-----------------------------------------------+
and everything else I can think of being utf8.

This issue still occurs when using MySQL 5.6 as demonstrated by the following SQL Fiddle.

http://www.sqlfiddle.com/#!9/b4fbf/2

Also tried MySQL 8.0.4 with same results but MariaDB gave different results.  Regardless the documentation below should be updated to explain how diacritical marks are handled to avoid any confusion.

https://dev.mysql.com/doc/refman/5.7/en/string-functions.html#function_instr

This issue still occurs when using MySQL 8.0 and was raised over 6 years ago.  If you aren't going to fix then at least the documentation below should be updated to explain collations aren't respected to avoid any confusion.

https://dev.mysql.com/doc/refman/8.0/en/string-functions.html#function_instr