Bug #72685 LOCATE/INSTR with mixed collations returns inconsistent results
Submitted: 19 May 2014 20:20 Modified: 20 May 2014 6:58
Reporter: Arthur O'Dwyer Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: Charsets Severity:S2 (Serious)
Version:5.5.31, 5.5.34, 5.6.19, 5.7.5 OS:Any
Assigned to: CPU Architecture:Any

[19 May 2014 20:20] Arthur O'Dwyer
Description:
The semantics of LOCATE() seem to mix up "wide character" and "byte" in inconsistent ways. To wit: the return value of LOCATE() is computed by

    let temp1 = arg3
    skip temp1 characters in string arg2, according to arg2's collation
    let s2 = the remainder of arg2, converted to the common collation of arg1 and arg2
    let temp2 = the number of characters in s2 preceding the match, according to s2's collation
    return (temp1 + temp2)

In other words, we're adding quantities that represent characters as counted in different collations: arg3 characters counted in arg2's collation, plus temp2 characters counted in s2's collation. Depending on the value of arg3, we can make this count come out to just about anything we like.

Example below: LOCATE(x,y,6) = 9, but LOCATE(x,y,7) = 8.

There's a separate bug out for the fact that LOCATE doesn't respect collations when it's actually searching for the match:
http://bugs.mysql.com/bug.php?id=70767
This bug is just about the wacky non-linear behavior of "arg3" in my example below.

How to repeat:
select hex("ó");

    +-----------+
    | hex("ó")  |
    +-----------+
    | C3B3      |
    +-----------+

select locate(0xB3, 'tátátáómo estás', 6);

    +-----------------------------------------+
    | locate(0xB3, 'tátátáómo estás', 6)      |
    +-----------------------------------------+
    |                                       9 |
    +-----------------------------------------+

select locate(0xB3, 'tátátáómo estás', 7);

    +-----------------------------------------+
    | locate(0xB3, 'tátátáómo estás', 7)      |
    +-----------------------------------------+
    |                                       8 |
    +-----------------------------------------+

Suggested fix:
It would make sense to me if LOCATE(x,y,arg3) always converted both x and y to their common collation *first*, then returned a count of characters in that common collation (and treated arg3 as a count of characters in that common collation).
[19 May 2014 22:51] Arthur O'Dwyer
There's another related bug in LOCATE, obvious from reading the code: If the first argument is the empty string, then we'll skip arg3 *characters* from arg2, but then return the number of *bytes* that we skipped. This is a simple one-character typo in Item_func_locate::val_int(). It returns "start + 1" when the programmer meant to return "start0 + 1".

mysql> select locate("","€€€€€€€€€€",11);

    +------------------------------------------------+
    | locate("","€€€€€€€€€€",11)                     |
    +------------------------------------------------+
    |                                             31 |
    +------------------------------------------------+
[20 May 2014 6:58] MySQL Verification Team
Hello Arthur,

Thank you for the bug report and test case.
Verified as described.

Thanks,
Umesh