Bug #118765 Substring_index with negative index returns inconsistent results across different character sets
Submitted: 1 Aug 2:56 Modified: 1 Aug 5:24
Reporter: chang wanli (OCA) Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: DML Severity:S3 (Non-critical)
Version:8.0, 9.0, 8.4.6, 8.0.43 OS:Any
Assigned to: CPU Architecture:Any
Tags: SUBSTRING_INDEX

[1 Aug 2:56] chang wanli
Description:
When the index of `substring_index` is negative, the returned results are inconsistent across different character sets.​ See:

mysql> set names utf8;
Query OK, 0 rows affected, 1 warning (0.00 sec)

mysql> select substring_index('aaaaaaaaa1','aa', -1);
+----------------------------------------+
| substring_index('aaaaaaaaa1','aa', -1) |
+----------------------------------------+
| a1                                     |
+----------------------------------------+

mysql> set names latin1;
Query OK, 0 rows affected (0.00 sec)

mysql> select substring_index('aaaaaaaaa1','aa', -1);
+----------------------------------------+
| substring_index('aaaaaaaaa1','aa', -1) |
+----------------------------------------+
| 1                                      |
+----------------------------------------+

The explanation of the `substring_index` function on the MySQL official documentation is as follows:​
Returns the substring from string `str` before `count` occurrences of the delimiter `delim`. If `count` is positive, everything to the left of the final delimiter (counting from the left) is returned. If `count` is negative, everything to the right of the final delimiter (counting from the right) is returned. SUBSTRING_INDEX() performs a case-sensitive match when searching for `delim`. (https://dev.mysql.com/doc/refman/8.4/en/string-functions.html#function_substring-index)

This document does not explicitly clarify whether the `SUBSTRING_INDEX` function splits strings from left to right or right to left based on the delimiter. However, after reviewing its implementation code, we discovered that:​
- ​​For multi-byte encodings​​, the string is ​always split from left to right​ (regardless of whether `index` is positive or negative).
- ​​For single-byte encodings​​, if `index` is ​​negative​​, the string is split ​​from right to left​​.

​Consequently, when the delimiter consists of multiple characters, results may be inconsistent between single-byte and multi-byte encoding scenarios.​

```
  if (use_mb(res->charset())) {
     ...
  else {
     if (count_val.is_negative()) {
        /*
          Negative index, start counting at the end
        */
        longlong count_ll = count;
        for (offset = res->length(); offset;) {
           /*
             this call will result in finding the position pointing to one
             address space less than where the found substring is located
             in res
           */
           if ((offset = res->strrstr(*delimiter, offset)) < 0)
              return res;  // Didn't find, return org string
              /*
                At this point, we've searched for the substring
                the number of times as supplied by the index value
              */
           if (++count_ll == 0) {
              offset += delimiter_length;
              tmp_value.set(*res, offset, res->length() - offset);
              break;
           }
        }
        if (count_ll != 0) return res;  // Didn't find, return org string
     } else {
        ...
     }
  }
```

How to repeat:
set names utf8;
select substring_index('aaaaaaaaa1','aa', -1);
set names latin1;
select substring_index('aaaaaaaaa1','aa', -1);

Suggested fix:
We are not certain which computational method is correct, but at the very least, we must ensure consistency in the output results across both scenarios.​
[1 Aug 5:24] MySQL Verification Team
Hello chang wanli,

Thank you for the report and test case.

regards,
Umesh