MySQL Bugs: #118765: Substring_index with negative index returns inconsistent results across different character sets

Bug #118765	Substring_index with negative index returns inconsistent results across different character sets
Submitted:	1 Aug 2:56	Modified:	1 Aug 5:24
Reporter:	chang wanli (OCA)	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: DML	Severity:	S3 (Non-critical)
Version:	8.0, 9.0, 8.4.6, 8.0.43	OS:	Any
Assigned to:		CPU Architecture:	Any
Tags:	SUBSTRING_INDEX

Description:
When the index of `substring_index` is negative, the returned results are inconsistent across different character sets. See:

mysql> set names utf8;
Query OK, 0 rows affected, 1 warning (0.00 sec)

mysql> select substring_index('aaaaaaaaa1','aa', -1);
+----------------------------------------+
| substring_index('aaaaaaaaa1','aa', -1) |
+----------------------------------------+
| a1                                     |
+----------------------------------------+

mysql> set names latin1;
Query OK, 0 rows affected (0.00 sec)

mysql> select substring_index('aaaaaaaaa1','aa', -1);
+----------------------------------------+
| substring_index('aaaaaaaaa1','aa', -1) |
+----------------------------------------+
| 1                                      |
+----------------------------------------+

The explanation of the `substring_index` function on the MySQL official documentation is as follows:
Returns the substring from string `str` before `count` occurrences of the delimiter `delim`. If `count` is positive, everything to the left of the final delimiter (counting from the left) is returned. If `count` is negative, everything to the right of the final delimiter (counting from the right) is returned. SUBSTRING_INDEX() performs a case-sensitive match when searching for `delim`. (https://dev.mysql.com/doc/refman/8.4/en/string-functions.html#function_substring-index)

This document does not explicitly clarify whether the `SUBSTRING_INDEX` function splits strings from left to right or right to left based on the delimiter. However, after reviewing its implementation code, we discovered that:
- For multi-byte encodings, the string is always split from left to right (regardless of whether `index` is positive or negative).
- For single-byte encodings, if `index` is negative, the string is split from right to left.

Consequently, when the delimiter consists of multiple characters, results may be inconsistent between single-byte and multi-byte encoding scenarios.

```
  if (use_mb(res->charset())) {
     ...
  else {
     if (count_val.is_negative()) {
        /*
          Negative index, start counting at the end
        */
        longlong count_ll = count;
        for (offset = res->length(); offset;) {
           /*
             this call will result in finding the position pointing to one
             address space less than where the found substring is located
             in res
           */
           if ((offset = res->strrstr(*delimiter, offset)) < 0)
              return res;  // Didn't find, return org string
              /*
                At this point, we've searched for the substring
                the number of times as supplied by the index value
              */
           if (++count_ll == 0) {
              offset += delimiter_length;
              tmp_value.set(*res, offset, res->length() - offset);
              break;
           }
        }
        if (count_ll != 0) return res;  // Didn't find, return org string
     } else {
        ...
     }
  }
```

How to repeat:
set names utf8;
select substring_index('aaaaaaaaa1','aa', -1);
set names latin1;
select substring_index('aaaaaaaaa1','aa', -1);

Suggested fix:
We are not certain which computational method is correct, but at the very least, we must ensure consistency in the output results across both scenarios.

Hello chang wanli,

Thank you for the report and test case.

regards,
Umesh