Description:
When the index of `substring_index` is negative, the returned results are inconsistent across different character sets. See:
mysql> set names utf8;
Query OK, 0 rows affected, 1 warning (0.00 sec)
mysql> select substring_index('aaaaaaaaa1','aa', -1);
+----------------------------------------+
| substring_index('aaaaaaaaa1','aa', -1) |
+----------------------------------------+
| a1 |
+----------------------------------------+
mysql> set names latin1;
Query OK, 0 rows affected (0.00 sec)
mysql> select substring_index('aaaaaaaaa1','aa', -1);
+----------------------------------------+
| substring_index('aaaaaaaaa1','aa', -1) |
+----------------------------------------+
| 1 |
+----------------------------------------+
The explanation of the `substring_index` function on the MySQL official documentation is as follows:
Returns the substring from string `str` before `count` occurrences of the delimiter `delim`. If `count` is positive, everything to the left of the final delimiter (counting from the left) is returned. If `count` is negative, everything to the right of the final delimiter (counting from the right) is returned. SUBSTRING_INDEX() performs a case-sensitive match when searching for `delim`. (https://dev.mysql.com/doc/refman/8.4/en/string-functions.html#function_substring-index)
This document does not explicitly clarify whether the `SUBSTRING_INDEX` function splits strings from left to right or right to left based on the delimiter. However, after reviewing its implementation code, we discovered that:
- For multi-byte encodings, the string is always split from left to right (regardless of whether `index` is positive or negative).
- For single-byte encodings, if `index` is negative, the string is split from right to left.
Consequently, when the delimiter consists of multiple characters, results may be inconsistent between single-byte and multi-byte encoding scenarios.
```
if (use_mb(res->charset())) {
...
else {
if (count_val.is_negative()) {
/*
Negative index, start counting at the end
*/
longlong count_ll = count;
for (offset = res->length(); offset;) {
/*
this call will result in finding the position pointing to one
address space less than where the found substring is located
in res
*/
if ((offset = res->strrstr(*delimiter, offset)) < 0)
return res; // Didn't find, return org string
/*
At this point, we've searched for the substring
the number of times as supplied by the index value
*/
if (++count_ll == 0) {
offset += delimiter_length;
tmp_value.set(*res, offset, res->length() - offset);
break;
}
}
if (count_ll != 0) return res; // Didn't find, return org string
} else {
...
}
}
```
How to repeat:
set names utf8;
select substring_index('aaaaaaaaa1','aa', -1);
set names latin1;
select substring_index('aaaaaaaaa1','aa', -1);
Suggested fix:
We are not certain which computational method is correct, but at the very least, we must ensure consistency in the output results across both scenarios.