Bug #11104 substring_index() will see two delimiters as one under certain conditions
Submitted: 5 Jun 2005 11:31 Modified: 8 Aug 2005 20:53
Reporter: Bob Stein (Candidate Quality Contributor) Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server Severity:S3 (Non-critical)
Version:4.1.10, 4.1.12 OS:Windows (Windows, Linux)
Assigned to: Bugs System CPU Architecture:Any

[5 Jun 2005 11:31] Bob Stein
Description:
Conditions for effectively "merging" delimiters:
   - delimiter is multi-character and non-white (e.g. 'the')
   - counting from the right (count < 0)
   - string has two delimiters separated by only 0-1 characters

(This may not be the exact set of conditions but they would fit the observations below.)

How to repeat:
SELECT SUBSTRING_INDEX('the king of the the hill', 'the',-2)

should return " the hill" but returns " king of the the hill"

More examples of breaking and working...

Bad:

" king of the the hill" <-- SELECT SUBSTRING_INDEX('the king of the the hill', 'the',-2)
" king of thethe hill" <-- SELECT SUBSTRING_INDEX('the king of thethe hill',  'the',-2)
" king of the.the hill",  "SELECT SUBSTRING_INDEX('the king of the.the hill', 'the',-2)

Good:

"  the hill" <-- SELECT SUBSTRING_INDEX('the king of the  the hill','the',-2)
"..the hill" <-- SELECT SUBSTRING_INDEX('the king of the..the hill','the',-2)
"orthe hill" <-- SELECT SUBSTRING_INDEX('the king of theorthe hill','the',-2)

Suggested fix:
Gotta be real tricky...
[5 Jun 2005 13:26] Jorge del Conde
Thank you for your bug report.

I was able to verify this bug using 5.0.X from bk:

mysql> SELECT SUBSTRING_INDEX('the king of the  the hill','the',-2)
+-------------------------------------------------------+
| SUBSTRING_INDEX('the king of the  the hill','the',-2) |
+-------------------------------------------------------+
|   the hill                                            |
+-------------------------------------------------------+
1 row in set (0.00 sec)

mysql> SELECT SUBSTRING_INDEX('the king of the the hill',
    -> 'the',-2)
    -> ;
+-------------------------------------------------------+
| SUBSTRING_INDEX('the king of the the hill',
'the',-2) |
+-------------------------------------------------------+
|  king of the the hill                                 |
+-------------------------------------------------------+
1 row in set (0.00 sec)

mysql>
[27 Jul 2005 17:14] Patrick Galbraith
Investigating the 'else' in Item_func_substr_index which handles negative indexes. I suspect String.strstr might be called incorrectly.
[27 Jul 2005 19:20] Patrick Galbraith
In investigating:

Query: 
mysql> select substring_index('aaaa1', 'aa', -2);
+------------------------------------+
| substring_index('aaaa1', 'aa', -2) |
+------------------------------------+
| aaaa1                              |
+------------------------------------+

Item_func_substr_index::val_str:

res->Ptr = "aaaa1"
res->str_length = 5
delimeter->Ptr = "aa"
delimeter->str_length= 2
count= -2
offset= 5 (this means 'aaaa1' pointing to '1')
delimiter_length= 2

call to offset= res->strstr("aa", 5)
offset= 2 (this means pointing to second 'a' in 'aaaa1')
delimeter_length= 2

if(!++count) yields count= -1

offset -= delimiter_length-1 = 2-1= 1 ??? (this means pointing to very first 'a' in 'aaaa1')

Why is this being decremented after strstr has already 'found' the location for us? I'm wondering if we need to be checking/counting how many times we've 'found' the search string, so we know that we _don't_ want to decrement this?
[27 Jul 2005 22:12] Patrick Galbraith
The more I analyse this I see that we decrement by too much - in the case of 
"the kind of the the hill", we start out with an offset of 24. We call strstr, it finds the first reverse substring of "the", returns an offset of 16, which would be pointing at "the king of the* the hill", we then decrement by delimiter length -1 to 14. It's now pointing at "the king of t*he the hill", we then call strstr again, but it cannot find the second 'the' (in reverse) because we've decremented the offset by too much - there's no way to find it with "the king of th", and the only 'the' it can find is at the beginning!
[27 Jul 2005 22:53] Patrick Galbraith
Potential fix results:

mysql> select substring_index('the king of the the hill', 'the', -2);
+--------------------------------------------------------+
| substring_index('the king of the the hill', 'the', -2) |
+--------------------------------------------------------+
|  the hill                                              |
+--------------------------------------------------------+
1 row in set (0.01 sec)

mysql> select substring_index('the king of the.the hill', 'the', -2);
+--------------------------------------------------------+
| substring_index('the king of the.the hill', 'the', -2) |
+--------------------------------------------------------+
| .the hill                                              |
+--------------------------------------------------------+
1 row in set (0.00 sec)

mysql> select substring_index('the king of theorthe hill', 'the', -2);
+---------------------------------------------------------+
| substring_index('the king of theorthe hill', 'the', -2) |
+---------------------------------------------------------+
| orthe hill                                              |
+---------------------------------------------------------+
1 row in set (0.01 sec)

mysql> select substring_index('the king of the  the hill', 'the', -2);
+---------------------------------------------------------+
| substring_index('the king of the  the hill', 'the', -2) |
+---------------------------------------------------------+
|   the hill                                              |
+---------------------------------------------------------+
1 row in set (0.01 sec)

mysql> select substring_index('the king of  the  the hill', 'the', -2);
+----------------------------------------------------------+
| substring_index('the king of  the  the hill', 'the', -2) |
+----------------------------------------------------------+
|   the hill                                               |
+----------------------------------------------------------+
1 row in set (0.01 sec)

mysql> select substring_index('the king of  thethethe hill', 'the', -2);
+-----------------------------------------------------------+
| substring_index('the king of  thethethe hill', 'the', -2) |
+-----------------------------------------------------------+
| the hill                                                  |
+-----------------------------------------------------------+
1 row in set (0.02 sec)

mysql> select substring_index('the king of  thethethe hill', 'the', -3);
+-----------------------------------------------------------+
| substring_index('the king of  thethethe hill', 'the', -3) |
+-----------------------------------------------------------+
| thethe hill                                               |
+-----------------------------------------------------------+
1 row in set (0.01 sec)

mysql> select substring_index('the king of  thethethe hill', 'the', -4);
+-----------------------------------------------------------+
| substring_index('the king of  thethethe hill', 'the', -4) |
+-----------------------------------------------------------+
|  king of  thethethe hill                                  |
+-----------------------------------------------------------+
1 row in set (0.00 sec)

mysql> select substring_index('the king of  thethethe hill', 'the', -1);
+-----------------------------------------------------------+
| substring_index('the king of  thethethe hill', 'the', -1) |
+-----------------------------------------------------------+
|  hill                                                     |
+-----------------------------------------------------------+
1 row in set (0.00 sec)
[28 Jul 2005 1:06] Patrick Galbraith
committed changeset 1.1891
[8 Aug 2005 20:46] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/internals/28034
[8 Aug 2005 20:53] Patrick Galbraith
patch approved and pushed.