Description:
The implementation of the SOUNDEX string function uses the code of the first character instead of just storing the first character. If the second non-zero coded letter of the word has the same code as the first letter, the logic then skips that code as a duplicate when it should not. This is resulting in improperly coded words by the built-in SOUNDEX function.
How to repeat:
SELECT SOUNDEX('Joseph')
This command currently returns "J100" whereas the proper soundex value for Joseph is "J210"
This happens because the code for J is 2 and the code for S is 2 (o is skipped). Since the J is preserved when creating the soundex code its coded value should not be saved and compared against the next coded letter during the duplicate check process. "Moseph" returns M210 as it should, but only because the code for M is 5 instead of 2. "Koseph" is incorrectly coded as K100 instead of K210 because the code for K is also 2.
Suggested fix:
In the current branch of the mysql-server codebase on GitHub, change line 2107 of the file sql/item_strfunc.cc from:
last_ch = get_scode(*from); // Code of the first letter
To:
last_ch = 0; // Ignore the first letter
Which will result in this issue never happening, as 0 is the code for ignored letters thus the first coded character will be interpreted properly no matter what the first letter of the word is, which is fully in line with the soundex algorithm.
Permalink:
https://github.com/mysql/mysql-server/blob/824e2b4064053f7daf17d7f3f84b7a3ed92e5fb4/sql/it...