MySQL Bugs: #114418: Flaw in SOUNDEX implementation causing incorrect results

Bug #114418	Flaw in SOUNDEX implementation causing incorrect results
Submitted:	19 Mar 2024 23:20	Modified:	20 Mar 2024 7:19
Reporter:	Eric Winkelmann	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: DML	Severity:	S2 (Serious)
Version:	8.0.36	OS:	Any
Assigned to:		CPU Architecture:	Any
Tags:	FUNCTION, soundex

Description:
The implementation of the SOUNDEX string function uses the code of the first character instead of just storing the first character. If the second non-zero coded letter of the word has the same code as the first letter, the logic then skips that code as a duplicate when it should not. This is resulting in improperly coded words by the built-in SOUNDEX function.

How to repeat:
SELECT SOUNDEX('Joseph')

This command currently returns "J100" whereas the proper soundex value for Joseph is "J210"

This happens because the code for J is 2 and the code for S is 2 (o is skipped). Since the J is preserved when creating the soundex code its coded value should not be saved and compared against the next coded letter during the duplicate check process. "Moseph" returns M210 as it should, but only because the code for M is 5 instead of 2. "Koseph" is incorrectly coded as K100 instead of K210 because the code for K is also 2.

Suggested fix:
In the current branch of the mysql-server codebase on GitHub, change line 2107 of the file sql/item_strfunc.cc from:

last_ch = get_scode(*from);        // Code of the first letter

To:

last_ch = 0;        // Ignore the first letter

Which will result in this issue never happening, as 0 is the code for ignored letters thus the first coded character will be interpreted properly no matter what the first letter of the word is, which is fully in line with the soundex algorithm.

Permalink:
https://github.com/mysql/mysql-server/blob/824e2b4064053f7daf17d7f3f84b7a3ed92e5fb4/sql/it...

Hello Eric Winkelmann,

Thank you for the report and feedback.

regards,
Umesh