Bug #6074 Problems with Soundex Function
Submitted: 13 Oct 2004 20:17 Modified: 16 Oct 2004 21:34
Reporter: Angela Kunz Email Updates:
Status: Not a Bug Impact on me:
None 
Category:MySQL Server Severity:S2 (Serious)
Version:4.0.18 OS:Windows (Windows 2003 and Windows XP)
Assigned to: CPU Architecture:Any

[13 Oct 2004 20:17] Angela Kunz
Description:
I am havinging an issue with the SOUNDEX() function. I have noticed two things. One it generates more than 4 characters on return for some names and for some names returns an incorrct soundex code. For example Chackes and Czakes produce C000 when they should Produce C220. Same thing for Ehackes returns E200 instead of E220. Also BIRNBAUM returns a 5 digit code when there are only suppose to be 4. Can anyone tell me what to do to refine this better? Or how I can get a correct Soundex function for myself and My ISP. I am using the code for a genealogy site and they are a little picky on this stuff. 

I posted this on your forums site under general and was told to report it.  Here is some additiona information that may help.

I use serveral SQL queries to get results from about 6 tables.  Each table contains at minimum the following fields:

Surname, First Name, Soundex. 

The databases(tables) are send to me from the local genealogy group with the soundex field prefilled.  They use three or four different soundex engines all producing the same results So I know the Soundex field is correct in the databases.  

An Example of a query would be as follows.

Select * from nassuau where `soundex`=soundex("surname");

The only othere piece of information that I have was a note sent to be by a collegue when I explained the issue I was running into even though it still does not explain why more than 4 Characters are being generated:

Anyway, I found this note, and I wonder if this is why it’s returning the problem

SOUNDEX(str)

Note: This function implements the original Soundex algorithm, not the more popular enhanced version. The difference is that original version discards vowels first and then duplicates, whereas the enhanced version discards duplicates first and then vowels.

HELP!!!!!!!

How to repeat:
Create a table with the following fields:

First Name,
Last Name,
Soundex (generated using access or Java)

Then Try to search it using the following SQL query:

Select * from Tablename where `soundex`=Soundex(`Last Name`);

Suggested fix:
Create a new SOUNDEX function and distribute it promply

Please keep me updated on this:

Angela Kunz
Angela B. Kunz LLC
Akunzllc@optonline.net
[13 Oct 2004 21:46] Sergei Golubchik
Thank you for taking the time to write to us, but this is not
a bug. Please double-check the documentation available at
http://www.mysql.com/documentation/ and the instructions on
how to report a bug at http://bugs.mysql.com/how-to-report.php

Additional info:

1. about "it generates more than 4 characters" - as described in the manual (http://dev.mysql.com/doc/mysql/en/String_functions.html) "A standard soundex string is four characters long, but the `SOUNDEX()' function returns an arbitrarily long string. You can use `SUBSTRING()' on the result to get a standard soundex string." So, the workaround is to use SUBSTRING or LEFT:

  LEFT(SOUNDEX("somename"), 4)

2. about "Chackes and Czakes produce C000 when they should Produce C220".

The reason, indeed, is "Note: This function implements the original Soundex algorithm, not the more popular enhanced version. The difference is that original version discards vowels first and then duplicates, whereas the enhanced version discards duplicates first and then vowels."

Substituting numbers you'd get for "Chackes": C..22.2
Original soundex: discard vowels - C222, remove duplicates - C2
Enhanced version: remove duplicates - C.2.2, discard vowels - C22
[13 Oct 2004 21:51] Sergei Golubchik
About the fix - SOUNDEX() function cannot be changed, of course, for compatibility reasons.
But if somebody would provide an patch that implements a, say,  SOUNDEX_NEW (native function or udf) we would accept it, I assume.

http://dev.mysql.com/doc/mysql/en/Adding_functions.html

Alternatively, we could write such a function, but this todo item would get a very low priority :(
[16 Oct 2004 12:49] Angela Kunz
Is Metaphone(), a user created function found on your site,  the enhanced version of Soundex().  If so why has it not been added as the new soundex()?
[16 Oct 2004 21:34] Sergei Golubchik
It's not really an enhanced soundex, in fact t's not a soundex at all.

it's a different function that tries to achieve the same goal - transform a word to a token that depends only on how the word is pronounced, not how it's written down.

metaphon is based on English pronunciation rules, and works for English only.