Bug #22638 SOUNDEX broken for international characters
Submitted: 24 Sep 2006 13:39 Modified: 3 Apr 2007 22:43
Reporter: Daniel Eloff
Status: Closed
Category: MySQL Server Severity: S3 (Non-critical)
Version: 5.0 BK, 4.1 BK, 5.1 BK OS: Linux (Linux)
Assigned to: Alexander Barkov CPU Architecture: Any

[24 Sep 2006 13:39] Daniel Eloff
According to the documentation:

"All non-alphabetic characters in str are ignored. All international alphabetic characters outside the A-Z range are treated as vowels."

And this seems to be true, except for the first letter, which MySQL keeps. Keeping it is the right behavior, but MySQL does not seem to keep the first letter properly for Unicode input, and I'm having trouble determining what it is really doing.

How to repeat:
Running SELECT SOUNDEX('阅览随时更新的新闻') with the string encoded as utf8 causes MySQL to return garbage (� in phpMyAdmin) as the first character. I do not know whether MySQL is really returning �; I suspect phpMyAdmin is substituting it for a character it can't deal with. From Python, the result is not valid utf8, and the Python library for MySQL chokes on it. I do not know how to test utf8 input from the mysql prompt. Would someone please verify whether this is a bug in MySQL?

Suggested fix:
Just be consistent in always returning the first character, no matter what it is.
[6 Oct 2006 8:27] Sveta Smirnova
Thank you for the report.

Verified as described:

$bin/mysql -e "SELECT HEX('阅览随时更新的新闻'), HEX(SOUNDEX('阅览随时更新的新闻'));"

HEX('阅览随时更新的新闻')	HEX(SOUNDEX('阅览随时更新的新闻'))
E99885E8A788E99A8FE697B6E69BB4E696B0E79A84E696B0E997BB	E9303030
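
The two hex strings above can be checked offline. The following Python sketch is illustrative only and not part of the original verification; it uses nothing but the standard library, and the helper name is_valid_utf8 is made up for this example. It decodes the input, shows that the first character occupies three bytes in utf8, and confirms that the SOUNDEX() result keeps only the first of those bytes, which makes the result an invalid utf8 string:

```python
# Hex strings copied from the verification output above.
input_hex = "E99885E8A788E99A8FE697B6E69BB4E696B0E79A84E696B0E997BB"
result_hex = "E9303030"

text = bytes.fromhex(input_hex).decode("utf-8")
result = bytes.fromhex(result_hex)

# The first character of the input, re-encoded as utf8 (three bytes).
first_char_bytes = text[0].encode("utf-8")


def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if data decodes cleanly as utf8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(first_char_bytes.hex().upper())      # bytes of the first character
print(result[:1] == first_char_bytes[:1])  # result starts with its lead byte only
print(result[1:])                          # the rest is ASCII zero padding
print(is_valid_utf8(result))               # a lone lead byte is not valid utf8
```

This matches the symptoms in the report: a client that expects utf8 receives a truncated multi-byte sequence followed by "000".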
[6 Oct 2006 9:21] Sveta Smirnova
Verified on Linux using the latest BK sources. All versions are affected.

OS and version flags are corrected.
[28 Mar 2007 14:01] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

[30 Mar 2007 7:17] Alexander Barkov
Pushed into 5.0.40-rpl
Pushed into 5.1-18-rpl
[31 Mar 2007 23:53] Bugs System
Pushed into 5.0.40
[31 Mar 2007 23:55] Bugs System
Pushed into 5.1.18-beta
[3 Apr 2007 22:43] Paul DuBois
Noted in 5.0.40, 5.1.18 changelogs.

SOUNDEX() returned an invalid string for international characters in
multi-byte character sets.
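
For illustration, the behavior requested in the report (keep the whole first character, treat letters outside A-Z as vowels) can be sketched in Python. This is a simplified Soundex variant written for this note, not the MySQL implementation or the committed patch; the function name soundex_sketch, the choice to pad to at least four characters without truncating, and the rule that uncoded letters do not separate duplicate codes are all assumptions.

```python
# Standard Soundex digit groups; every other letter is treated as a vowel.
CODES = {}
for group, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                     ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in group:
        CODES[letter] = digit


def soundex_sketch(s: str) -> str:
    """Hypothetical Soundex variant that works on characters, not bytes."""
    letters = [c for c in s if c.isalpha()]  # non-alphabetic characters are ignored
    if not letters:
        return ""
    first = letters[0].upper()
    out = [first]                            # the first character is kept whole
    prev = CODES.get(first, "")
    for c in letters[1:]:
        code = CODES.get(c.upper(), "")      # letters outside A-Z get no code
        if code and code != prev:
            out.append(code)
        if code:
            prev = code
    while len(out) < 4:                      # pad to at least four characters
        out.append("0")
    return "".join(out)


value = soundex_sketch("阅览随时更新的新闻")
print(value.encode("utf-8").hex().upper())
print(soundex_sketch("Robert"))
```

Because the sketch operates on characters rather than bytes, the first character of the Chinese input is carried over as a complete three-byte utf8 sequence, so the encoded result is a valid string.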
[4 Sep 2007 16:54] Alexander Barkov
See also:

Bug#27782 MYSQL SOUNDEX collation, utf8_hungarian_ci shows false positive