Bug #22638 SOUNDEX broken for international characters
Submitted: 24 Sep 2006 15:39 Modified: 4 Apr 2007 0:43
Reporter: Daniel Eloff
Status: Closed
Category: Server Severity: S3 (Non-critical)
Version: 5.0 BK, 4.1 BK, 5.1 BK OS: Linux (Linux)
Assigned to: Alexander Barkov Target Version:

[24 Sep 2006 15:39] Daniel Eloff
Description:
According to the documentation:

"All non-alphabetic characters in str are ignored. All international alphabetic characters
outside the A-Z range are treated as vowels."

This appears to be true, except for the first letter, which MySQL keeps. Keeping the first
letter is correct, but MySQL does not seem to keep it properly for Unicode input, and I'm
having trouble determining what it is really doing.

How to repeat:
Running SELECT SOUNDEX('阅览随时更新的新闻') with the string encoded as utf8 causes MySQL
to return garbage (� in phpMyAdmin) as the first character. I do not know if MySQL is
really returning �; I suspect phpMyAdmin is substituting it for a character it can't deal
with. From Python, the result is not valid utf8 and the Python library for MySQL chokes.
I do not know how to test utf8 input from the mysql prompt. Would someone please verify
whether this is a bug in MySQL?

Suggested fix:
Just be consistent in always returning the first character whole, no matter what it is.
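
To illustrate the suggested behaviour, here is a minimal Python sketch of a Soundex-style
function that works on whole characters rather than bytes. It is not MySQL's implementation:
the name soundex_sketch and the exact rules (letters outside A-Z get no code, as the quoted
documentation describes for vowels; the result is padded to at least four characters) are
assumptions made for illustration only.

```python
# Standard Soundex consonant groups; every other letter gets no code.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex_sketch(s: str) -> str:
    """Hypothetical character-based Soundex: the first letter is kept whole."""
    # Work on characters (code points), ignoring anything non-alphabetic.
    letters = [c.upper() for c in s if c.isalpha()]
    if not letters:
        return ""
    out = [letters[0]]                 # first character is kept as-is
    prev = CODES.get(letters[0], "")
    for c in letters[1:]:
        code = CODES.get(c, "")        # vowels and non A-Z letters: no code
        if code and code != prev:      # skip adjacent duplicate codes
            out.append(code)
        prev = code
    return "".join(out).ljust(4, "0")  # pad to at least four characters


print(soundex_sketch("阅览随时更新的新闻").encode("utf-8").hex().upper())
```

With this approach the first character survives as a complete multi-byte sequence, so the
result can always be encoded and decoded as utf8.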
[6 Oct 2006 10:27] Sveta Smirnova
Thank you for the report.

Verified as described:

$bin/mysql -e "SELECT HEX('阅览随时更新的新闻'),
HEX(SOUNDEX('阅览随时更新的新闻'));"

HEX('阅览随时更新的新闻')	HEX(SOUNDEX('阅览随时更新的新闻'))
E99885E8A788E99A8FE697B6E69BB4E696B0E79A84E696B0E997BB	E9303030
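
The hex output above can be checked outside the server. The following Python sketch decodes
both values copied from that output and compares the result with the utf8 bytes of the first
input character. The helper name is_valid_utf8 is only illustrative.

```python
# Hex strings copied from the verification output above.
input_hex = "E99885E8A788E99A8FE697B6E69BB4E696B0E79A84E696B0E997BB"
soundex_hex = "E9303030"

text = bytes.fromhex(input_hex).decode("utf-8")
first_char_bytes = text[0].encode("utf-8")  # three bytes for the first character
result = bytes.fromhex(soundex_hex)         # one byte followed by ASCII "000"


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(first_char_bytes.hex().upper())  # bytes of the first input character
print(result[1:].decode("ascii"))      # the part after the first byte
print(is_valid_utf8(result))           # whether the SOUNDEX() result is valid utf8
```

The result begins with E9, which is only the lead byte of the three-byte sequence E99885
that encodes the first character, followed by the ASCII digits "000". A lone lead byte is
not valid utf8, which is consistent with the garbage character and the client-side decoding
failure described in the report.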
[6 Oct 2006 11:21] Sveta Smirnova
Verified on Linux using the latest BK sources. All versions are affected.

OS and version flags are corrected.
[28 Mar 2007 16:01] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/23158
[30 Mar 2007 9:17] Alexander Barkov
Pushed into 5.0.40-rpl
Pushed into 5.1-18-rpl
[1 Apr 2007 1:53] Bugs System
Pushed into 5.0.40
[1 Apr 2007 1:55] Bugs System
Pushed into 5.1.18-beta
[4 Apr 2007 0:43] Paul DuBois
Noted in 5.0.40, 5.1.18 changelogs.

SOUNDEX() returned an invalid string for international characters in
multi-byte character sets.
[4 Sep 2007 18:54] Alexander Barkov
See also:

Bug#27782 MYSQL SOUNDEX collation, utf8_hungarian_ci shows false positive