Bug #27782 MYSQL SOUNDEX collation, utf8_hungarian_ci shows false positive
Submitted: 12 Apr 2007 12:43 Modified: 4 Sep 2007 16:54
Reporter: Laszlo KAROLYI Email Updates:
Status: Duplicate Impact on me:
None 
Category:MySQL Server: Charsets Severity:S3 (Non-critical)
Version:5.0.26, 5.0.33, 5.0.37 (latest) too OS:Linux (gentoo and debian)
Assigned to: Assigned Account CPU Architecture:Any
Tags: soundex, utf8_hungarian_ci

[12 Apr 2007 12:43] Laszlo KAROLYI
Description:
Hello.

I have tested it on two machines: the soundex algorithm and queries with 'SOUNDS LIKE' are reporting false records.

One workaround is to SUBSTR() from the second character, because the first character is something weird in my cases, and as i could reproduce, this caracter causes the false positives.

I code a forum, which uses the soundex algorithm to recognize that a user is inserting one comment in many topic. (the same forum mentioned in bug #26917)

How to repeat:
Create a table, and insert a row in it (following contains hungarian words):

CREATE TABLE `proba` (
  `duma` text collate utf8_hungarian_ci NOT NULL COMMENT 'ize',
  FULLTEXT KEY `duma` (`duma`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_hungarian_ci;

INSERT INTO `proba` (`duma`) VALUES 
('Én úgy vagyok vele, motorszenzor legyen tökéletes. Főleg, hogy elektromos és nem mechanikus, vákumos rendszerről van szó.\r\n\r\narról nem beszélve, hogy a Crome meg a többi ki tudja mit művelt, hogyan hackelte az ECUt. Ha tökéletes lenne, nem lennének updatek, tesztek és szívások...\r\npersze ez csak az én véleményem...\r\nde először tényleg a vezetékezést, csatlakozót nézd meg.');

And then, select from it:

select * from proba where duma sounds like 'Ápr. 6.án elköltöztünk. De a postán át lett irányíttatva az érkező posta ill. az új tulaj is szólna.
Nem hiszem, hogy a kerületben kavarodott volna el, postásunk nagyon megbízható.
Ajánlva adtátok fel?'

As you can see the two comment are not even nearly the same, but the query reports that they are the same.

Suggested fix:
If you use the query like this:

select * from proba where SUBSTR(SOUNDEX(duma), 2) = SUBSTR('Ápr. 6.án elköltöztünk. De a postán át lett irányíttatva az érkező posta ill. az új tulaj is szólna.
Nem hiszem, hogy a kerületben kavarodott volna el, postásunk nagyon megbízható.
Ajánlva adtátok fel?', 2)

The report will be the expected. The first character which the soundex algoritym reports, is the sinner in this case.
[12 Apr 2007 12:45] Laszlo KAROLYI
sorry, the workaround query really looks like this:

select * from proba where SUBSTR(SOUNDEX(duma), 2) = SUBSTR(SOUNDEX('Ápr. 6.án
elköltöztünk. De a postán át lett irányíttatva az érkező posta ill. az
új tulaj is szólna.
Nem hiszem, hogy a kerületben kavarodott volna el, postásunk nagyon
megbízható.
Ajánlva adtátok fel?'), 2)

i forgot the secons SOUNDEX.
[13 Apr 2007 11:34] Sveta Smirnova
Thank you for the report.

I can not repeat it with current sources.

Please try with current 5.0.37 version and if you can repeat provide output of statements SHOW VARIABLES LIKE '%coll%'; and SHOW VARIABLES LIKE '%char%';
[17 Apr 2007 10:37] Laszlo KAROLYI
Hello Sveta,

I have compiled 5.0.37, and it seems like the problem is still not solved.

Here are the settings you asked for:

mysql> SHOW VARIABLES LIKE '%coll%';
+----------------------+---------------------+
| Variable_name        | Value               |
+----------------------+---------------------+
| collation_connection | latin2_hungarian_ci |
| collation_database   | latin2_hungarian_ci |
| collation_server     | latin2_hungarian_ci |
+----------------------+---------------------+
3 rows in set (0.00 sec)

mysql> SHOW VARIABLES LIKE '%char%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | latin2                     |
| character_set_connection | latin2                     |
| character_set_database   | latin2                     |
| character_set_filesystem | binary                     |
| character_set_results    | latin2                     |
| character_set_server     | latin2                     |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.00 sec)

The settings are from the mysql command line, but in the code i use 'SET NAMES utf8'. After that command, the outputs are looking like this:

mysql> SHOW VARIABLES LIKE '%coll%';
+----------------------+---------------------+
| Variable_name        | Value               |
+----------------------+---------------------+
| collation_connection | utf8_general_ci     |
| collation_database   | latin2_hungarian_ci |
| collation_server     | latin2_hungarian_ci |
+----------------------+---------------------+
3 rows in set (0.00 sec)

mysql> SHOW VARIABLES LIKE '%char%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8                       |
| character_set_connection | utf8                       |
| character_set_database   | latin2                     |
| character_set_filesystem | binary                     |
| character_set_results    | utf8                       |
| character_set_server     | latin2                     |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.00 sec)

Regards,
Laszlo
[17 Apr 2007 10:39] Laszlo KAROLYI
Sorry, i forgot to change the database too. Here are the outputs after changing the database:

mysql> use crxforum;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> set names utf8;
Query OK, 0 rows affected (0.43 sec)

mysql> SHOW VARIABLES LIKE '%char%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8                       |
| character_set_connection | utf8                       |
| character_set_database   | utf8                       |
| character_set_filesystem | binary                     |
| character_set_results    | utf8                       |
| character_set_server     | latin2                     |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.00 sec)

mysql> SHOW VARIABLES LIKE '%coll%';
+----------------------+---------------------+
| Variable_name        | Value               |
+----------------------+---------------------+
| collation_connection | utf8_general_ci     |
| collation_database   | utf8_hungarian_ci   |
| collation_server     | latin2_hungarian_ci |
+----------------------+---------------------+
3 rows in set (0.36 sec)

mysql>

--
Laszlo
[17 Apr 2007 20:12] Sveta Smirnova
test case options

Attachment: bug27782-master.opt (application/octet-stream, text), 69 bytes.

[17 Apr 2007 20:12] Sveta Smirnova
test case

Attachment: bug27782.test (application/octet-stream, text), 1.45 KiB.

[17 Apr 2007 20:13] Sveta Smirnova
Thank you for the feedback.

Verified as described on Linux using attached test case. Only community tree is affected.
[4 Sep 2007 16:54] Alexander Barkov
I believe this problem was fixed in 5.0.40 under terms of
the bug#22638 "SOUNDEX broken for international characters".
http://bugs.mysql.com/bug.php?id=22638

I can repeat this problem with 5.0.22,
but it works fine 5.0.46.

I'm closing this bug as a duplicate for #22638