Bug #27782 MYSQL SOUNDEX collation, utf8_hungarian_ci shows false positive
Submitted: 12 Apr 2007 14:43 Modified: 4 Sep 2007 18:54
Reporter: Laszlo KAROLYI
Status: Duplicate
Category:Server: Charsets Severity:S3 (Non-critical)
Version:5.0.26, 5.0.33, 5.0.37 (latest) too OS:Linux (gentoo and debian)
Assigned to: Alexander Barkov Target Version:
Tags: utf8_hungarian_ci, soundex

[12 Apr 2007 14:43] Laszlo KAROLYI
Description:
Hello.

I have tested it on two machines: the soundex algorithm and queries with 'SOUNDS LIKE'
are reporting false records.

One workaround is to SUBSTR() from the second character, because the first character is
something weird in my cases, and as i could reproduce, this caracter causes the false
positives.

I code a forum, which uses the soundex algorithm to recognize that a user is inserting
one comment in many topic. (the same forum mentioned in bug #26917)

How to repeat:
Create a table, and insert a row in it (following contains hungarian words):

CREATE TABLE `proba` (
  `duma` text collate utf8_hungarian_ci NOT NULL COMMENT 'ize',
  FULLTEXT KEY `duma` (`duma`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_hungarian_ci;

INSERT INTO `proba` (`duma`) VALUES 
('Én úgy vagyok vele, motorszenzor legyen tökéletes. Főleg, hogy elektromos és nem
mechanikus, vákumos rendszerről van szó.\r\n\r\narról nem beszélve, hogy a Crome meg
a többi ki tudja mit művelt, hogyan hackelte az ECUt. Ha tökéletes lenne, nem
lennének updatek, tesztek és szívások...\r\npersze ez csak az én
véleményem...\r\nde először tényleg a vezetékezést, csatlakozót nézd meg.');

And then, select from it:

select * from proba where duma sounds like 'Ápr. 6.án elköltöztünk. De a postán át
lett irányíttatva az érkező posta ill. az új tulaj is szólna.
Nem hiszem, hogy a kerületben kavarodott volna el, postásunk nagyon megbízható.
Ajánlva adtátok fel?'

As you can see the two comment are not even nearly the same, but the query reports that
they are the same.

Suggested fix:
If you use the query like this:

select * from proba where SUBSTR(SOUNDEX(duma), 2) = SUBSTR('Ápr. 6.án elköltöztünk.
De a postán át lett irányíttatva az érkező posta ill. az új tulaj is szólna.
Nem hiszem, hogy a kerületben kavarodott volna el, postásunk nagyon megbízható.
Ajánlva adtátok fel?', 2)

The report will be the expected. The first character which the soundex algoritym reports,
is the sinner in this case.
[12 Apr 2007 14:45] Laszlo KAROLYI
sorry, the workaround query really looks like this:

select * from proba where SUBSTR(SOUNDEX(duma), 2) = SUBSTR(SOUNDEX('Ápr. 6.án
elköltöztünk. De a postán át lett irányíttatva az érkező posta ill. az
új tulaj is szólna.
Nem hiszem, hogy a kerületben kavarodott volna el, postásunk nagyon
megbízható.
Ajánlva adtátok fel?'), 2)

i forgot the secons SOUNDEX.
[13 Apr 2007 13:34] Sveta Smirnova
Thank you for the report.

I can not repeat it with current sources.

Please try with current 5.0.37 version and if you can repeat provide output of statements
SHOW VARIABLES LIKE '%coll%'; and SHOW VARIABLES LIKE '%char%';
[17 Apr 2007 12:37] Laszlo KAROLYI
Hello Sveta,

I have compiled 5.0.37, and it seems like the problem is still not solved.

Here are the settings you asked for:

mysql> SHOW VARIABLES LIKE '%coll%';
+----------------------+---------------------+
| Variable_name        | Value               |
+----------------------+---------------------+
| collation_connection | latin2_hungarian_ci |
| collation_database   | latin2_hungarian_ci |
| collation_server     | latin2_hungarian_ci |
+----------------------+---------------------+
3 rows in set (0.00 sec)

mysql> SHOW VARIABLES LIKE '%char%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | latin2                     |
| character_set_connection | latin2                     |
| character_set_database   | latin2                     |
| character_set_filesystem | binary                     |
| character_set_results    | latin2                     |
| character_set_server     | latin2                     |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.00 sec)

The settings are from the mysql command line, but in the code i use 'SET NAMES utf8'.
After that command, the outputs are looking like this:

mysql> SHOW VARIABLES LIKE '%coll%';
+----------------------+---------------------+
| Variable_name        | Value               |
+----------------------+---------------------+
| collation_connection | utf8_general_ci     |
| collation_database   | latin2_hungarian_ci |
| collation_server     | latin2_hungarian_ci |
+----------------------+---------------------+
3 rows in set (0.00 sec)

mysql> SHOW VARIABLES LIKE '%char%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8                       |
| character_set_connection | utf8                       |
| character_set_database   | latin2                     |
| character_set_filesystem | binary                     |
| character_set_results    | utf8                       |
| character_set_server     | latin2                     |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.00 sec)

Regards,
Laszlo
[17 Apr 2007 12:39] Laszlo KAROLYI
Sorry, i forgot to change the database too. Here are the outputs after changing the
database:

mysql> use crxforum;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> set names utf8;
Query OK, 0 rows affected (0.43 sec)

mysql> SHOW VARIABLES LIKE '%char%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8                       |
| character_set_connection | utf8                       |
| character_set_database   | utf8                       |
| character_set_filesystem | binary                     |
| character_set_results    | utf8                       |
| character_set_server     | latin2                     |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.00 sec)

mysql> SHOW VARIABLES LIKE '%coll%';
+----------------------+---------------------+
| Variable_name        | Value               |
+----------------------+---------------------+
| collation_connection | utf8_general_ci     |
| collation_database   | utf8_hungarian_ci   |
| collation_server     | latin2_hungarian_ci |
+----------------------+---------------------+
3 rows in set (0.36 sec)

mysql>

--
Laszlo
[17 Apr 2007 22:12] Sveta Smirnova
test case options

Attachment: bug27782-master.opt (application/octet-stream, text), 69 bytes.

[17 Apr 2007 22:12] Sveta Smirnova
test case

Attachment: bug27782.test (application/octet-stream, text), 1.45 KiB.

[17 Apr 2007 22:13] Sveta Smirnova
Thank you for the feedback.

Verified as described on Linux using attached test case. Only community tree is affected.
[4 Sep 2007 18:54] Alexander Barkov
I believe this problem was fixed in 5.0.40 under terms of
the bug#22638 "SOUNDEX broken for international characters".
http://bugs.mysql.com/bug.php?id=22638

I can repeat this problem with 5.0.22,
but it works fine 5.0.46.

I'm closing this bug as a duplicate for #22638