Bug #2095 full-text search for words containing hyphens won't work
Submitted: 11 Dec 2003 5:15 Modified: 11 Dec 2003 9:43
Reporter: [ name withheld ] Email Updates:
Status: Not a Bug Impact on me:
None 
Category:MySQL Server Severity:S3 (Non-critical)
Version:4.0.16 OS:Microsoft Windows (Win2K, Linux)
Assigned to: Sergei Golubchik

[11 Dec 2003 5:15] [ name withheld ]
Description:
Dyfed Lloyd Evans posted the problem already on the MySQL Full-text search docs, but I didn't find any bug report:
Hyphen '-' characters break literals at the moment.
A search for something like "GATA-D22S690" finds
all entries containing GATA and not the full 
hyphenated text. The '-' character is treated as a
word stop even within literals. The same is true if
any of the special text search modifiers are used (eg
+, -, ~) so that hyphenated literals are not correctly
found with full text searches.

This is bothering me especially when searching for short terms like "POL-BN".

How to repeat:
mysql> CREATE TABLE articles (
    ->   id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
    ->   title VARCHAR(200),
    ->   body TEXT,
    ->   FULLTEXT (title,body)
    -> );

mysql> INSERT INTO articles VALUES
    -> (NULL,'MySQL Tutorial', 'DBMS stands for DataBase ...'),
    -> (NULL,'MySQL vs. YourSQL', 'In the following database comparison ...'),
    -> (NULL,'Searching for Hyphenized Words', 'POL-BN: test entry');

mysql> SELECT * FROM articles
    ->          WHERE MATCH (title,body) AGAINST ('POL-BN');
Empty set 0.00 sec

Suggested fix:
Only whitespace characters should break literals.
[11 Dec 2003 7:11] Sergei Golubchik
Thank you for taking the time to write to us, but this is not
a bug. Please double-check the documentation available at
http://www.mysql.com/documentation/ and the instructions on
how to report a bug at http://bugs.mysql.com/how-to-report.php

Additional info:

According to the manual: http://www.mysql.com/doc/en/Fulltext_Search.html

"MySQL uses a very simple parser to split text into words. A ``word'' is any sequence of characters consisting of letters, digits, `'', and `_'."

That is this behavious is not a bug.
[11 Dec 2003 8:44] [ name withheld ]
Ok, you're right.

Do you know a workaround for a full-text search for terms like "POL-BN" ?
[11 Dec 2003 9:43] Sergei Golubchik
Yes. If you can rebuild mysqld, you can the desired behaviour fairy easily. In the myisam/ftdefs.h file there is

#define HYPHEN_IS_DELIM

you need to comment it out.
[2 Aug 2006 14:10] Rob Bach
Rebuilding for me is not a suitable workaround for this problem. Is there any other workaround? Searching for model numbers such as (ms-100) with mysql fulltext search produces undesired results. Is there a simplier solution such as editing a config file or something? Fulltext search is very powerful and easy to implement but this hyphen problem limits it's use for me.

See it in action at: http://www.marblefountains.com , check the model number issue and see if it resolved.
[29 Aug 2006 21:08] Ryan Kaldari
How can this not be a bug? It makes fulltext searching useless. Very few sets of data are guarenteed to not have any words with hyphens in them. Let's see, you can't use it to search a table of names (since many last names are hyphenated); you can't use it seach a table of Movies (X-Men); you can't use it search a table of Newspapers (Clarion-Ledger); Do I really need to go on? Obviously there needs to be an easy workaround to override the default word parsing behavior here.
[15 Sep 2006 19:17] Sergei Golubchik
There is another workaround. It was recently added to the manual:
"
Modify a character set file: This requires no recompilation. The true_word_char()  macro uses a “character type” table to distinguish letters and numbers from other characters. . You can edit the <ctype><map> contents in one of the character set XML files to specify that '-' is a “letter.” Then use the given character set for your FULLTEXT indexes.
"
[6 Dec 2006 15:54] Dale Greer
Please explain this fix. The <ctype><map> of which you speak is very cryptic. Lots of people are talking about this on the web, and I've not seen anyone who knows how to make the edit you're suggesting. Someone else said you shouldn't do that at all, but should make your own parser plugin, as if just anyone would know how to do that.
[10 Aug 2008 9:58] hans paijmans
full text search with hyphens definitly is broken. I changed the charactermap as avised, to allow for hyphens, and then in *mostly* works.

BUT...

in a database with dutch city names, in which all three examples exist:

handler lijst read asciiname=("etten-leur"); // works
handler lijst read asciiname=("'s gravenmoer"); // works
handler lijst read asciiname=("'s-hertogenbosch"); // does *not* work.
handler lijst read asciiname=("'s-gravenhage"); // does *not* work.

but again:
handler lijst read asciiname=("station s'hertogenbosch"); // works !!!

I am at my wits end (which, admittedly, is not very far)

Paai
[4 Nov 2010 14:52] Sheeri Cabral
I think this is worth revisiting, and I put my vote in for an easy way to change whether or not a hyphen is a delimiter/word boundary or a word character.
[4 Nov 2010 16:36] Julien Lamarche
My employer could also use a change in this behavior (or an easier way to change this behavior).  We have one customer upset because they are searching terms with hypens in them ("F-35")  and getting too many results back.