Bug #14194 Problem with fulltext boolean search and apostrophe
Submitted: 20 Oct 2005 21:09 Modified: 17 Jan 2006 23:49
Reporter: Jean-Jacques ANDRE Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server Severity:S3 (Non-critical)
Version:4.1.11 OS:Windows (Windows)
Assigned to: Sergey Vojtovich CPU Architecture:Any

[20 Oct 2005 21:09] Jean-Jacques ANDRE
Description:
SELECT titre,texte FROM editorial WHERE MATCH (titre,texte) AGAINST ('étude' IN BOOLEAN MODE)

returns no match whereas 

SELECT titre,texte FROM editorial WHERE MATCH (titre,texte) AGAINST ('l\'étude' IN BOOLEAN MODE)

returns a match. It is a big problem in french, for instance,  where l' m' n' s' are common prefixes to a lot of words like in the following sentence:

" l'ouragan m'asperge d'embruns de l'océan "

You cannot find any of these words in fulltext search unless you look for `l'ouragan` or `d'embruns`... You do not want do that. Really annoying...

How to repeat:
CREATE TABLE `editorial` (
  `title` varchar(255) default NULL,
  `searchtext` text NOT NULL,
  FULLTEXT KEY `titre` (`titre`,`searchtext`)
) TYPE=MyISAM;

-- 
-- Contenu de la table `editorial`
-- 
INSERT INTO `editorial` 
VALUES 
('Des effets de la poésie sur le génie des langues', 
'l''ouragan m''asperge d''embruns de l''océan');

Perform the following query:
SELECT titre FROM editorial WHERE MATCH (titre,searchtext) AGAINST ('ouragan' IN BOOLEAN MODE);

No match.........ggggggrrrr................

Suggested fix:
None
[20 Oct 2005 21:50] MySQL Verification Team
Thank you for taking the time to write to us, but this is not
a bug. Please double-check the documentation available at
http://www.mysql.com/documentation/ and the instructions on
how to report a bug at http://bugs.mysql.com/how-to-report.php

Additional info:

http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html

The MySQL FULLTEXT implementation regards any sequence of true word characters (letters, digits, and underscores) as a word. That sequence may also contain apostrophes ('), but not more than one in a row. This means that aaa'bbb is regarded as one word, but aaa''bbb is regarded as two words. Apostrophes at the beginning or the end of a word are stripped by the FULLTEXT parser; 'aaa'bbb'  would be parsed as aaa'bbb.

So l'ouragan is regarded as one word and not just ouragan.
[21 Oct 2005 6:43] Jean-Jacques ANDRE
I definitely think that it is a BUG from a linguistic standpoint.

Take the sentence : "I smashed Jerry's car" , you will not be able to find "Jerry" in the text unless you KNOW that it followed by the apostrophe. I do not see the point of looking at that as a feature; fulltext search becomes almost useless in languages making heavy use of the apostrophe (Hey I am french!). Every french word that starts with a vowel is a candidate for not being searchable. I found tons of forum messages on the web of puzzled italians, french,etc... Even in english, it is a problem, come on! 

Solution: If people want to search -Jerry's- in the text, they should enclose it in double-quotes as stated in the doc -> "Jerry's". Doesn't that seem logical?
[21 Oct 2005 11:55] Sergei Golubchik
This change cannot be done in 4.1, but we'll consider doing it in 5.0
[22 Oct 2005 10:11] Jean-Jacques ANDRE
Fixing that in 5.0 would be great and a good reason to upgrade to that version for a number of latin languages developers
[13 Jan 2006 17:40] Sergey Vojtovich
This bug cannot be fixed in 5.0, as 5.0 is now stable.
Will be fixed in 5.1.
[13 Jan 2006 19:34] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/1056
[16 Jan 2006 19:37] Sergey Vojtovich
Fixed in 5.1.6.
[16 Jan 2006 22:17] Mike Hillyer
Added to 5.1.6 changelog:

<listitem>
        <para>
          Words with apostrophes are now matched in a FULLTEXT search
          against  non-apostrophe words (i.e. a search for
          <literal>Jerry</literal> will match against the term
          <literal>Jerry's</literal>). (Bug #14194)
        </para>
      </listitem>
[17 Jan 2006 10:51] Jean-Jacques ANDRE
Thanks guys for the 5.1.6 fix! One more reason to upgrade.
I think not only me will appreciate that.
JJ
[17 Jan 2006 10:57] Sergey Vojtovich
Mike, I'm sorry, but I forgot one thing that deserves to be mentioned in changelog.
Upgrade to version, that include this changeset, will most likely result in broken
fulltext index, since parsing rules has been changed. REPAIR TABLE statement should
be issued for tables having fulltext indexes.

jj a: you're always welcome! :)
[17 Jan 2006 23:49] Mike Hillyer
REPAIR TABLE warning added to changelog entry.