Bug #38676 Full-Text search does not handle apostrophes (before/after a word) as described
Submitted: 8 Aug 2008 17:38 Modified: 8 Aug 2008 19:16
Reporter: Thomas Chin Email Updates:
Status: Not a Bug Impact on me:
None 
Category:MySQL Server: FULLTEXT search Severity:S3 (Non-critical)
Version:mysql-5.0.44-r1 OS:Linux (Gentoo)
Assigned to: CPU Architecture:Any
Tags: apostrophe, fulltext

[8 Aug 2008 17:38] Thomas Chin
Description:
To whom it may concern:

The MySQL Full-Text search does not seem to be handling apostrophes at the beginning and end of a word as described in bug id 14194.  A developer response in the mentioned bug report indicated that apostrophes at the beginning and end of a word are stripped:

--- begin quote ---
[20 Oct 2005 23:50] Miguel Solorzano

Thank you for taking the time to write to us, but this is not
a bug. Please double-check the documentation available at
http://www.mysql.com/documentation/ and the instructions on
how to report a bug at http://bugs.mysql.com/how-to-report.php

Additional info:

http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html

The MySQL FULLTEXT implementation regards any sequence of true word characters (letters,
digits, and underscores) as a word. That sequence may also contain apostrophes ('), but
not more than one in a row. This means that aaa'bbb is regarded as one word, but aaa''bbb
is regarded as two words. Apostrophes at the beginning or the end of a word are stripped
by the FULLTEXT parser; 'aaa'bbb'  would be parsed as aaa'bbb.

So l'ouragan is regarded as one word and not just ouragan.
--- end quote ---

The behavior I am seeing is not consistent with the above description.  Furthermore, it mutates in an inconsistent manner when new data is inserted into the table/index.

How to repeat:
mysql> CREATE TABLE test (txt TEXT);
Query OK, 0 rows affected (0.01 sec)

mysql> CREATE FULLTEXT INDEX txt ON test(txt);
Query OK, 0 rows affected (0.00 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> INSERT INTO test VALUES('\'foo');
Query OK, 1 row affected (0.00 sec)

mysql> INSERT INTO test VALUES('bar\'');
Query OK, 1 row affected (0.00 sec)

mysql> SELECT * FROM test;
+------+
| txt  |
+------+
| 'foo |
| bar' |
+------+
2 rows in set (0.00 sec)

mysql> -- expect 1 row (*** unexpected result ***)
mysql> SELECT txt FROM test WHERE MATCH(txt) AGAINST('foo');
Empty set (0.00 sec)

mysql> -- expect 1 row (*** unexpected result ***)
mysql> SELECT txt FROM test WHERE MATCH(txt) AGAINST('bar');
Empty set (0.00 sec)

mysql> -- expect empty
mysql> SELECT txt FROM test WHERE MATCH(txt) AGAINST('\'foo');
Empty set (0.00 sec)

mysql> -- expect empty
mysql> SELECT txt FROM test WHERE MATCH(txt) AGAINST('bar\'');
Empty set (0.00 sec)

mysql> INSERT INTO test VALUES('\'aaa\'bbb\'');
Query OK, 1 row affected (0.00 sec)

mysql> -- expect 1 row
mysql> SELECT txt FROM test WHERE MATCH(txt) AGAINST('aaa\'bbb');
+-----------+
| txt       |
+-----------+
| 'aaa'bbb' |
+-----------+
1 row in set (0.00 sec)

mysql> -- expect 1 row (*** behavior changes from previous attempt ***)
mysql> SELECT txt FROM test WHERE MATCH(txt) AGAINST('foo');
+------+
| txt  |
+------+
| 'foo |
+------+
1 row in set (0.00 sec)

mysql> -- expect empty (*** unexpected result ***)
mysql> SELECT txt FROM test WHERE MATCH(txt) AGAINST('\'foo');
+------+
| txt  |
+------+
| 'foo |
+------+
1 row in set (0.00 sec)

Suggested fix:
From the description, "apostrophes at the beginning OR the end of the word are stripped", would seem to indicate that they should also be stripped from 'foo and bar'.  The behavior should also be consistent and not require the addition of a row that contains an apostrophe at the beginning, end, and middle of the word.
[8 Aug 2008 18:36] Sergei Golubchik
note that foo and bar weren't found, probably, because of ft_min_word_length, decrease ft_min_word_length or try with longer words
[8 Aug 2008 18:46] Thomas Chin
Sorry about that.  Our machines are configured to have ft_min_word_length=1, so I did previous tests that was able to location foo and bar (without apostrophes) as part of the test process I did not share.  From my example, it does also illustrate how the behavior mutates after a row of a certain type is inserted.

Regardless, same test with foo -> apple and bar -> orange:

---

mysql> CREATE TABLE test (txt TEXT);
Query OK, 0 rows affected (0.06 sec)

mysql> CREATE FULLTEXT INDEX txt ON test(txt);
Query OK, 0 rows affected (0.01 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> INSERT INTO test VALUES('\'apple');
Query OK, 1 row affected (0.00 sec)

mysql> INSERT INTO test VALUES('orange\'');
Query OK, 1 row affected (0.00 sec)

mysql> SELECT * FROM test;
+---------+
| txt     |
+---------+
| 'apple  |
| orange' |
+---------+
2 rows in set (0.00 sec)

mysql> -- expect 1 row (*** unexpected result ***)
mysql> SELECT txt FROM test WHERE MATCH(txt) AGAINST('apple');
Empty set (0.00 sec)

mysql> -- expect 1 row (*** unexpected result ***)
mysql> SELECT txt FROM test WHERE MATCH(txt) AGAINST('orange');
Empty set (0.00 sec)

mysql> -- expect empty
mysql> SELECT txt FROM test WHERE MATCH(txt) AGAINST('\'apple');
Empty set (0.00 sec)

mysql> -- expect empty
mysql> SELECT txt FROM test WHERE MATCH(txt) AGAINST('orange\'');
Empty set (0.00 sec)

mysql> INSERT INTO test VALUES('\'aaa\'bbb\'');
Query OK, 1 row affected (0.00 sec)

mysql> -- expect 1 row
mysql> SELECT txt FROM test WHERE MATCH(txt) AGAINST('aaa\'bbb');
+-----------+
| txt       |
+-----------+
| 'aaa'bbb' |
+-----------+
1 row in set (0.00 sec)

mysql> -- expect 1 row (*** behavior changes from previous attempt ***)
mysql> SELECT txt FROM test WHERE MATCH(txt) AGAINST('apple');
+--------+
| txt    |
+--------+
| 'apple |
+--------+
1 row in set (0.00 sec)

mysql> -- expect empty (*** unexpected result ***)
mysql> SELECT txt FROM test WHERE MATCH(txt) AGAINST('\'apple');
+--------+
| txt    |
+--------+
| 'apple |
+--------+
1 row in set (0.00 sec)
[8 Aug 2008 18:56] Sergei Golubchik
Okay, I see now.

It's not a bug, but how relevance ranking works when you search in natural language mode. Check the appropriate section of the manual for details, but in short - a word gets 0 relevance (and thus, is not returned) if it appears in 50% (or more) of the rows. First you had only two rows, and neither was returned. Then you added a third one, and all the three started to appear in search results.

Possible workarounds are also explained in the manual.
[8 Aug 2008 19:16] Thomas Chin
Thanks guys and sorry about that.  We were conducting some testing of another issue and in trying to create a very simple test case, we encountered this behavior.