Bug #7671 exact phrase on FullText search
Submitted: 5 Jan 2005 0:01 Modified: 31 Jan 2005 19:18
Reporter: Ulterior Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Documentation Severity:S3 (Non-critical)
Version:4.0.3 beta OS:Windows (WinNT)
Assigned to: Paul DuBois CPU Architecture:Any

[5 Jan 2005 0:01] Ulterior
Description:
Problem with Mysql FullText exact phrase search in boolean mode is that it extracts words from given 'exact phrase'.

From manual:

'"some words"' 
Find rows that contain the exact phrase ``some words'' (for example, rows that contain ``some words of wisdom'' but not ``some noise words''). Note that the `"' characters that surround the phrase are operator characters that delimit the phrase. They are not the quotes that surround the search string itself. 

I admit, though, that exact phrase should be at least ft-minwordlen. There's no reference to extract words from the 'exact phrase' as I see it from description, so I report it like a bug.

How to repeat:
It's quite easy. My current ft_minwordlen is set to 4

CREATE TABLE `t1` (
  `ft_col` text NOT NULL,
  FULLTEXT KEY (`ft_col`)
);

INSERT INTO t1 VALUES ('t12.ace');
INSERT INTO t1 VALUES ('t123.ace');
INSERT INTO t1 VALUES ('t123.gif');

> select count(*) from t1 where match(ft_col) against ( '"t12.ace"' in boolean mode )
No results. Thats becouse extracted t12 and ace are shorter than minwordlen

> select count(*) from t1 where match(ft_col) against ( '"t123.ace"' in boolean mode )
1 row. t123 is 4 chars length as my minwordlen

Howerer

> select count(*) from t1 where match(ft_col) against ( '"t12.ace1"' in boolean mode )
1 row. Now thats becouse ace1 is 4 chars length

It appers to me, that mysql fulltext handler extracts "t123 and ace" separately, then does boolean search on table and then string-compares column result with givven exact phrase. How bizzare :)

Suggested fix:
I would prefer knowing that exact phrase is exact and it should be used so - without further processing of it. If it's not possible to accomplish with current fulltext engine it should be stated in that way
[5 Jan 2005 4:45] MySQL Verification Team
Your server version is very older, please look the below output:

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 1 to server version: 4.0.23-nt

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

mysql> select count(*) from t1 where match(ft_col) against ( '"t12.ace"' in bool
ean
    -> mode );
+----------+
| count(*) |
+----------+
|        0 |
+----------+
1 row in set (0.02 sec)

mysql> select count(*) from t1 where match(ft_col) against ( '"t123.ace"' in boo
lean
    -> mode );
+----------+
| count(*) |
+----------+
|        1 |
+----------+
1 row in set (0.04 sec)

mysql> select count(*) from t1 where match(ft_col) against ( '"t12.ace1"' in boo
lean
    -> mode );
+----------+
| count(*) |
+----------+
|        0 |
+----------+
1 row in set (0.04 sec)

mysql>
mysql> select * from t1 where match(ft_col) against ( '"t12.ace"' in boolean
    -> mode );
Empty set (0.00 sec)

mysql> select * from t1 where match(ft_col) against ( '"t123.ace"' in boolean
    -> mode );
+----------+
| ft_col   |
+----------+
| t123.ace |
+----------+
1 row in set (0.02 sec)

mysql> select * from t1 where match(ft_col) against ( '"t12.ace1"' in boolean
    -> mode );
Empty set (0.01 sec)

mysql>
[5 Jan 2005 10:13] Ulterior
> mysql> select count(*) from t1 where match(ft_col) against ( '"t12.ace"' in boolean -> mode );

+----------+
| count(*) |
+----------+
|        0 |
+----------+

It should be '1' - exact phrase "t12.ace" is more than 4 chars in length. I must repeat myself and say that - 

'In fulltext search using exact phrase no processing of 'exact phrase' should be taken and used as 'it is' when searching. In the current situation mysql parser 'exact phrase' with dot as delimeter, than uses fts on table with 2 words ( and it should not be so ) and then string compares result with exact phrase.

Please read again my submision comments on every mysql select I made.
[5 Jan 2005 11:11] Hartmut Holzgraefe
Thank you for taking the time to write to us, but this is not
a bug. Please double-check the documentation available at
http://www.mysql.com/documentation/ and the instructions on
how to report a bug at http://bugs.mysql.com/how-to-report.php

Additional info:

This is expected behaviour documented on the manual page quoted below,
a fulltext index is always based on and queried against 'words', not general
phrases. As the '.' is not a 'word character' it is treated as a word boundary.

see http://dev.mysql.com/doc/mysql/en/Fulltext_Search.html :

MySQL uses a very simple parser to split text into words. A ``word'' is any sequence of
true word characters (letters, digits, and underscores), optionally separated by no more
than one sequential `'' character. For example, wasn't is parsed as a single word, but 
wasn''t is parsed as two words wasn and t. (And then t would be ignored as too short according to the rules following.) Also, single quotes at the ends of words are stripped; 
only embedded single quotes are retained.
[5 Jan 2005 11:23] Ulterior
I am sorry, but I'll have to open that bug again :)

Did you noticed that in my bug-post I've mentioned 'EXACT PHRASE', that means

I used 
DOUBLE-QUOTES ( \" )
on fulltext search to match EXACT PHRASE. You missunderstood me in that way. I know how mysql behave using AGAINST ( 'some word' ). I point out situation when AGAINST ( '\"some word\"' ) is used or in other words EXACT PHRASE

Please re-read my whole posting again and carefully. Thank you
[5 Jan 2005 20:25] Sergei Golubchik
You are right - this is exactly as phrase search works (split the phrase into words, search for them in the index, substring search in the found records).
So to have it working you need the phrase to contain at least one word that is indexed (longer than ft_min_word_len, and not stopword).

It looks like the manual is not clear on this, we'll fix it.
[5 Jan 2005 21:28] Ulterior
I think I'm getting to understand such fts behavior. You cannot perform exact phrase search not splitting it to words, becouse index is build in that way. Well, if it can't be fixed, will have to live with it.
[7 Jan 2005 21:08] Brian Aker
Hi!

One suggestion that I might give (though I don't know if it will completely work for you) is to 
write a UDF to validate the entire phrase. Have the full text index return values, but before 
sending them to the client have a UDF parse them for accuracy and use a HAVING to remove bad 
results.
[31 Jan 2005 19:18] Paul DuBois
Thank you for your bug report. This issue has been addressed in the
documentation. The updated documentation will appear on our website
shortly, and will be included in the next release of the relevant
product(s).