Bug #51370 Weighted Fulltext Query by Document Area Inefficient
Submitted: 22 Feb 2010 0:03 Modified: 22 Feb 2010 6:18
Reporter: Jaimie Sirovich Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: FULLTEXT search Severity:S4 (Feature request)
Version:5.1 OS:Any
Assigned to: CPU Architecture:Any
Tags: benchmarks, fulltext, speed, weight

[22 Feb 2010 0:03] Jaimie Sirovich
Description:
There is currently no (computationally-efficient) way to implement the following:

"Find me all documents on 'blah1 blah2' ...
**with preference/more importance on document title"**

Most propose solutions, but none of the solutions have nice query plans or just plain don't work.  For example:

A) Repeat the words twice, etc.
This doesn't work well, and especially on short documents this actually hurts.  Plus it's a hack by most measures.

B) WHERE MATCH(col1) ON (query) OR|+ MATCH(col2) ON (query)
This blows up right away with even small numbers of rows.  Actually, the second you use + or OR with FTS it basically won't hit the index.  I guess that makes sense.  Similar talk re: Oracle here --

http://www.dba-oracle.com/t_text_index_multi_column_search.htm

"The problem with this above approach was that we have found the performance to be unexpectedly slow when used with even 10,000 documents. In our environment a typical search with the above query with CONTAINS(1) OR CONTAINS(2) would take around 2 - 4 seconds to complete."

It seems like that approach will never be fast for computational reasons.  Oracle introduced 'field sections' to make it happen.

C) Using composite index to find, then the sum of a dual index to sort:
WHERE MATCH(col1_and_col_2_col) ON (query) ... ORDER BY MATCH(col1) ON (query) + MATCH(col2) ON (query)

This one works OK because the restriction is fast, but the sort still isn't cheap.

D) Resorting the results near the top.
This works well, but it's still not ideal, nor does SQL want to do a 'partial sort' without jumping through hoops.

It appears the weight idea can't be implemented without something in the index itself.  I don't see any way to do this currently, even with a custom parser.  There's no place to put the information.

How to repeat:
Attempt any of the above solutions and note that the query plan is poor or that the effects are undesirable (as with repeating the words in the indexed column).

Suggested fix:
I don't think any of these solutions actually 'works.'  I'd propose the following.  When you create a fulltext column INDEX, it can receive a default contribution/weight.  

CREATE FULLTEXT INDEX index_name ON tbl_name (col_1 WEIGHT 2, col_2 WEIGHT 1);

-or-

WHERE MATCH(col_1 WEIGHT 2, col_2 WEIGHT 1) ON (query)

The former could be really intuitive and useful for easy cases.  The 2nd method might allow for more.  It would require the words to have some identifying ID for which column they are from in both cases.
[22 Feb 2010 0:26] Jaimie Sirovich
Thought I should note that the best solutions are the latter 2 (C&D).  AFAIK there is no other way that works well over 100k+ rows.  D can be implemented with a subquery + outer ORDER BY IFF a limit offset is set within range 1..N.

--
Jaimie Sirovich
President
SEO Egghead, Inc.
RELEASED: Professional Search Engine Optimization with PHP & ASP.NET (Wrox Press)
http://www.seoegghead.com/our-seo-book/search-engine-optimization-with-php.seo
http://www.seoegghead.com/our-seo-book/search-engine-optimization-with-asp-net.seo
[22 Feb 2010 5:59] Valeriy Kravchuk
Thank you for the feature request.
[22 Feb 2010 6:18] Jaimie Sirovich
Am I correct in asserting there is no (efficient) way to do this currently?  If not, I will document the better solutions to help others on here a little better.  Right now, it appears the best solution is to use one aggregate column to index, then resort the top N results (factoring in title as a multiplicative factor), where N is a constant like 500.

So long as you don't care about discrepancies in paging by non multiples of N, you're fine, and the results are good.

Smaller databases do fine with the ORDER BY FT_1 * FT_2 solution.

I should add I noticed one other oddity.  A query of the form:

SELECT * FROM table WHERE

FULLTEXT_COND

ORDER BY FULLTEXT_COND

seems to order the data again.  This is not a bug, but it's a missing optimization.  I'll file that as another bug after I confirm it. 

Thanks again,
Jaimie.