Bug #2379 Usage of CDAWG's or Suffix Trees in fulltext search
Submitted: 13 Jan 2004 16:15 Modified: 23 Feb 2004 10:44
Reporter: Ricardo Garcia Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server Severity:S4 (Feature request)
Version:all OS:Any (all)
Assigned to: Sergei Golubchik CPU Architecture:Any

[13 Jan 2004 16:15] Ricardo Garcia
Description:
This is some info for the MySQL developers:
I did a heavy research on data structures and information theory for my Bachelor Thesis (in spanish).

While reading the documentation for the mysql full text search ( http://www.mysql.com/doc/en/Fulltext_Search.html#IDX1653
), I realized that this is a very CPU-intensive application.

If the guys programming MySQL perhaps made some research with suffix trees or CDAWGS (Compact Directed Acyclic Word Graphs), maybe they would boost the search performance WAY UP. These nifty data structures are linear (in both size and build time) with respect of the string size. There have been numerous researchs on these.

My Thesis was about using sliding-window suffix trees for data compression (effective 4:1), but of course suffix trees and CDAWGS can be applied for text search.

The reason is that once a suffix tree is built, search time is linear with respect to the searched string. Suffix Trees and CDAWG's have been used in genetic research, where one has to search for specific genes in extra-long DNA strings.

Here's some info on these:

http://citeseer.nj.nec.com/cs?q=CDAWG&cs=1
http://citeseer.nj.nec.com/cs?cs=1&q=suffix+tree&submit=Documents&co=Expected+Citations&cm...

If you guys are interested, please drop me a comment to rick_g22 [ at ] yahoo [dot] com so I can give you the full references I researched for my project.

Keep up the good work!

How to repeat:
N/A
[13 Jan 2004 17:09] Ricardo Garcia
Here's a nice way to add blazingly-fast full text search index using suffix trees (I just figured it out):

1) When building the index, add a suffix tree (or CDAWG, it's more compact) for each searcheable column in the table. So if you have 10,000,000 records you would have 10,000,000 "small" suffix trees. Or 20,000,000 if you have a summary field.

The suffix trees would be of course, stored in variable-length records. You could have an index pointing to the suffix trees.

2) Each node in the tree would have a count number information.

To search a word in the table, for all the records implement the suffix-tree search algorithm.

To get the count of the given word, just read the counter on the corresponding suffix node. Ta-da!

According to the algorithm, the suffix tree search is linear regarding phrase  size, not data size. So, let's say that 'M' is the length of the word to be searched, and 'R' the number of records.

So searching a word in the database would be O(M*R) complexity: the search time would not depend on the data size, but just on the number of records.

Interesting, isn't it?
[14 Jan 2004 0:25] Sergei Golubchik
Thanks for your ideas and pointers!

What I know about CDAWG tells me that it's a read-only structure, or at least, it's very costly to update. Thats why I ruled it out from the very beginning.

A drawback of suffix trees is that is tends to be big - much larger than inverted index. I suppose that CDAWG had a comparable size, but at a cost of being non-updetable.

These structures are designed to solve different problem - substring search - not what fulltext search in MySQL does. They could be probably added as a new "substring" index - useful for - as you said - searching in DNA sequences.

As for your algorithm, it's O(M), while searching in a btree is still O(log M). Actually it's O(log <size_of_data>), but still it grows slower as O(M). And for many-rows datasets btree will be faster.

As for "CPU-intensive" - I don't have solid figures yet, but my experience is that the main problem is still disk i/o, not CPU.
Of course, I mean boolean search - in natural language mode the most time-consuming task often is calculating relevances. CDAWG or suffix tree cannot help here :(
[23 Feb 2004 10:44] Sergei Golubchik
discussion is moved out of bugdb