Bug #2379 | Usage of CDAWG's or Suffix Trees in fulltext search | ||
---|---|---|---|
Submitted: | 13 Jan 2004 16:15 | Modified: | 23 Feb 2004 10:44 |
Reporter: | Ricardo Garcia | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Server | Severity: | S4 (Feature request) |
Version: | all | OS: | Any (all) |
Assigned to: | Sergei Golubchik | CPU Architecture: | Any |
[13 Jan 2004 16:15]
Ricardo Garcia
[13 Jan 2004 17:09]
Ricardo Garcia
Here's a nice way to add blazingly-fast full text search index using suffix trees (I just figured it out): 1) When building the index, add a suffix tree (or CDAWG, it's more compact) for each searcheable column in the table. So if you have 10,000,000 records you would have 10,000,000 "small" suffix trees. Or 20,000,000 if you have a summary field. The suffix trees would be of course, stored in variable-length records. You could have an index pointing to the suffix trees. 2) Each node in the tree would have a count number information. To search a word in the table, for all the records implement the suffix-tree search algorithm. To get the count of the given word, just read the counter on the corresponding suffix node. Ta-da! According to the algorithm, the suffix tree search is linear regarding phrase size, not data size. So, let's say that 'M' is the length of the word to be searched, and 'R' the number of records. So searching a word in the database would be O(M*R) complexity: the search time would not depend on the data size, but just on the number of records. Interesting, isn't it?
[14 Jan 2004 0:25]
Sergei Golubchik
Thanks for your ideas and pointers! What I know about CDAWG tells me that it's a read-only structure, or at least, it's very costly to update. Thats why I ruled it out from the very beginning. A drawback of suffix trees is that is tends to be big - much larger than inverted index. I suppose that CDAWG had a comparable size, but at a cost of being non-updetable. These structures are designed to solve different problem - substring search - not what fulltext search in MySQL does. They could be probably added as a new "substring" index - useful for - as you said - searching in DNA sequences. As for your algorithm, it's O(M), while searching in a btree is still O(log M). Actually it's O(log <size_of_data>), but still it grows slower as O(M). And for many-rows datasets btree will be faster. As for "CPU-intensive" - I don't have solid figures yet, but my experience is that the main problem is still disk i/o, not CPU. Of course, I mean boolean search - in natural language mode the most time-consuming task often is calculating relevances. CDAWG or suffix tree cannot help here :(
[23 Feb 2004 10:44]
Sergei Golubchik
discussion is moved out of bugdb