Bug #4158 MySQL Fulltext Index doesn't Work for Asian languages
Submitted: 16 Jun 2004 1:37 Modified: 14 Jun 2013 0:10
Reporter: Eks Wang
Status: Verified
Category:MySQL Server Severity:S4 (Feature request)
Version:4.1.12 OS:Windows (WINDOWS XP)
Assigned to: Assigned Account CPU Architecture:Any

[16 Jun 2004 1:37] Eks Wang
When a fulltext index is created on a utf8 column, the index only picks up words that are separated by stop words (as defined by the built-in stop word file). For example, given the word 'ABC' in a sentence like 'XXXXABCXXX', MySQL does not regard the sentence as containing the word 'ABC' unless the two 'X' characters at its boundaries are stop words.

For most Asian languages it is impossible to extract words from a string by looking at word boundaries or stop words, so the current indexing approach will not produce a usable index.
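To illustrate the point above, here is a minimal Python sketch of boundary-based tokenization. It is a simplified model, not MySQL's actual parser: the helper names `boundary_tokens` and `indexed_match` are hypothetical, and the sketch assumes that a term is only searchable when it appears as a whole token between non-word characters.

```python
import re

def boundary_tokens(text):
    """Split text on runs of non-word characters.

    Simplified stand-in for a parser that relies on spaces and
    punctuation to find word boundaries (an assumption, not
    MySQL's real implementation).
    """
    return [t for t in re.split(r"\W+", text) if t]

def indexed_match(text, word):
    # A term is found only if it appears as a complete token.
    return word in boundary_tokens(text)

# 'ABC' surrounded by spaces is a token of its own.
print(indexed_match("XXXX ABC XXX", "ABC"))

# 'ABC' embedded in a longer run is never seen as a separate word.
print(indexed_match("XXXXABCXXX", "ABC"))

# A Chinese sentence without spaces comes out as a single token.
print(boundary_tokens("我喜欢数据库"))
```

Under this model an unbroken run of Chinese text is indexed as one long "word", so a search for any word inside it finds nothing.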

How to repeat:
1. Create a table with a text column whose character set is 'utf8'. Insert a reasonably large number of rows of Chinese text into this table, making sure that some rows contain a string like ' ABC ' and some rows contain a string like 'word1ABCword2'.

2. Create fulltext index on this table.

3. Search for matching word 'ABC'.

Suggested fix:
An algorithm that depends on stop words to identify indexable words will fail for Asian languages.
[16 Jun 2004 2:45] Eks Wang
I just found that the stop words are not used either. I noticed that some words are separated by stop words, but the text is still not found by the search.
[16 Jun 2004 13:24] Sergei Golubchik
It's a known deficiency - we don't have a correct algorithm for splitting Chinese (or Japanese) text into words. The workaround is to put non-word characters between words.