MySQL Bugs: #4158: MySQL Fulltext Index doesn't Work for asian languages

Bug #4158	MySQL Fulltext Index doesn't Work for asian languages
Submitted:	16 Jun 2004 1:37	Modified:	14 Jun 2013 0:10
Reporter:	Eks Wang
Status:	Verified
Category:	MySQL Server	Severity:	S4 (Feature request)
Version:	4.1.12	OS:	Windows (WINDOWS XP)
Assigned to:	Assigned Account	CPU Architecture:	Any

Description:
When a fulltext index is created on a utf8 column, the index only picks up words that are separated by stop words (as defined by the built-in stop word file). For example, given the word 'ABC' in a sentence like 'XXXXABCXXX', MySQL does not treat the sentence as containing the word 'ABC' unless the two 'X' characters at its boundaries are stop words.

For most Asian languages it is impossible to extract words from a string by looking at word boundaries or stop words, so the current indexing approach will certainly not produce a reasonable index.
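The behaviour described above can be illustrated with a small sketch. It is a simplified model only, assuming a parser that treats any run of word characters as one token; the function names `tokenize` and `matches` are hypothetical and this is not MySQL's actual parser code.

```python
import re

def tokenize(text):
    """Split text into runs of word characters.

    A rough stand-in for a boundary-based full-text parser
    (an assumption for illustration, not MySQL's implementation).
    """
    return re.findall(r"\w+", text)

def matches(text, word):
    """Return True if `word` appears as a whole token in `text`."""
    return word in tokenize(text)

spaced = "今天 天气 很好"    # words separated by spaces
unspaced = "今天天气很好"    # the same sentence as normally written

print(tokenize(spaced))           # three separate tokens
print(tokenize(unspaced))         # one long token
print(matches(spaced, "天气"))    # found as a token
print(matches(unspaced, "天气"))  # not found: buried inside a longer token
```

In this model the unspaced sentence becomes a single token, so a search for a word inside it fails, which mirrors the 'XXXXABCXXX' case in the report.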

How to repeat:
1. Create a table with a text column whose character set is 'utf8'. Insert a reasonably large number of rows of Chinese text into this table, making sure that some rows contain a string like ' ABC ' and some rows contain a string like 'word1ABCword2'.

2. Create a fulltext index on this table.

3. Search for the word 'ABC'.

Suggested fix:
An algorithm that depends on stop words to identify indexable words will certainly fail for Asian languages.

I just found that the stop words are not used either. I noticed that some words are separated by stop words, but the text is still not found by the search.

It's a known deficiency: we don't have a correct algorithm for splitting Chinese (or Japanese) text into words. The workaround is to put non-word characters between words.
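One way to apply this workaround on the application side is to segment the text before inserting it, so that the stored column already has spaces between words. The sketch below uses greedy forward maximum matching against a word list; this is a technique swapped in for illustration, not something the report or MySQL prescribes. The names `segment`, `add_separators`, and the tiny `vocab` set are hypothetical, and a real deployment would need a proper dictionary or segmenter.

```python
def segment(text, vocab, max_len=4):
    """Greedy forward maximum matching.

    At each position take the longest piece found in `vocab`
    (up to `max_len` characters), falling back to a single character.
    """
    pieces = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                pieces.append(piece)
                i += size
                break
    return pieces

def add_separators(text, vocab):
    """Join the segmented pieces with spaces (a non-word character)."""
    return " ".join(segment(text, vocab))

# Hypothetical three-word vocabulary, for illustration only.
vocab = {"今天", "天气", "很好"}
print(add_separators("今天天气很好", vocab))
```

The same segmentation would have to be applied to search terms. Short words may still be skipped by the server's minimum word length setting (`ft_min_word_len`, default 4 for MyISAM fulltext indexes), so that setting would likely need to be lowered and the index rebuilt for two-character words to be indexed.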