MySQL Bugs: #50969: FT min/max has to be overridable per-use (current impl. assumes text is western)

Bug #50969	FT min/max has to be overridable per-use (current impl. assumes text is western)
Submitted:	7 Feb 2010 10:09	Modified:	7 Feb 2010 10:28
Reporter:	Jason Urrich	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: FULLTEXT search	Severity:	S3 (Non-critical)
Version:		OS:	Any
Assigned to:		CPU Architecture:	Any
Tags:	ft_min_word_len

Description:
The value of ft_min_word_len has to be overridable on a per-use basis, as these values reflect the belief that the text consists of "letters", which is not true for languages used by half the world's population. A slight problem =)

The current values are based on the idea that the data consists of alphabetic letters, and that words composed with such letters are meaningless in searches unless they contain enough of them. This makes sense for western languages, where the typical meaningful word is at least three, but often four or more letters long.

However, for any language with compound letters (such as Korean) or ideographs (such as Chinese or Japanese), this assumption is invalid. In these languages, what MySQL sees as a single letter (such as the 'letters' 곈 or 風) is actually a full word already, and any index built with ft_min_word_len higher than 1 will fail to index them.

Essentially, MySQL is not just making assumptions about what textual data should look like, but currently *prescribes* what textual data is allowed to look like - arguably a bug caused by programmer oversight, but at least one that should be easy to fix.

To elaborate on which this is a serious problem, consider organisations and their users:

The lack of a per-creation way to set the index's minimum substring length means that either every user of the database has to live with the fact that they won't be able to use mysql for their data indexing if they happen to have data in a language that doesn't conform to MySQL assumptions about what textual data looks like, or that in order for an organisation to support those users, every user will have fulltext indexes with substrings of 1 letter or longer, which will yield huge index trees with mostly irrelevant data for languages that use an alphabet.

Right now, by relying on a single global value that cannot be modified on a per-use basis, full text search is effectively heavily language biased. This makes it impossible for organisations to offer MySQL database services for clients/users who rely on both English and east Asian data (or any alphabetic and ideographic language combination, really).

A solution is global default values, but per-declared-index overriding of the min/max values.

How to repeat:
there is no syntax for declaring min/max length at the moment, so the description covers it all.

Suggested fix:
My suggested solution to this almostbug is to make the global definitions for ft_min_word_len and ft_max_word_len act only as default values, to be used as a fallback when a user does not explicitly specify what these values should be when they are declaring a full text index

As example:

CREATE TABLE sometable (id INT PRIMARY KEY, data TEXT, FULLTEXT(data))

should map to:

CREATE TABLE sometable (id INT PRIMARY KEY, data TEXT, FULLTEXT(data,ft_min_word_len,ft_max_word_len))

With the availability of this explicit sytanx, a user could also specify

CREATE TABLE sometable (id INT PRIMARY KEY, data TEXT, FULLTEXT(data,1))

which should map to:

CREATE TABLE sometable (id INT PRIMARY KEY, data TEXT, FULLTEXT(data,1,ft_max_word_len))

or they could take full control and specify something like:

CREATE TABLE sometable (id INT PRIMARY KEY, data TEXT, FULLTEXT(data,1,10))

---

This solution should make MySQL unbiased towards what languages consider "letters", as the user will now be able to tell MySQL what "letters" are supposd to mean for the textual data for which an index is declared.

Thank you for the feature request.