Bug #1935 Full-text Search doesn't work with HTML-Entities
Submitted: 24 Nov 2003 13:41
Modified: 25 Nov 2003 2:55
Reporter: [ name withheld ]
Status: Not a Bug
Category: MySQL Server
Severity: S3 (Non-critical)
Version: 4.0.15
OS: Linux (linux)
Assigned to:
CPU Architecture: Any

[24 Nov 2003 13:41] [ name withheld ]
Description:
Matching against an expression that includes HTML entities is not possible.
For example:
  MATCH (field) AGAINST ('AAAA&ouml;AAAA' IN BOOLEAN MODE)
will return all rows in which the field includes "ouml" or "AAAA", since the
internal parser splits the expression "AAAA&ouml;AAAA" at "&" and ";".
The only way to avoid this problem is to put double quotes at the beginning and
the end of the word that contains the HTML entity. But even this workaround
will not give the expected result if you are using the asterisk operator, since
  MATCH (field) AGAINST ('"AAAA&ouml;AAAA*"' IN BOOLEAN MODE)
will perform a full-text search for the exact phrase: AAAA&ouml;AAAA*

How to repeat:
see description

Suggested fix:
Split the AGAINST expression only at spaces. Characters like "&" and ";" are often part of a whole word. They should not be treated as separators.

[24 Nov 2003 14:42] Alexander Keremidarski
Thank you for taking the time to write to us, but this is not
a bug. Please double-check the documentation available at
http://www.mysql.com/documentation/ and the instructions on
how to report a bug at http://bugs.mysql.com/how-to-report.php

As the manual explicitly mentions:

"The MATCH() function performs a natural language search..."

The TODO list includes:

Support for "always-index words". They could be any strings the user wants to treat as words, examples are "C++", "AS/400", "TCP/IP", etc.

Make the stopword list depend on the language of the data.

[25 Nov 2003 2:55] Sergei Golubchik
To clarify Alexander's reply a bit - the manual also says:

MySQL uses a very simple parser to split text into words.  A "word" is
any sequence of characters consisting of letters, digits, `'', and `_'.
Any "word" that is present in the stopword list or is just too short
is ignored.
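
The word-splitting rule quoted above can be approximated in a few lines of Python. This is only an illustrative sketch, not MySQL's actual parser: the function name split_words, the regular expression, and the empty default stopword set are assumptions made for this example, and the default minimum length of 4 mirrors the default of the ft_min_word_len server variable. It assumes the reporter's expression contained the entity &ouml;.

```python
import re

# Assumption: a "word" is a run of letters, digits, apostrophes and underscores,
# following the manual text quoted above. \w covers letters, digits and "_".
WORD_RE = re.compile(r"[\w']+")


def split_words(text, min_len=4, stopwords=frozenset()):
    """Split text into words, dropping stopwords and words shorter than min_len.

    Hypothetical helper for illustration; min_len=4 mirrors the default
    ft_min_word_len, and the stopword set is empty unless the caller supplies one.
    """
    words = WORD_RE.findall(text)
    return [w for w in words if len(w) >= min_len and w.lower() not in stopwords]


# "&" and ";" are not word characters, so the entity is cut into pieces.
print(split_words("AAAA&ouml;AAAA"))  # ['AAAA', 'ouml', 'AAAA']
```

Under these assumptions the expression yields three separate words, which matches the behaviour the reporter describes: rows containing either "AAAA" or "ouml" are candidates for a match.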

And I do not agree that ";" is commonly used as part of a word.

Mostly it is not. Even in HTML it is NOT part of a word, but part of an HTML entity. But we do have on the TODO list a smart HTML parser that properly recognizes HTML entities (and tags, btw).
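
To illustrate what an entity-aware splitter would change, here is a rough Python sketch that decodes HTML entities before splitting into words. It is not the planned MySQL parser and says nothing about how that parser would be built; the function name words_entity_aware and the regular expression are assumptions for this example, and html.unescape is the Python standard library call used for decoding. The input assumes the reporter's expression contained the entity &ouml;.

```python
import html
import re

# Assumption: same simple word rule as in the manual (letters, digits, ' and _).
WORD_RE = re.compile(r"[\w']+")


def words_entity_aware(text):
    """Decode HTML entities first, then split the decoded text into words.

    Hypothetical helper for illustration only.
    """
    decoded = html.unescape(text)  # "&ouml;" becomes the single letter "ö"
    return WORD_RE.findall(decoded)


# The entity is now a letter inside the word, so the word stays in one piece.
print(words_entity_aware("AAAA&ouml;AAAA"))  # ['AAAAöAAAA']
```

Entities that decode to non-word characters still act as separators in this sketch: "AT&amp;T" decodes to "AT&T" and splits into "AT" and "T".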