MySQL Bugs: #1935: Full-text Search doesn't work with HTML-Entities

Bug #1935	Full-text Search doesn't work with HTML-Entities
Submitted:	24 Nov 2003 13:41	Modified:	25 Nov 2003 2:55
Reporter:	[ name withheld ]
Status:	Not a Bug
Category:	MySQL Server	Severity:	S3 (Non-critical)
Version:	4.0.15	OS:	Linux (linux)
Assigned to:		CPU Architecture:	Any

Description:
Matching against an expression which includes HTML entities is not possible.
For example:
  MATCH (field) AGAINST ('AAAA&ouml;AAAA' IN BOOLEAN MODE)
will return all fields which include "&ouml;" or "AAAA", since the internal
parser splits the expression "AAAA&ouml;AAAA" at "&" and ";".
The only way to avoid this problem is to put double quotes at the beginning and
the end of the word which contains the HTML entity. But even this workaround
will not give the expected result if you are using the asterisk operator, since
  MATCH (field) AGAINST ('"AAAA&ouml;AAAA*"' IN BOOLEAN MODE)
will perform a full-text search for the exact phrase: AAAA&ouml;AAAA*

How to repeat:
see description

Suggested fix:
Split the AGAINST expression only at spaces. Characters like "&" and ";" are often part of a whole word. They should not be treated as separators.

Thank you for taking the time to write to us, but this is not
a bug. Please double-check the documentation available at
http://www.mysql.com/documentation/ and the instructions on
how to report a bug at http://bugs.mysql.com/how-to-report.php

As the manual explicitly mentions:

"The MATCH() function performs a natural language search..."

It is on the TODO list to add:

Support for "always-index words". They could be any strings the user wants to treat as words, examples are "C++", "AS/400", "TCP/IP", etc.

Make the stopword list depend on the language of the data.

To clarify Alexander's reply a bit - the manual also says:

MySQL uses a very simple parser to split text into words.  A "word" is
any sequence of characters consisting of letters, digits, `'', and `_'.
Any "word" that is present in the stopword list or is just too short
is ignored.
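
To make the quoted rule concrete, here is a rough Python sketch of that kind of splitting. It is not MySQL's actual parser: the name split_words, the ASCII-only character class and the min_len default of 4 (chosen to mirror the default of ft_min_word_len) are assumptions made for illustration.

```python
import re

# Approximation of the quoted word rule: letters, digits, ' and _.
# ASCII-only here (an assumption); what the server counts as a "letter"
# depends on the character set in use.
WORD_RE = re.compile(r"[A-Za-z0-9'_]+")


def split_words(text, min_len=4, stopwords=frozenset()):
    """Split text into "words" and drop short words and stopwords."""
    words = WORD_RE.findall(text)
    return [w for w in words
            if len(w) >= min_len and w.lower() not in stopwords]


print(split_words("AAAA&ouml;AAAA"))  # ['AAAA', 'ouml', 'AAAA']
```

Under this rule "&" and ";" are never part of a word, which is why the expression from the report falls apart into three separate words.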

And I do not agree that ";" is commonly used as part of a word.

Mostly it is not. And even in HTML it is NOT part of the word, but part of the HTML entity. But we do have on the TODO list a smart HTML parser that properly recognizes HTML entities (and tags, btw).
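
Until such a parser exists, one possible application-side approach (not something proposed in this thread) is to decode entities before the text is stored and before the search expression is built, so the server only ever sees the plain character. A minimal Python sketch, where decode_entities is a hypothetical helper name and whether the decoded character counts as a letter depends on the column's character set:

```python
import html


def decode_entities(text):
    """Replace HTML entities such as &ouml; with the characters they stand for."""
    return html.unescape(text)


print(decode_entities("AAAA&ouml;AAAA"))  # AAAAöAAAA
```

The same function would have to be applied both to the indexed data and to the AGAINST expression, otherwise the two sides no longer match.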