| Bug #1935 | Full-text Search doesn't work with HTML-Entities | ||
|---|---|---|---|
| Submitted: | 24 Nov 2003 13:41 | Modified: | 25 Nov 2003 2:55 |
| Reporter: | [ name withheld ] | Email Updates: | |
| Status: | Not a Bug | Impact on me: | |
| Category: | MySQL Server | Severity: | S3 (Non-critical) |
| Version: | 4.0.15 | OS: | Linux (linux) |
| Assigned to: | | CPU Architecture: | Any |
[24 Nov 2003 14:42]
Alexander Keremidarski
Thank you for taking the time to write to us, but this is not a bug. Please double-check the documentation available at http://www.mysql.com/documentation/ and the instructions on how to report a bug at http://bugs.mysql.com/how-to-report.php

As the manual explicitly mentions: "The MATCH() function performs a natural language search..."

The TODO list includes these items: Support for "always-index words". They could be any strings the user wants to treat as words; examples are "C++", "AS/400", "TCP/IP", etc. Make the stopword list depend on the language of the data.
[25 Nov 2003 2:55]
Sergei Golubchik
To clarify Alexander's reply a bit, the manual also says:

"MySQL uses a very simple parser to split text into words. A "word" is any sequence of characters consisting of letters, digits, ', and _. Any "word" that is present in the stopword list or is just too short is ignored."

And I do not agree that ";" is commonly used as part of a word. Mostly it is not. Even in HTML it is NOT part of the word, but part of the HTML entity. We do have on the TODO list a smart HTML parser that properly recognizes HTML entities (and tags, by the way).
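The splitting rule quoted from the manual can be approximated in a few lines of Python. This is only a sketch for illustration, not MySQL's actual parser: the function name `split_words` is hypothetical, the regular expression is an approximation of "letters, digits, ', and _", and the minimum length of 4 is assumed from the default value of `ft_min_word_len`.

```python
import re

# Approximation of the "very simple parser": a word is a run of letters,
# digits, apostrophes and underscores. Every other character separates words.
WORD_RE = re.compile(r"[\w']+")


def split_words(text, min_len=4, stopwords=frozenset()):
    """Split text into words, dropping short words and stopwords.

    min_len=4 is an assumption based on the default ft_min_word_len;
    the stopword set is empty here and must be supplied by the caller.
    """
    words = WORD_RE.findall(text)
    return [w for w in words if len(w) >= min_len and w.lower() not in stopwords]


# "&" and ";" are not word characters, so the entity name becomes its own word.
print(split_words("AAAA&ouml;AAAA"))  # ['AAAA', 'ouml', 'AAAA']

# Strings such as "C++" or "TCP/IP" break into pieces that are too short.
print(split_words("C++ and TCP/IP"))  # []
```

Under these assumptions, a search term containing an HTML entity is seen as three separate words, which matches the behaviour described in the report.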

Description:
Matching against an expression that includes HTML entities is not possible. For example, MATCH (field) AGAINST ('AAAA&ouml;AAAA' IN BOOLEAN MODE) returns all rows in which the field contains "ouml" or "AAAA", because the internal parser splits the expression "AAAA&ouml;AAAA" at "&" and ";".

The only way to avoid this problem is to enclose the word that contains the HTML entity in double quotes. But even this workaround does not give the expected result with the asterisk operator, because MATCH (field) AGAINST ('"AAAA&ouml;AAAA*"' IN BOOLEAN MODE) performs a full-text search for the exact phrase AAAA&ouml;AAAA*.

How to repeat:
See description.

Suggested fix:
Split the AGAINST expression only at spaces. Characters like "&" and ";" are often part of a whole word and should not be treated as separators.
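The double-quote workaround from the description could be applied on the application side when the AGAINST expression is built. The following is a minimal sketch under stated assumptions: `boolean_term` and `against_expression` are hypothetical helper names, and the set of word characters is the same approximation of the manual's rule (letters, digits, ', and _).

```python
import re

# Assumption: word characters as described in the manual
# (letters, digits, apostrophe, underscore).
PLAIN_WORD = re.compile(r"[\w']+")


def boolean_term(term):
    """Return a plain word (optionally with a trailing *) unchanged,
    and anything else wrapped in double quotes as a phrase."""
    stem = term[:-1] if term.endswith("*") else term
    if PLAIN_WORD.fullmatch(stem):
        return term
    # A double quote inside the term would end the phrase early, so drop it.
    return '"' + term.replace('"', "") + '"'


def against_expression(terms):
    """Join the terms into one boolean-mode search string."""
    return " ".join(boolean_term(t) for t in terms)


print(against_expression(["AAAA&ouml;AAAA", "mysql"]))
# "AAAA&ouml;AAAA" mysql
```

The resulting string would be passed to AGAINST (... IN BOOLEAN MODE) as a bound parameter. As the description notes, a trailing asterisk inside the quotes is searched for literally, so this sketch cannot express a prefix search on a term that contains an entity.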