Bug #9087 Stop words are not ignored in BOOLEAN MODE subexpressions
Submitted: 10 Mar 2005 4:53 Modified: 20 May 2005 10:06
Reporter: David Gardiner
Status: Won't fix
Category: MySQL Server Severity: S3 (Non-critical)
Version: 4.1.10-nt OS: Windows (Windows XP Pro)
Assigned to: Sergei Golubchik CPU Architecture:Any

[10 Mar 2005 4:53] David Gardiner
Description:
Stop words appear to be handled incorrectly inside subexpressions of full-text boolean mode queries. The How To Repeat section below gives four examples based on the boolean query "+history +of +exposure". Since "of" is a stop word (noise word), it should be ignored. It is ignored in the simple expression, but when placed in a subexpression, as in "+history +(of) +exposure", it appears to be treated as required (per the +) even though it does not exist in the full-text index. Hence, no rows are returned.

When a non-noise word is placed in a subexpression, it is handled properly.

How to repeat:
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 64 to server version: 4.1.10-nt

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

mysql> # Baseline, without noise word "of"
mysql> select distinct ConceptID, String
    -> from ConceptSynonym
    -> where match (String) against ("+history +exposure" in boolean mode);
+-----------+------------------------------------------------------------------------------------+
| ConceptID | String                                                                             |
+-----------+------------------------------------------------------------------------------------+
|     28165 | Personal history of exposure to nitrogen mustard compounds                         |
|    375812 | History of exposure to asbestos                                                    |
|    375813 | History of exposure to potentially hazardous body fluids                           |
|    375814 | History of exposure to lead                                                        |
|    488499 | HISTORY OF INDUSTRIAL EXPOSURE:FINDING:POINT IN TIME:^PATIENT:NARRATIVE:REPORTED   |
|    488499 | HISTORY OF INDUSTRIAL EXPOSURE:FIND:PT:^PATIENT:NAR:REPORTED                       |
|    488500 | HISTORY OF INDUSTRIAL EXPOSURE:FINDING:POINT IN TIME:^PATIENT:NOMINAL:REPORTED     |
|    488500 | HISTORY OF INDUSTRIAL EXPOSURE:FIND:PT:^PATIENT:NOM:REPORTED                       |
|    488503 | HISTORY OF OCCUPATIONAL EXPOSURE:FINDING:POINT IN TIME:^PATIENT:NARRATIVE:REPORTED |
|    488503 | HISTORY OF OCCUPATIONAL EXPOSURE:FIND:PT:^PATIENT:NAR:REPORTED                     |
|    488504 | HISTORY OF OCCUPATIONAL EXPOSURE:FINDING:POINT IN TIME:^PATIENT:NOMINAL:REPORTED   |
|    488504 | HISTORY OF OCCUPATIONAL EXPOSURE:FIND:PT:^PATIENT:NOM:REPORTED                     |
|    489535 | HISTORY OF INDUSTRIAL EXPOSURE                                                     |
|    489537 | HISTORY OF OCCUPATIONAL EXPOSURE                                                   |
|    375812 | Personal history of asbestos exposure                                              |
|     28165 | Personal history of mustard gas exposure                                           |
+-----------+------------------------------------------------------------------------------------+
16 rows in set (0.03 sec)

mysql>
mysql> # With noise word.  Works
mysql> select distinct ConceptID, String
    -> from ConceptSynonym
    -> where match (String) against ("+history +of +exposure" in boolean mode);
+-----------+------------------------------------------------------------------------------------+
| ConceptID | String                                                                             |
+-----------+------------------------------------------------------------------------------------+
|     28165 | Personal history of exposure to nitrogen mustard compounds                         |
|    375812 | History of exposure to asbestos                                                    |
|    375813 | History of exposure to potentially hazardous body fluids                           |
|    375814 | History of exposure to lead                                                        |
|    488499 | HISTORY OF INDUSTRIAL EXPOSURE:FINDING:POINT IN TIME:^PATIENT:NARRATIVE:REPORTED   |
|    488499 | HISTORY OF INDUSTRIAL EXPOSURE:FIND:PT:^PATIENT:NAR:REPORTED                       |
|    488500 | HISTORY OF INDUSTRIAL EXPOSURE:FINDING:POINT IN TIME:^PATIENT:NOMINAL:REPORTED     |
|    488500 | HISTORY OF INDUSTRIAL EXPOSURE:FIND:PT:^PATIENT:NOM:REPORTED                       |
|    488503 | HISTORY OF OCCUPATIONAL EXPOSURE:FINDING:POINT IN TIME:^PATIENT:NARRATIVE:REPORTED |
|    488503 | HISTORY OF OCCUPATIONAL EXPOSURE:FIND:PT:^PATIENT:NAR:REPORTED                     |
|    488504 | HISTORY OF OCCUPATIONAL EXPOSURE:FINDING:POINT IN TIME:^PATIENT:NOMINAL:REPORTED   |
|    488504 | HISTORY OF OCCUPATIONAL EXPOSURE:FIND:PT:^PATIENT:NOM:REPORTED                     |
|    489535 | HISTORY OF INDUSTRIAL EXPOSURE                                                     |
|    489537 | HISTORY OF OCCUPATIONAL EXPOSURE                                                   |
|    375812 | Personal history of asbestos exposure                                              |
|     28165 | Personal history of mustard gas exposure                                           |
+-----------+------------------------------------------------------------------------------------+
16 rows in set (0.03 sec)

mysql>
mysql> #|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
mysql> #VVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVV
mysql> # Noise word in subexpression.  DOES NOT WORK
mysql> select distinct ConceptID, String
    -> from ConceptSynonym
    -> where match (String) against ("+history +(of) +exposure" in boolean mode);
Empty set (0.03 sec)

mysql>
mysql> # Non-noise word in subexpression.  Works
mysql> select distinct ConceptID, String
    -> from ConceptSynonym
    -> where match (String) against ("+history +of +(exposure)" in boolean mode);
+-----------+------------------------------------------------------------------------------------+
| ConceptID | String                                                                             |
+-----------+------------------------------------------------------------------------------------+
|     28165 | Personal history of exposure to nitrogen mustard compounds                         |
|    375812 | History of exposure to asbestos                                                    |
|    375813 | History of exposure to potentially hazardous body fluids                           |
|    375814 | History of exposure to lead                                                        |
|    488499 | HISTORY OF INDUSTRIAL EXPOSURE:FINDING:POINT IN TIME:^PATIENT:NARRATIVE:REPORTED   |
|    488499 | HISTORY OF INDUSTRIAL EXPOSURE:FIND:PT:^PATIENT:NAR:REPORTED                       |
|    488500 | HISTORY OF INDUSTRIAL EXPOSURE:FINDING:POINT IN TIME:^PATIENT:NOMINAL:REPORTED     |
|    488500 | HISTORY OF INDUSTRIAL EXPOSURE:FIND:PT:^PATIENT:NOM:REPORTED                       |
|    488503 | HISTORY OF OCCUPATIONAL EXPOSURE:FINDING:POINT IN TIME:^PATIENT:NARRATIVE:REPORTED |
|    488503 | HISTORY OF OCCUPATIONAL EXPOSURE:FIND:PT:^PATIENT:NAR:REPORTED                     |
|    488504 | HISTORY OF OCCUPATIONAL EXPOSURE:FINDING:POINT IN TIME:^PATIENT:NOMINAL:REPORTED   |
|    488504 | HISTORY OF OCCUPATIONAL EXPOSURE:FIND:PT:^PATIENT:NOM:REPORTED                     |
|    489535 | HISTORY OF INDUSTRIAL EXPOSURE                                                     |
|    489537 | HISTORY OF OCCUPATIONAL EXPOSURE                                                   |
|    375812 | Personal history of asbestos exposure                                              |
|     28165 | Personal history of mustard gas exposure                                           |
+-----------+------------------------------------------------------------------------------------+
16 rows in set (0.05 sec)

mysql>

Suggested fix:
Look at how noise words are handled in boolean mode subexpressions.
[20 May 2005 6:37] Vasily Kishkin
Could you please post the table definition here?
[20 May 2005 10:02] Sergei Golubchik
I believe you're right, but I don't think we can fix it in any logical and consistent way.

"of" in +(of) is no different from

  MATCH ... AGAINST ("of" IN BOOLEAN MODE)

and the latter returns an empty set.

"+history +(of) +exposure" is translated into "+history +() +exposure",
and the latter does not match anything because an empty subexpression never matches.
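
The translation described above can be illustrated with a small toy model. This is only a sketch of the described behaviour, not MySQL's actual parser: the function names (parse, strip_stopwords, matches, search) and the four-word STOPWORDS set are hypothetical, and only the + operator and parentheses are handled.

```python
import re

# Hypothetical, tiny stop word list; MySQL's built-in list is much longer.
STOPWORDS = {"of", "the", "to", "in"}


def parse(query):
    """Parse a query into a list of (required, node) pairs.

    A node is either a word (str) or a nested list for a parenthesised
    subexpression. Only '+' and '(...)' are understood in this sketch.
    """
    tokens = re.findall(r"[+()]|\w+", query.lower())
    pos = 0

    def group():
        nonlocal pos
        items = []
        while pos < len(tokens) and tokens[pos] != ")":
            required = tokens[pos] == "+"
            if required:
                pos += 1
            if tokens[pos] == "(":
                pos += 1
                node = group()
                pos += 1  # skip the closing ")"
            else:
                node = tokens[pos]
                pos += 1
            items.append((required, node))
        return items

    return group()


def strip_stopwords(items):
    """Drop stop words; a subexpression holding only stop words becomes []."""
    out = []
    for required, node in items:
        if isinstance(node, list):
            out.append((required, strip_stopwords(node)))
        elif node not in STOPWORDS:
            out.append((required, node))
    return out


def matches(items, text):
    """Evaluate the stripped query against one row of text."""
    words = set(re.findall(r"\w+", text.lower()))

    def ev(node):
        return group(node) if isinstance(node, list) else node in words

    def group(members):
        if not members:
            return False  # an empty (sub)expression never matches
        required = [n for r, n in members if r]
        if required:
            return all(ev(n) for n in required)
        return any(ev(n) for _, n in members)

    return group(items)


def search(query, text):
    return matches(strip_stopwords(parse(query)), text)


row = "History of exposure to lead"
for q in ("+history +of +exposure",
          "+history +(of) +exposure",
          "+history +of +(exposure)",
          "of"):
    print(q, "->", search(q, row))
```

In this model the top-level "+of" simply disappears from the list of required terms, while "+(of)" leaves behind a required but empty subexpression, which is why the third query in the report returns no rows.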
[20 May 2005 10:06] Sergei Golubchik
The ultimate fix would be to make boolean search independent of stopwords.
It's in the todo.