Bug #83776 InnoDB FULLTEXT search returns incorrect result for operators on ignored words
Submitted: 10 Nov 2016 20:42 Modified: 11 Nov 2016 18:45
Reporter: Dillon Sadofsky Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: FULLTEXT search Severity:S3 (Non-critical)
Version:5.6/5.7/8.0 OS:Any
Assigned to: CPU Architecture:Any
Tags: fulltext, innodb, operator

[10 Nov 2016 20:42] Dillon Sadofsky
Description:
We have an application where users can 'filter' results with a FTS on an InnoDB table.  We automatically put + in front of each word in order to make sure that typing more words results in fewer results instead of more.  However, if the user enters a stopword or a word shorter than the minimum word length (and probably longer than max word length, though I haven't tested) the result set is always empty.

Essentially, if min wordlength is 3, doing a FTS for "+LONGWORD" works as expected, but "+S +LONGWORD" returns nothing, despite the fact that the desired result contains both words.  I understand that the short word is thrown out of consideration, but I'd expect the requirement (+) on that word to be thrown out with it.

According to the documentation on InnoDB fulltext searches and their operators (http://dev.mysql.com/doc/refman/5.7/en/fulltext-boolean.html):
"A leading or trailing plus sign indicates that this word must be present in each row that is returned. InnoDB only supports leading plus signs."

Obviously, the word is present in the results, so the documentation is a bit misleading.  We're ok with short and stopwords being ignored, but its weird that words that are ignored that are present in the results are causing a search that cannot find any rows.

How to repeat:
Create table `test`(  
  `value` varchar(100) NOT NULL,
  FULLTEXT index `search` (`value`)
) ENGINE = InnoDB;

TRUNCATE test;
INSERT INTO test (`value`) VALUES ('LONGWORD MEDIUM');
INSERT INTO test (`value`) VALUES ('LONGWORD MEDIUM S');
INSERT INTO test (`value`) VALUES ('LONGWORD S');
INSERT INTO test (`value`) VALUES ('LONGWORD WAS MEDIUM');

# Works as expected
SELECT * FROM test WHERE MATCH(`value`) AGAINST ('+LONGWORD' IN BOOLEAN MODE);
# If S (as a short word) is ignored, you'd expect this to be the same as the above, but it will find nothing
SELECT * FROM test WHERE MATCH(`value`) AGAINST ('+LONGWORD +S' IN BOOLEAN MODE);
# Similarly, since AND is a stopword, even if the server is set up to search 3 letter words, this also finds nothing
SELECT * FROM test WHERE MATCH(`value`) AGAINST ('+LONGWORD +WAS' IN BOOLEAN MODE);
# No operator allows the stopword to be truly optional, resulting in what we expect:
SELECT * FROM test WHERE MATCH(`value`) AGAINST ('+LONGWORD WAS' IN BOOLEAN MODE);

DROP TABLE `test`;

Suggested fix:
We do not disagree that words outside of the indexed word length range, as well as stopwords, should be ignored from consideration of InnoDB FTS.  However, if they are ignored, any operators that modify them should also be ignored from consideration.  Otherwise, the results do not match the expected result set, which is confusing to the user.  

Since our end-users are not aware of the stopword list or length limitations, this makes FTS filtering seem non-deterministic.

I'm guessing some would suggest that I simply make our application recognize words outside of the length restrictions (as well as stopwords) and simply not add operators to them.  However, this passes and undue burden down to the client application to keep an up to date stopword list as well as dynamically load the min/max wordlength variables from the server in order to know what words to not modify.  Since I think those variables are only accessible in the config file (and not in session variables) I don't even think I can do that.

Our application is designed to run against various MySQL versions and configurations, so we cannot hard-code rules for what words to 'skip' when filtering.  We'd instead prefer to have words that are 'ignored' be *truly* ignored instead of only partially ignored in a FTS.
[10 Nov 2016 20:43] Dillon Sadofsky
More specific version #'s added.
[11 Nov 2016 2:24] MySQL Verification Team
c:\dbs>c:\dbs\5.7\bin\mysql -uroot --port=3570 -p --prompt="mysql 5.7 > "
Enter password: ******
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 4
Server version: 5.7.18-debug Source distribution PULL: 2016-NOV-05

Copyright (c) 2000, 2016, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql 5.7 > create database k;
Query OK, 1 row affected (0.06 sec)

mysql 5.7 > use k
Database changed
mysql 5.7 > Create table `test`(
    ->   `value` varchar(100) NOT NULL,
    ->   FULLTEXT index `search` (`value`)
    -> ) ENGINE = InnoDB;
Query OK, 0 rows affected (2.78 sec)

mysql 5.7 >
mysql 5.7 > TRUNCATE test;
Query OK, 0 rows affected (2.72 sec)

mysql 5.7 > INSERT INTO test (`value`) VALUES ('LONGWORD MEDIUM');
Query OK, 1 row affected (0.11 sec)

mysql 5.7 > INSERT INTO test (`value`) VALUES ('LONGWORD MEDIUM S');
Query OK, 1 row affected (0.03 sec)

mysql 5.7 > INSERT INTO test (`value`) VALUES ('LONGWORD S');
Query OK, 1 row affected (0.06 sec)

mysql 5.7 > INSERT INTO test (`value`) VALUES ('LONGWORD WAS MEDIUM');
Query OK, 1 row affected (0.01 sec)

mysql 5.7 > # Works as expected
mysql 5.7 > SELECT * FROM test WHERE MATCH(`value`) AGAINST ('+LONGWORD' IN BOOLEAN MODE);
+---------------------+
| value               |
+---------------------+
| LONGWORD MEDIUM     |
| LONGWORD MEDIUM S   |
| LONGWORD S          |
| LONGWORD WAS MEDIUM |
+---------------------+
4 rows in set (0.05 sec)

mysql 5.7 > # If S (as a short word) is ignored, you'd expect this to be the same as the above, but it will find nothing
mysql 5.7 > SELECT * FROM test WHERE MATCH(`value`) AGAINST ('+LONGWORD +S' IN BOOLEAN MODE)
    -> ;
Empty set (0.00 sec)

mysql 5.7 > # Similarly, since AND is a stopword, even if the server is set up to search 3 letter words, this also finds nothing
mysql 5.7 > SELECT * FROM test WHERE MATCH(`value`) AGAINST ('+LONGWORD +WAS' IN BOOLEAN MODE)
    -> ;
Empty set (0.00 sec)

mysql 5.7 > # No operator allows the stopword to be truly optional, resulting in what we expect:
mysql 5.7 > SELECT * FROM test WHERE MATCH(`value`) AGAINST ('+LONGWORD WAS' IN BOOLEAN MODE);
+---------------------+
| value               |
+---------------------+
| LONGWORD MEDIUM     |
| LONGWORD MEDIUM S   |
| LONGWORD S          |
| LONGWORD WAS MEDIUM |
+---------------------+
4 rows in set (0.00 sec)

mysql 5.7 > alter table test engine MyISAM;
Query OK, 4 rows affected (0.91 sec)
Records: 4  Duplicates: 0  Warnings: 0

mysql 5.7 > # If S (as a short word) is ignored, you'd expect this to be the same as the above, but it will find nothing
mysql 5.7 > SELECT * FROM test WHERE MATCH(`value`) AGAINST ('+LONGWORD +S' IN BOOLEAN MODE);
+---------------------+
| value               |
+---------------------+
| LONGWORD MEDIUM     |
| LONGWORD MEDIUM S   |
| LONGWORD S          |
| LONGWORD WAS MEDIUM |
+---------------------+
4 rows in set (0.00 sec)

mysql 5.7 > # Similarly, since AND is a stopword, even if the server is set up to search 3 letter words, this also finds nothing
mysql 5.7 > SELECT * FROM test WHERE MATCH(`value`) AGAINST ('+LONGWORD +WAS' IN BOOLEAN MODE);
+---------------------+
| value               |
+---------------------+
| LONGWORD MEDIUM     |
| LONGWORD MEDIUM S   |
| LONGWORD S          |
| LONGWORD WAS MEDIUM |
+---------------------+
4 rows in set (0.00 sec)

mysql 5.7 > # No operator allows the stopword to be truly optional, resulting in what we expect:
mysql 5.7 > SELECT * FROM test WHERE MATCH(`value`) AGAINST ('+LONGWORD WAS' IN BOOLEAN MODE);
+---------------------+
| value               |
+---------------------+
| LONGWORD MEDIUM     |
| LONGWORD MEDIUM S   |
| LONGWORD S          |
| LONGWORD WAS MEDIUM |
+---------------------+
4 rows in set (0.00 sec)

mysql 5.7 >
[11 Nov 2016 18:45] MySQL Verification Team
Thank you for the bug report.

miguel@ural:~/dbs> ./56c
Enter password: 
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 1
Server version: 5.6.36 Source distribution  PULL: 2016-NOV-11

Copyright (c) 2000, 2016, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql 5.6 > CREATE DATABASE ah;
Query OK, 1 row affected (0.00 sec)

mysql 5.6 > USE ah
Database changed
mysql 5.6 > 
mysql 5.6 > Create table `test`(  
    ->   `value` varchar(100) NOT NULL,
    ->   FULLTEXT index `search` (`value`)
    -> ) ENGINE = InnoDB;
Query OK, 0 rows affected (2.94 sec)

mysql 5.6 > 
mysql 5.6 > TRUNCATE test;
Query OK, 0 rows affected (2.49 sec)

mysql 5.6 > INSERT INTO test (`value`) VALUES ('LONGWORD MEDIUM');
Query OK, 1 row affected (0.05 sec)

mysql 5.6 > INSERT INTO test (`value`) VALUES ('LONGWORD MEDIUM S');
Query OK, 1 row affected (0.05 sec)

mysql 5.6 > INSERT INTO test (`value`) VALUES ('LONGWORD S');
Query OK, 1 row affected (0.16 sec)

mysql 5.6 > INSERT INTO test (`value`) VALUES ('LONGWORD WAS MEDIUM');
Query OK, 1 row affected (0.06 sec)

mysql 5.6 > 
mysql 5.6 > # Works as expected
mysql 5.6 > SELECT * FROM test WHERE MATCH(`value`) AGAINST ('+LONGWORD' IN BOOLEAN MODE);
+---------------------+
| value               |
+---------------------+
| LONGWORD MEDIUM     |
| LONGWORD MEDIUM S   |
| LONGWORD S          |
| LONGWORD WAS MEDIUM |
+---------------------+
4 rows in set (0.00 sec)

mysql 5.6 > # If S (as a short word) is ignored, you'd expect this to be the same as the above, but it will find nothing
mysql 5.6 > SELECT * FROM test WHERE MATCH(`value`) AGAINST ('+LONGWORD +S' IN BOOLEAN MODE);
Empty set (0.00 sec)

mysql 5.6 > # Similarly, since AND is a stopword, even if the server is set up to search 3 letter words, this also finds nothing
mysql 5.6 > SELECT * FROM test WHERE MATCH(`value`) AGAINST ('+LONGWORD +WAS' IN BOOLEAN MODE);
Empty set (0.00 sec)

mysql 5.6 > # No operator allows the stopword to be truly optional, resulting in what we expect:
mysql 5.6 > SELECT * FROM test WHERE MATCH(`value`) AGAINST ('+LONGWORD WAS' IN BOOLEAN MODE);
+---------------------+
| value               |
+---------------------+
| LONGWORD MEDIUM     |
| LONGWORD MEDIUM S   |
| LONGWORD S          |
| LONGWORD WAS MEDIUM |
+---------------------+
4 rows in set (0.00 sec)

mysql 5.6 > ALTER TABLE test ENGINE MyISAM;
Query OK, 4 rows affected (0.38 sec)
Records: 4  Duplicates: 0  Warnings: 0

mysql 5.6 > # Works as expected
mysql 5.6 > SELECT * FROM test WHERE MATCH(`value`) AGAINST ('+LONGWORD' IN BOOLEAN MODE);
+---------------------+
| value               |
+---------------------+
| LONGWORD MEDIUM     |
| LONGWORD MEDIUM S   |
| LONGWORD S          |
| LONGWORD WAS MEDIUM |
+---------------------+
4 rows in set (0.01 sec)

mysql 5.6 > # If S (as a short word) is ignored, you'd expect this to be the same as the above, but it will find nothing
mysql 5.6 > SELECT * FROM test WHERE MATCH(`value`) AGAINST ('+LONGWORD +S' IN BOOLEAN MODE);
+---------------------+
| value               |
+---------------------+
| LONGWORD MEDIUM     |
| LONGWORD MEDIUM S   |
| LONGWORD S          |
| LONGWORD WAS MEDIUM |
+---------------------+
4 rows in set (0.00 sec)

mysql 5.6 > # Similarly, since AND is a stopword, even if the server is set up to search 3 letter words, this also finds nothing
mysql 5.6 > SELECT * FROM test WHERE MATCH(`value`) AGAINST ('+LONGWORD +WAS' IN BOOLEAN MODE);
+---------------------+
| value               |
+---------------------+
| LONGWORD MEDIUM     |
| LONGWORD MEDIUM S   |
| LONGWORD S          |
| LONGWORD WAS MEDIUM |
+---------------------+
4 rows in set (0.00 sec)

mysql 5.6 > # No operator allows the stopword to be truly optional, resulting in what we expect:
mysql 5.6 > SELECT * FROM test WHERE MATCH(`value`) AGAINST ('+LONGWORD WAS' IN BOOLEAN MODE);
+---------------------+
| value               |
+---------------------+
| LONGWORD MEDIUM     |
| LONGWORD MEDIUM S   |
| LONGWORD S          |
| LONGWORD WAS MEDIUM |
+---------------------+
4 rows in set (0.00 sec)

mysql 5.6 >
[11 Nov 2016 18:50] MySQL Verification Team
miguel@tikal:~/dbs $ ./80c
Enter password: 
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 4
Server version: 8.0.1-dmr Source distribution PULL: 2016-NOV-05

Copyright (c) 2000, 2016, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql 8.0 > CREATE DATABASE ah;
Query OK, 1 row affected (0.11 sec)

mysql 8.0 > USE ah
Database changed
mysql 8.0 > Create table `test`(  
    ->   `value` varchar(100) NOT NULL,
    ->   FULLTEXT index `search` (`value`)
    -> ) ENGINE = InnoDB;
Query OK, 0 rows affected (2.15 sec)

mysql 8.0 > 
mysql 8.0 > TRUNCATE test;
Query OK, 0 rows affected (4.31 sec)

mysql 8.0 > INSERT INTO test (`value`) VALUES ('LONGWORD MEDIUM');
Query OK, 1 row affected (0.03 sec)

mysql 8.0 > INSERT INTO test (`value`) VALUES ('LONGWORD MEDIUM S');
Query OK, 1 row affected (0.06 sec)

mysql 8.0 > INSERT INTO test (`value`) VALUES ('LONGWORD S');
Query OK, 1 row affected (0.15 sec)

mysql 8.0 > INSERT INTO test (`value`) VALUES ('LONGWORD WAS MEDIUM');
Query OK, 1 row affected (0.04 sec)

mysql 8.0 > 
mysql 8.0 > # Works as expected
mysql 8.0 > SELECT * FROM test WHERE MATCH(`value`) AGAINST ('+LONGWORD' IN BOOLEAN MODE);
+---------------------+
| value               |
+---------------------+
| LONGWORD MEDIUM     |
| LONGWORD MEDIUM S   |
| LONGWORD S          |
| LONGWORD WAS MEDIUM |
+---------------------+
4 rows in set (0.00 sec)

mysql 8.0 > # If S (as a short word) is ignored, you'd expect this to be the same as the above, but it will find nothing
mysql 8.0 > SELECT * FROM test WHERE MATCH(`value`) AGAINST ('+LONGWORD +S' IN BOOLEAN MODE);
Empty set (0.00 sec)

mysql 8.0 > # Similarly, since AND is a stopword, even if the server is set up to search 3 letter words, this also finds nothing
mysql 8.0 > SELECT * FROM test WHERE MATCH(`value`) AGAINST ('+LONGWORD +WAS' IN BOOLEAN MODE);
Empty set (0.00 sec)

mysql 8.0 > # No operator allows the stopword to be truly optional, resulting in what we expect:
mysql 8.0 > SELECT * FROM test WHERE MATCH(`value`) AGAINST ('+LONGWORD WAS' IN BOOLEAN MODE);
+---------------------+
| value               |
+---------------------+
| LONGWORD MEDIUM     |
| LONGWORD MEDIUM S   |
| LONGWORD S          |
| LONGWORD WAS MEDIUM |
+---------------------+
4 rows in set (0.00 sec)

mysql 8.0 > ALTER TABLE test ENGINE MyISAM;
Query OK, 4 rows affected (0.77 sec)
Records: 4  Duplicates: 0  Warnings: 0

mysql 8.0 > # Works as expected
mysql 8.0 > SELECT * FROM test WHERE MATCH(`value`) AGAINST ('+LONGWORD' IN BOOLEAN MODE);
+---------------------+
| value               |
+---------------------+
| LONGWORD MEDIUM     |
| LONGWORD MEDIUM S   |
| LONGWORD S          |
| LONGWORD WAS MEDIUM |
+---------------------+
4 rows in set (0.00 sec)

mysql 8.0 > # If S (as a short word) is ignored, you'd expect this to be the same as the above, but it will find nothing
mysql 8.0 > SELECT * FROM test WHERE MATCH(`value`) AGAINST ('+LONGWORD +S' IN BOOLEAN MODE);
+---------------------+
| value               |
+---------------------+
| LONGWORD MEDIUM     |
| LONGWORD MEDIUM S   |
| LONGWORD S          |
| LONGWORD WAS MEDIUM |
+---------------------+
4 rows in set (0.00 sec)

mysql 8.0 > # Similarly, since AND is a stopword, even if the server is set up to search 3 letter words, this also finds nothing
mysql 8.0 > SELECT * FROM test WHERE MATCH(`value`) AGAINST ('+LONGWORD +WAS' IN BOOLEAN MODE);
+---------------------+
| value               |
+---------------------+
| LONGWORD MEDIUM     |
| LONGWORD MEDIUM S   |
| LONGWORD S          |
| LONGWORD WAS MEDIUM |
+---------------------+
4 rows in set (0.00 sec)

mysql 8.0 > # No operator allows the stopword to be truly optional, resulting in what we expect:
mysql 8.0 > SELECT * FROM test WHERE MATCH(`value`) AGAINST ('+LONGWORD WAS' IN BOOLEAN MODE);
+---------------------+
| value               |
+---------------------+
| LONGWORD MEDIUM     |
| LONGWORD MEDIUM S   |
| LONGWORD S          |
| LONGWORD WAS MEDIUM |
+---------------------+
4 rows in set (0.00 sec)

mysql 8.0 > # Works as expected
mysql 8.0 > SELECT * FROM test WHERE MATCH(`value`) AGAINST ('+LONGWORD' IN BOOLEAN MODE);
+---------------------+
| value               |
+---------------------+
| LONGWORD MEDIUM     |
| LONGWORD MEDIUM S   |
| LONGWORD S          |
| LONGWORD WAS MEDIUM |
+---------------------+
4 rows in set (0.00 sec)

mysql 8.0 > # If S (as a short word) is ignored, you'd expect this to be the same as the above, but it will find nothing
mysql 8.0 > SELECT * FROM test WHERE MATCH(`value`) AGAINST ('+LONGWORD +S' IN BOOLEAN MODE);
+---------------------+
| value               |
+---------------------+
| LONGWORD MEDIUM     |
| LONGWORD MEDIUM S   |
| LONGWORD S          |
| LONGWORD WAS MEDIUM |
+---------------------+
4 rows in set (0.00 sec)

mysql 8.0 > # Similarly, since AND is a stopword, even if the server is set up to search 3 letter words, this also finds nothing
mysql 8.0 > SELECT * FROM test WHERE MATCH(`value`) AGAINST ('+LONGWORD +WAS' IN BOOLEAN MODE);
+---------------------+
| value               |
+---------------------+
| LONGWORD MEDIUM     |
| LONGWORD MEDIUM S   |
| LONGWORD S          |
| LONGWORD WAS MEDIUM |
+---------------------+
4 rows in set (0.00 sec)

mysql 8.0 > # No operator allows the stopword to be truly optional, resulting in what we expect:
mysql 8.0 > SELECT * FROM test WHERE MATCH(`value`) AGAINST ('+LONGWORD WAS' IN BOOLEAN MODE);
+---------------------+
| value               |
+---------------------+
| LONGWORD MEDIUM     |
| LONGWORD MEDIUM S   |
| LONGWORD S          |
| LONGWORD WAS MEDIUM |
+---------------------+
4 rows in set (0.00 sec)

mysql 8.0 >