Bug #84420 stopwords and ngram indexes
Submitted: 5 Jan 2017 11:19 Modified: 5 Jan 2017 12:16
Reporter: Miguel Angel Nieto Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: FULLTEXT search Severity:S3 (Non-critical)
Version:5.7.17 OS:Any
Assigned to: CPU Architecture:Any

[5 Jan 2017 11:19] Miguel Angel Nieto
Description:
Ngram indexes also check the stopwords list, to see if any indexed element *contain* one of the words on that list. This looks good and it is the normal behaviour, but I don't think that the default table is suitable to use it with ngram.

For example, any item that contains 'a' or 'i' will be ignored. So for example, if you have word "east", you cannot search for "ea" because it has been ignored.

Ngram should have a different default list of stopwords, or an empty list.

How to repeat:
mysql> CREATE TABLE `articles` ( 
`id` int(10) unsigned NOT NULL AUTO_INCREMENT, 
`body` text, 
PRIMARY KEY (`id`), 
FULLTEXT KEY `ftx` (`body`) /*!50100 WITH PARSER `ngram` */ 
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

mysql> insert into articles (body) values ('east'); 
mysql> insert into articles (body) values ('east area'); 
mysql> insert into articles (body) values ('east job'); 
mysql> insert into articles (body) values ('eastnation'); 
mysql> insert into articles (body) values ('eastway, try try');

mysql> SELECT * FROM articles WHERE MATCH(body) AGAINST('ea' IN BOOLEAN MODE); 
Empty set (0.00 sec)
[5 Jan 2017 12:16] MySQL Verification Team
Hello Miguel,

Thank you for the report and feedback!
Imho by default, the ngram parser uses the default stopword list.

Thanks,
Umesh
[5 Jan 2017 12:19] MySQL Verification Team
Currently, to define your own stopword list for all InnoDB tables, define a table with the same structure as the INNODB_FT_DEFAULT_STOPWORD table, populate it with stopwords, and set the value of the innodb_ft_server_stopword_table option to a value in the form db_name/table_name before creating the full-text index. The stopword table must have a single VARCHAR column named value. The following example demonstrates creating and configuring a new global stopword table for InnoDB. https://dev.mysql.com/doc/refman/5.7/en/fulltext-stopwords.html
[5 Jan 2017 12:20] MySQL Verification Team
-- with ngram_token_size=1

[umshastr@hod03]/export/umesh/server/binaries/GABuilds/mysql-5.7.17: bin/mysqld --no-defaults --basedir=/export/umesh/server/binaries/GABuilds/mysql-5.7.17 --datadir=/export/umesh/server/binaries/GABuilds/mysql-5.7.17/84108 --core-file --socket=/tmp/mysql_ushastry.sock --port=3306 --log-error=/export/umesh/server/binaries/GABuilds/mysql-5.7.17/84108/log.err --ngram_token_size=1 2>&1 &
[1] 20217
[umshastr@hod03]/export/umesh/server/binaries/GABuilds/mysql-5.7.17: bin/mysql -uroot -S /tmp/mysql_ushastry.sock                                                                                 Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 3
Server version: 5.7.17 MySQL Community Server (GPL)

Copyright (c) 2000, 2016, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

root@localhost [(none)]> use test
Database changed
root@localhost [test]> SELECT * FROM articles WHERE MATCH(body) AGAINST('ea' IN BOOLEAN MODE);
+----+------------------+
| id | body             |
+----+------------------+
|  2 | east area        |
|  1 | east             |
|  3 | east job         |
|  4 | eastnation       |
|  5 | eastway, try try |
+----+------------------+
5 rows in set (0.01 sec)

root@localhost [test]> show variables like 'ngram_token_size';
+------------------+-------+
| Variable_name    | Value |
+------------------+-------+
| ngram_token_size | 1     |
+------------------+-------+
1 row in set (0.00 sec)
[3 Apr 2017 11:53] Sveta Smirnova
Issue with changing INNODB_FT_DEFAULT_STOPWORD table is that other methods of full-text search are affected as well. For example, in this case is not possible to use mecab in one index and ngram in another.

Fix should either exclude INNODB_FT_DEFAULT_STOPWORD table for ngram search or create separate table for ngram.
[15 Mar 2019 7:02] MySQL Verification Team
Bug #94640 marked as duplicate of this one