Bug #85875 Stopword handling for ngram parser prevents searching for meaningful words.
Submitted: 10 Apr 2017 6:43
Reporter: Seunguck Lee
Status: Open
Category: MySQL Server: FULLTEXT search
Severity: S4 (Feature request)
Version: 5.7.17
OS: Any
Assigned to:
CPU Architecture: Any
Tags: fulltext, NGRAM, stopword

[10 Apr 2017 6:43] Seunguck Lee
Description:
Stopword handling for the ngram parser is a little bit weird.
The ngram full-text parser does not index any token that contains (rather than equals) a stopword.

Currently, the InnoDB ngram full-text engine has the following words as default built-in stopwords:
"a", "an", "are", "as", "at", "be", "by", "com", "de", "en", "for", "i",
"in", "is", "it", "la", "of", "on", "or", "to", ...
So every token that contains "a", "an", or any other stopword in this list is left out of the index.
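
The containment rule described above can be sketched in a few lines of Python. This is only an illustration of the behavior as reported, not the InnoDB implementation: the function names are hypothetical, positions are simple 0-based character offsets, and the stopword list contains only the words quoted in this report (the rest of the list is elided there and is not filled in here).

```python
# Stopwords exactly as quoted in the report; the remainder of the
# built-in list is elided in the report and deliberately not guessed.
LISTED_STOPWORDS = ["a", "an", "are", "as", "at", "be", "by", "com", "de", "en",
                    "for", "i", "in", "is", "it", "la", "of", "on", "or", "to"]


def ngrams(text, n):
    """Return (position, token) pairs for every n-character window of text."""
    return [(i, text[i:i + n]) for i in range(len(text) - n + 1)]


def indexed_tokens(text, n, stopwords=LISTED_STOPWORDS):
    """Keep only the n-grams that do not contain any stopword as a substring."""
    return [(pos, tok) for pos, tok in ngrams(text, n)
            if not any(sw in tok for sw in stopwords)]


# Bigrams of the word used in the report's test case.
print(indexed_tokens("department", 2))
```

Under these assumptions the bigrams "de", "pa", "ar" and "en" are dropped, and the remaining tokens and offsets agree with the ngram_token_size=2 index cache listing shown in "How to repeat".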

This behavior prevents full-text search from finding meaningful words when ngram_token_size is larger, because a longer token is more likely to contain a stopword (how large is too large depends on the tokens involved).
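
To make the effect of token size concrete, here is a small Python sketch that counts how many n-grams of one word survive a "contains a stopword" filter as n grows. It is a simulation under assumptions, not MySQL code: the helper name is hypothetical and only the stopwords quoted in this report are used.

```python
# Stopwords as quoted in the report (the full built-in list is elided there).
LISTED_STOPWORDS = ["a", "an", "are", "as", "at", "be", "by", "com", "de", "en",
                    "for", "i", "in", "is", "it", "la", "of", "on", "or", "to"]


def surviving(text, n):
    """Return (kept, total): n-grams without any stopword substring vs. all n-grams."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    kept = [g for g in grams if not any(sw in g for sw in LISTED_STOPWORDS)]
    return len(kept), len(grams)


# Sweep the token size for the word used in the report's test case.
for n in range(2, 6):
    kept, total = surviving("department", n)
    print(f"n={n}: {kept} of {total} tokens kept")
```

For "department" the share of kept tokens shrinks with every step, and at n=5 no token is left in this sketch, so nothing about the word would be searchable.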

We can set innodb_ft_enable_stopword to OFF to avoid this.
But the current stopword handling for the ngram parser is likely to lead users into error.
(Also, the default stopword list is hidden in the source code, so this behavior looks error-prone.)

How to repeat:
use test;
drop table if exists ft_test;
set global innodb_ft_enable_stopword=on;
set innodb_ft_enable_stopword=on;

CREATE TABLE ft_test(
  id int not null,
  contents text,
  primary key(id),
  fulltext index fx_contents(contents) with parser ngram
) engine=innodb;

set global innodb_ft_aux_table='test/ft_test';
insert into ft_test values (1, 'department');

-- // ngram_token_size=2
mysql> SELECT * FROM INFORMATION_SCHEMA.INNODB_FT_INDEX_CACHE ;
+------+--------------+-------------+-----------+--------+----------+
| WORD | FIRST_DOC_ID | LAST_DOC_ID | DOC_COUNT | DOC_ID | POSITION |
+------+--------------+-------------+-----------+--------+----------+
| ep   |            2 |           2 |         1 |      2 |        1 |
| me   |            2 |           2 |         1 |      2 |        6 |
| nt   |            2 |           2 |         1 |      2 |        8 |
| rt   |            2 |           2 |         1 |      2 |        4 |
| tm   |            2 |           2 |         1 |      2 |        5 |
+------+--------------+-------------+-----------+--------+----------+

mysql> select * from ft_test where match(contents) against('depart' in boolean mode);
+----+------------+
| id | contents   |
+----+------------+
|  1 | department |
+----+------------+
1 row in set (0.00 sec)

-- // ngram_token_size=4 (the indexed token below is 4 characters long)
mysql> SELECT * FROM INFORMATION_SCHEMA.INNODB_FT_INDEX_CACHE ;
+------+--------------+-------------+-----------+--------+----------+
| WORD | FIRST_DOC_ID | LAST_DOC_ID | DOC_COUNT | DOC_ID | POSITION |
+------+--------------+-------------+-----------+--------+----------+
| rtme |            2 |           2 |         1 |      2 |        4 |
+------+--------------+-------------+-----------+--------+----------+

mysql> select * from ft_test where match(contents) against('depart' in boolean mode);
Empty set (0.00 sec)

Suggested fix:
Disable stopwords for the ngram parser by default, or
change stopword handling for the ngram parser to use the same exact-match comparison as the InnoDB default parser.
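
The difference between the two matching rules can be sketched as follows. This is a hypothetical Python comparison, not a patch: `keep_contains` models the behavior described in this report, `keep_exact` models the proposed exact-match rule, and the stopword set holds only the words quoted above.

```python
# Stopwords as quoted in the report (the full built-in list is elided there).
LISTED_STOPWORDS = {"a", "an", "are", "as", "at", "be", "by", "com", "de", "en",
                    "for", "i", "in", "is", "it", "la", "of", "on", "or", "to"}


def ngrams(text, n):
    """Return every n-character window of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def keep_contains(text, n):
    """Reported behavior: drop a token if any stopword occurs inside it."""
    return [t for t in ngrams(text, n)
            if not any(sw in t for sw in LISTED_STOPWORDS)]


def keep_exact(text, n):
    """Proposed behavior: drop a token only if the whole token equals a stopword."""
    return [t for t in ngrams(text, n) if t not in LISTED_STOPWORDS]


for n in (2, 4):
    print(n, keep_contains("department", n), keep_exact("department", n))
```

With exact matching, only the bigrams "de" and "en" would be dropped at n=2, and at n=4 every token of "department" would be kept, since no 4-character token equals a listed stopword.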