Bug #82330 Don't recursively-evaluate stopword after tokenize
Submitted: 25 Jul 2016 5:55
Modified: 20 May 3:46
Reporter: tsubasa tanaka (OCA)
Status: Verified
Category: MySQL Server: FULLTEXT search
Severity: S1 (Critical)
Version: 5.7.13, 5.7.22, 8.0.16
OS: CentOS (6.6)
Assigned to:
CPU Architecture: Any
Tags: fulltext, NGRAM

[25 Jul 2016 5:55] tsubasa tanaka
Description:
The stopword filter should not be applied recursively after tokenizing with the ngram FT parser.

https://github.com/mysql/mysql-server/blob/mysql-5.7.13/storage/innobase/fts/fts0fts.cc#L4...

Japanese text mixes Japanese characters (Hiragana, Katakana, Kanji) and Latin letters within a single sentence.
When stopwords are evaluated after tokenizing with the ngram FT parser (the current implementation), alphabetical tokens are destroyed by the recursive stopword filter.

For example, "BABY" is completely broken.

1. The ngram FT parser tokenizes it into ["BA", "AB", "BY"].
2. The recursive stopword filter evaluates them as [["B", "A", "BA"], ["A", "B", "AB"], ["B", "Y", "BY"]].
3-1. ["B", "A", "BA"] matches the default stopword "a".
3-2. ["A", "B", "AB"] matches the default stopword "a".
3-3. ["B", "Y", "BY"] matches the default stopword "by".
4. All tokens are dropped, so the word "BABY" cannot be found at all.
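The steps above can be modeled in a few lines. This is a hypothetical sketch of the buggy filtering logic (not the actual InnoDB code): a token is dropped if any of its substrings matches a stopword.

```python
# Sketch of the reported behavior: the stopword check recurses into
# every sub-ngram of each token, so a token is dropped if ANY of its
# substrings is a stopword. STOPWORDS here is a subset of
# INNODB_FT_DEFAULT_STOPWORD, just enough to show the effect.
STOPWORDS = {"a", "by"}

def ngrams(text, n=2):
    """Split text into overlapping n-grams (ngram_token_size=2)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def substrings(token):
    """All substrings of a token, as the recursive filter sees them."""
    return {token[i:j] for i in range(len(token))
                       for j in range(i + 1, len(token) + 1)}

def buggy_filter(tokens):
    # Drop a token if any substring (including itself) is a stopword.
    return [t for t in tokens if not (substrings(t) & STOPWORDS)]

tokens = ngrams("baby")      # ['ba', 'ab', 'by']
print(buggy_filter(tokens))  # [] -- every bigram of "baby" is dropped
```

Every bigram of "baby" contains either "a" or "by" as a substring, so nothing reaches the index and the word becomes unsearchable.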

How to repeat:
```
mysql> CREATE DATABASE d1 CHARSET utf8mb4;
Query OK, 1 row affected (0.00 sec)

mysql> CREATE TABLE t1 (num serial, val varchar(32), FULLTEXT KEY fts_with_ngram (val) WITH PARSER ngram);
Query OK, 0 rows affected (5.53 sec)

mysql> INSERT INTO t1 VALUES (1, '泣かないでbaby');
Query OK, 1 row affected (0.00 sec)

mysql> SELECT * FROM t1;
+-----+---------------------+
| num | val                 |
+-----+---------------------+
|   1 | 泣かないでbaby      |
+-----+---------------------+
1 row in set (0.00 sec)

mysql> SELECT * FROM t1 WHERE MATCH(val) AGAINST('baby' IN BOOLEAN MODE);
Empty set (0.00 sec)

mysql> SET GLOBAL innodb_ft_aux_table = 'd1/t1';
Query OK, 0 rows affected (0.00 sec)

mysql> SELECT * FROM information_schema.INNODB_FT_INDEX_CACHE ORDER BY position;
+--------+--------------+-------------+-----------+--------+----------+
| WORD   | FIRST_DOC_ID | LAST_DOC_ID | DOC_COUNT | DOC_ID | POSITION |
+--------+--------------+-------------+-----------+--------+----------+
| 泣か   |            2 |           2 |         1 |      2 |        0 |
| かな   |            2 |           2 |         1 |      2 |        3 |
| ない   |            2 |           2 |         1 |      2 |        6 |
| いで   |            2 |           2 |         1 |      2 |        9 |
| でb    |            2 |           2 |         1 |      2 |       12 |
+--------+--------------+-------------+-----------+--------+----------+
5 rows in set (0.00 sec)

mysql> SELECT * FROM information_schema.INNODB_FT_DEFAULT_STOPWORD;
+-------+
| value |
+-------+
| a     |
| about |
| an    |
| are   |
| as    |
| at    |
| be    |
| by    |
| com   |
| de    |
| en    |
| for   |
| from  |
| how   |
| i     |
| in    |
| is    |
| it    |
| la    |
| of    |
| on    |
| or    |
| that  |
| the   |
| this  |
| to    |
| was   |
| what  |
| when  |
| where |
| who   |
| will  |
| with  |
| und   |
| the   |
| www   |
+-------+
36 rows in set (0.00 sec)
```

Suggested fix:
Do not recursively evaluate stopwords after tokenizing.
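A minimal sketch of the suggested behavior (hypothetical, not the InnoDB implementation): compare each emitted ngram against the stopword list as a whole token, without recursing into its substrings.

```python
# Sketch of the suggested fix: a token is dropped only if the token
# ITSELF is a stopword, never because a substring matches one.
STOPWORDS = {"a", "by"}  # subset of INNODB_FT_DEFAULT_STOPWORD

def ngrams(text, n=2):
    """Split text into overlapping n-grams (ngram_token_size=2)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def fixed_filter(tokens):
    return [t for t in tokens if t not in STOPWORDS]

print(fixed_filter(ngrams("baby")))  # ['ba', 'ab']
```

With exact-token matching, "by" is still dropped as a stopword, but "ba" and "ab" survive, so a search for "baby" can still match.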
[25 Jul 2016 6:09] Umesh Shastry
Hello Tanaka-San,

Thank you for the report and test case.
Verified as described with 5.7.13 build.

Thanks,
Umesh
[28 Jun 2018 9:12] Umesh Shastry
Bug #91437 marked as duplicate of this one
[20 May 3:46] tsubasa tanaka
8.0.16 is affected too.