Description:
The stopword filter should not be evaluated recursively on tokens produced by the ngram full-text parser.
https://github.com/mysql/mysql-server/blob/mysql-5.7.13/storage/innobase/fts/fts0fts.cc#L4...
Japanese text routinely mixes Japanese characters (hiragana, katakana, kanji) with Latin letters in a single sentence.
In the current implementation, stopwords are evaluated after the ngram full-text parser has tokenized the text, and the stopword filter is applied recursively to every substring of each token. As a result, alphabetic tokens are dropped.
For example, "BABY" is lost entirely:
1. The ngram full-text parser tokenizes it into ["BA", "AB", "BY"] (ngram_token_size is 2 by default).
2. The recursive stopword filter evaluates every substring of each token: [["B", "A", "BA"], ["A", "B", "AB"], ["B", "Y", "BY"]]
3-1. ["B", "A", "BA"] matches the default stopword "a".
3-2. ["A", "B", "AB"] matches the default stopword "a".
3-3. ["B", "Y", "BY"] matches the default stopword "by".
4. All tokens are dropped, so the word "BABY" cannot be found by a full-text search.
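The steps above can be reproduced with a small simulation. This is only a sketch of the behavior described in this report, not the server code: the function names (`ngram_tokens`, `contains_stopword`, `indexed_tokens`) are hypothetical, and lowercasing is an assumption that stands in for the case-insensitive comparison. The stopword set is copied from `INNODB_FT_DEFAULT_STOPWORD`.

```python
# Default stopwords as listed in information_schema.INNODB_FT_DEFAULT_STOPWORD.
DEFAULT_STOPWORDS = {
    "a", "about", "an", "are", "as", "at", "be", "by", "com", "de", "en",
    "for", "from", "how", "i", "in", "is", "it", "la", "of", "on", "or",
    "that", "the", "this", "to", "was", "what", "when", "where", "who",
    "will", "with", "und", "www",
}


def ngram_tokens(text, n=2):
    """Split text into overlapping n-character tokens (step 1)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def contains_stopword(token, stopwords):
    """Return True if any substring of the token is a stopword (steps 2-3).

    Assumption: lowercasing approximates the case-insensitive comparison.
    """
    token = token.lower()
    for size in range(1, len(token) + 1):
        for start in range(len(token) - size + 1):
            if token[start:start + size] in stopwords:
                return True
    return False


def indexed_tokens(text, stopwords=DEFAULT_STOPWORDS, n=2):
    """Tokens that survive the recursive stopword filter (step 4)."""
    return [t for t in ngram_tokens(text, n)
            if not contains_stopword(t, stopwords)]


print(ngram_tokens("BABY"))    # ['BA', 'AB', 'BY']
print(indexed_tokens("BABY"))  # []
```

Under these assumptions, `indexed_tokens("泣かないでbaby")` keeps only the five tokens up to "でb", which is the same set that appears in `INNODB_FT_INDEX_CACHE` in the reproduction steps.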
How to repeat:
```
mysql> CREATE DATABASE d1 CHARSET utf8mb4;
Query OK, 1 row affected (0.00 sec)
mysql> USE d1;
Database changed
mysql> CREATE TABLE t1 (num serial, val varchar(32), FULLTEXT KEY fts_with_ngram (val) WITH PARSER ngram);
Query OK, 0 rows affected (5.53 sec)
mysql> INSERT INTO t1 VALUES (1, '泣かないでbaby');
Query OK, 1 row affected (0.00 sec)
mysql> SELECT * FROM t1;
+-----+---------------------+
| num | val |
+-----+---------------------+
| 1 | 泣かないでbaby |
+-----+---------------------+
1 row in set (0.00 sec)
mysql> SELECT * FROM t1 WHERE MATCH(val) AGAINST('baby' IN BOOLEAN MODE);
Empty set (0.00 sec)
mysql> SET GLOBAL innodb_ft_aux_table = 'd1/t1';
Query OK, 0 rows affected (0.00 sec)
mysql> SELECT * FROM information_schema.INNODB_FT_INDEX_CACHE ORDER BY position;
+--------+--------------+-------------+-----------+--------+----------+
| WORD | FIRST_DOC_ID | LAST_DOC_ID | DOC_COUNT | DOC_ID | POSITION |
+--------+--------------+-------------+-----------+--------+----------+
| 泣か | 2 | 2 | 1 | 2 | 0 |
| かな | 2 | 2 | 1 | 2 | 3 |
| ない | 2 | 2 | 1 | 2 | 6 |
| いで | 2 | 2 | 1 | 2 | 9 |
| でb | 2 | 2 | 1 | 2 | 12 |
+--------+--------------+-------------+-----------+--------+----------+
5 rows in set (0.00 sec)
mysql> SELECT * FROM information_schema.INNODB_FT_DEFAULT_STOPWORD;
+-------+
| value |
+-------+
| a |
| about |
| an |
| are |
| as |
| at |
| be |
| by |
| com |
| de |
| en |
| for |
| from |
| how |
| i |
| in |
| is |
| it |
| la |
| of |
| on |
| or |
| that |
| the |
| this |
| to |
| was |
| what |
| when |
| where |
| who |
| will |
| with |
| und |
| the |
| www |
+-------+
36 rows in set (0.00 sec)
```
Suggested fix:
Do not apply the stopword check recursively to the substrings of each token after ngram tokenization.
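One possible reading of this suggestion is to compare only the whole token against the stopword list. The sketch below illustrates that reading; it is an assumption, not the actual server change. The helper names are hypothetical, and the stopword set is reduced to the two default entries relevant to this example.

```python
# Subset of the default stopword list that matters for "baby".
STOPWORDS = {"a", "by"}


def ngram_tokens(text, n=2):
    """Split text into overlapping n-character tokens."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def whole_token_filter(text, stopwords=STOPWORDS, n=2):
    """Drop a token only when the entire token is a stopword.

    Assumption: lowercasing approximates the case-insensitive comparison.
    """
    return [t for t in ngram_tokens(text, n) if t.lower() not in stopwords]


print(whole_token_filter("baby"))  # ['ba', 'ab']
```

Under this reading "ba" and "ab" are kept, but "by" is still dropped because the whole token equals a stopword, so a complete fix may also need to reconsider how stopwords interact with ngram tokens in general.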