Bug #82330 Don't recursively-evaluate stopword after tokenize
Submitted: 25 Jul 2016 5:55 Modified: 6 Apr 9:15
Reporter: Tsubasa Tanaka (OCA) Email Updates:
Status: Verified Impact on me: None
Category:MySQL Server: FULLTEXT search Severity:S1 (Critical)
Version:5.7.13, 5.7.22, 8.0.16, 8.3.0 OS:CentOS (6.6)
Assigned to: CPU Architecture:Any
Tags: fulltext, NGRAM

[25 Jul 2016 5:55] Tsubasa Tanaka
Description:
Stopwords should not be recursively evaluated after tokenizing with the ngram FT parser.

https://github.com/mysql/mysql-server/blob/mysql-5.7.13/storage/innobase/fts/fts0fts.cc#L4...

Japanese text mixes Japanese characters (Hiragana, Katakana, Kanji) and Latin letters in a single sentence.
When the stopword filter is evaluated after tokenizing with the ngram FT parser (the current implementation), alphabetical tokens are destroyed by the recursive stopword filter.

For example, "BABY" is completely broken.

1. The ngram FT parser tokenizes it into ["BA", "AB", "BY"].
2. The recursive stopword filter evaluates each token's sub-tokens: [["B", "A", "BA"], ["A", "B", "AB"], ["B", "Y", "BY"]].
3-1. ["B", "A", "BA"] matches the default stopword "a".
3-2. ["A", "B", "AB"] matches the default stopword "a".
3-3. ["B", "Y", "BY"] matches the default stopword "by".
4. All tokens are dropped, so the word "BABY" cannot be found by a search.
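The steps above can be sketched in Python. This is a simplified model of the reported behavior, not the actual InnoDB code; `STOPWORDS` here is only the relevant subset of INNODB_FT_DEFAULT_STOPWORD, and the helper names are illustrative:

```python
# Simplified model of the buggy behavior: after ngram tokenization,
# the stopword filter also examines sub-tokens of each token, so any
# token containing a stopword as a sub-token is dropped.
STOPWORDS = {"a", "by"}  # relevant subset of INNODB_FT_DEFAULT_STOPWORD


def ngram_tokenize(text, n=2):
    """Split text into overlapping n-grams (ngram_token_size=2)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def recursive_stopword_filter(tokens, n=2):
    """Drop a token if the token itself or any shorter sub-token is a stopword."""
    kept = []
    for tok in tokens:
        subs = [tok] + [tok[i:i + k]
                        for k in range(1, n)
                        for i in range(len(tok) - k + 1)]
        if not any(s.lower() in STOPWORDS for s in subs):
            kept.append(tok)
    return kept


tokens = ngram_tokenize("BABY")           # ['BA', 'AB', 'BY']
print(recursive_stopword_filter(tokens))  # [] -- every token is dropped
```

With this behavior no token of "BABY" survives, which matches the empty FT index cache shown in the repeat steps below (only "でb" remains, because "b" alone is not a stopword).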

How to repeat:
```
mysql> CREATE DATABASE d1 CHARSET utf8mb4;
Query OK, 1 row affected (0.00 sec)

mysql> CREATE TABLE t1 (num serial, val varchar(32), FULLTEXT KEY fts_with_ngram (val) WITH PARSER ngram);
Query OK, 0 rows affected (5.53 sec)

mysql> INSERT INTO t1 VALUES (1, '泣かないでbaby');
Query OK, 1 row affected (0.00 sec)

mysql> SELECT * FROM t1;
+-----+---------------------+
| num | val                 |
+-----+---------------------+
|   1 | 泣かないでbaby      |
+-----+---------------------+
1 row in set (0.00 sec)

mysql> SELECT * FROM t1 WHERE MATCH(val) AGAINST('baby' IN BOOLEAN MODE);
Empty set (0.00 sec)

mysql> SET GLOBAL innodb_ft_aux_table = 'd1/t1';
Query OK, 0 rows affected (0.00 sec)

mysql> SELECT * FROM information_schema.INNODB_FT_INDEX_CACHE ORDER BY position;
+--------+--------------+-------------+-----------+--------+----------+
| WORD   | FIRST_DOC_ID | LAST_DOC_ID | DOC_COUNT | DOC_ID | POSITION |
+--------+--------------+-------------+-----------+--------+----------+
| 泣か   |            2 |           2 |         1 |      2 |        0 |
| かな   |            2 |           2 |         1 |      2 |        3 |
| ない   |            2 |           2 |         1 |      2 |        6 |
| いで   |            2 |           2 |         1 |      2 |        9 |
| でb    |            2 |           2 |         1 |      2 |       12 |
+--------+--------------+-------------+-----------+--------+----------+
5 rows in set (0.00 sec)

mysql> SELECT * FROM information_schema.INNODB_FT_DEFAULT_STOPWORD;
+-------+
| value |
+-------+
| a     |
| about |
| an    |
| are   |
| as    |
| at    |
| be    |
| by    |
| com   |
| de    |
| en    |
| for   |
| from  |
| how   |
| i     |
| in    |
| is    |
| it    |
| la    |
| of    |
| on    |
| or    |
| that  |
| the   |
| this  |
| to    |
| was   |
| what  |
| when  |
| where |
| who   |
| will  |
| with  |
| und   |
| the   |
| www   |
+-------+
36 rows in set (0.00 sec)
```

Suggested fix:
Do not recursively evaluate stopwords after tokenizing; compare only the whole token against the stopword list.
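A sketch of the suggested behavior, again as a simplified Python model rather than the InnoDB implementation: each whole ngram token is compared against the stopword list, with no sub-token evaluation. "BY" still matches the stopword "by", but "BA" and "AB" survive, so "BABY" remains searchable:

```python
STOPWORDS = {"a", "by"}  # relevant subset of INNODB_FT_DEFAULT_STOPWORD


def ngram_tokenize(text, n=2):
    """Split text into overlapping n-grams (ngram_token_size=2)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def stopword_filter(tokens):
    """Suggested behavior: test only the whole token against the stopword list."""
    return [t for t in tokens if t.lower() not in STOPWORDS]


print(stopword_filter(ngram_tokenize("BABY")))  # ['BA', 'AB'] -- only 'BY' matches 'by'
```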
[25 Jul 2016 6:09] MySQL Verification Team
Hello Tanaka-San,

Thank you for the report and test case.
Verified as described with 5.7.13 build.

Thanks,
Umesh
[28 Jun 2018 9:12] MySQL Verification Team
Bug #91437 marked as duplicate of this one
[20 May 2019 3:46] Tsubasa Tanaka
8.0.16 is affected too.
[6 Apr 7:06] Tsubasa Tanaka
It seems MySQL 8.3.0 is not affected by this issue.

mysql83 10> INSERT INTO t1 VALUES (1, '泣かないでbaby');
Query OK, 1 row affected (0.01 sec)

mysql83 10> SELECT * FROM t1 WHERE MATCH(val) AGAINST('baby' IN BOOLEAN MODE);
+-----+---------------------+
| num | val                 |
+-----+---------------------+
|   1 | 泣かないでbaby      |
+-----+---------------------+
1 row in set (0.02 sec)

mysql83 10> SET GLOBAL innodb_ft_aux_table = 'd1/t1';
Query OK, 0 rows affected (0.00 sec)

mysql83 10> SELECT * FROM information_schema.INNODB_FT_INDEX_CACHE ORDER BY position;
+------+--------------+-------------+-----------+--------+----------+
| WORD | FIRST_DOC_ID | LAST_DOC_ID | DOC_COUNT | DOC_ID | POSITION |
+------+--------------+-------------+-----------+--------+----------+
| 泣   |            2 |           2 |         1 |      2 |        0 |
| b    |            2 |           2 |         1 |      2 |        2 |
| か   |            2 |           2 |         1 |      2 |        3 |
| な   |            2 |           2 |         1 |      2 |        6 |
| い   |            2 |           2 |         1 |      2 |        9 |
| で   |            2 |           2 |         1 |      2 |       12 |
| b    |            2 |           2 |         1 |      2 |       15 |
| y    |            2 |           2 |         1 |      2 |       18 |
+------+--------------+-------------+-----------+--------+----------+
8 rows in set (0.01 sec)
[6 Apr 9:15] Tsubasa Tanaka
Sorry, the above comment is wrong (my environment had ngram_token_size=1).
With ngram_token_size=2, this issue reproduces as of 8.3.0:

mysql83 8> SELECT @@ngram_token_size;
+--------------------+
| @@ngram_token_size |
+--------------------+
|                  2 |
+--------------------+
1 row in set (0.00 sec)

mysql83 8> CREATE TABLE t1 (num serial, val varchar(32), FULLTEXT KEY fts_with_ngram (val) WITH PARSER ngram);
Query OK, 0 rows affected (0.12 sec)

mysql83 8> INSERT INTO t1 VALUES (1, '泣かないでbaby');
Query OK, 1 row affected (0.01 sec)

mysql83 8> SELECT * FROM t1 WHERE MATCH(val) AGAINST('baby' IN BOOLEAN MODE);
Empty set (0.00 sec)

mysql83 8> SET GLOBAL innodb_ft_aux_table = 'd1/t1';
Query OK, 0 rows affected (0.00 sec)

mysql83 8> SELECT * FROM information_schema.INNODB_FT_INDEX_CACHE ORDER BY position;
+--------+--------------+-------------+-----------+--------+----------+
| WORD   | FIRST_DOC_ID | LAST_DOC_ID | DOC_COUNT | DOC_ID | POSITION |
+--------+--------------+-------------+-----------+--------+----------+
| 泣か   |            2 |           2 |         1 |      2 |        0 |
| かな   |            2 |           2 |         1 |      2 |        3 |
| ない   |            2 |           2 |         1 |      2 |        6 |
| いで   |            2 |           2 |         1 |      2 |        9 |
| でb    |            2 |           2 |         1 |      2 |       12 |
+--------+--------------+-------------+-----------+--------+----------+
5 rows in set (0.00 sec)
[25 Oct 21:15] Yoshiaki Yamasaki
This bug also reproduces in 8.4.3.

mysql> select version();
+-----------+
| version() |
+-----------+
| 8.4.3     |
+-----------+
1 row in set (0.00 sec)

mysql> SELECT @@ngram_token_size;
+--------------------+
| @@ngram_token_size |
+--------------------+
|                  2 |
+--------------------+
1 row in set (0.00 sec)

mysql> CREATE TABLE t1 (num serial, val varchar(32), FULLTEXT KEY fts_with_ngram (val) WITH PARSER ngram);
Query OK, 0 rows affected (0.11 sec)

mysql> INSERT INTO t1 VALUES (1, '泣かないでbaby');
Query OK, 1 row affected (0.00 sec)

mysql> SELECT * FROM t1 WHERE MATCH(val) AGAINST('baby' IN BOOLEAN MODE);
Empty set (0.00 sec)

mysql> SET GLOBAL innodb_ft_aux_table = 'd1/t1';
Query OK, 0 rows affected (0.00 sec)

mysql> SELECT * FROM information_schema.INNODB_FT_INDEX_CACHE ORDER BY position;
+--------+--------------+-------------+-----------+--------+----------+
| WORD   | FIRST_DOC_ID | LAST_DOC_ID | DOC_COUNT | DOC_ID | POSITION |
+--------+--------------+-------------+-----------+--------+----------+
| 泣か   |            2 |           2 |         1 |      2 |        0 |
| かな   |            2 |           2 |         1 |      2 |        3 |
| ない   |            2 |           2 |         1 |      2 |        6 |
| いで   |            2 |           2 |         1 |      2 |        9 |
| でb    |            2 |           2 |         1 |      2 |       12 |
+--------+--------------+-------------+-----------+--------+----------+
5 rows in set (0.00 sec)