Bug #116696 NGRAM Parser and Boolean Mode Inconsistency with Chinese Punctuation
Submitted: 18 Nov 2024 13:55 Modified: 21 Nov 2024 14:13
Reporter: jianghong sheng Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: FULLTEXT search Severity:S3 (Non-critical)
Version:9.1 OS:Any
Assigned to: CPU Architecture:Any
Tags: full text search

[18 Nov 2024 13:55] jianghong sheng
Description:
When using MySQL’s Full-Text Search with the ngram parser and Boolean mode, Chinese punctuation marks (such as ,) are not correctly handled, but do return results as expected when natural language mode.

How to repeat:
1) Set global variables:

mysql> set global innodb_ft_enable_stopword=OFF;
Query OK, 0 rows affected (0.00 sec)

mysql> show variables like '%ngram_token_size%';
+------------------+-------+
| Variable_name    | Value |
+------------------+-------+
| ngram_token_size | 2     |
+------------------+-------+
1 row in set (0.00 sec)

2) Create test table:

CREATE TABLE `ft` (
    `id` int unsigned NOT NULL AUTO_INCREMENT,
     `text` text,
     PRIMARY KEY (`id`),
     FULLTEXT KEY `ftx_index_text` (`text`) /*!50100 WITH PARSER `ngram` */
   ) ENGINE=InnoDB  DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

3) Insert test data:

INSERT INTO `ft` (`id`, `text`) VALUES (NULL, 'a,b');

4) Check information about full-text index

It shows punctuation in the tokens.

mysql> SELECT * FROM INFORMATION_SCHEMA.INNODB_FT_INDEX_CACHE;
+------+--------------+-------------+-----------+--------+----------+
| WORD | FIRST_DOC_ID | LAST_DOC_ID | DOC_COUNT | DOC_ID | POSITION |
+------+--------------+-------------+-----------+--------+----------+
| ,b  |            2 |           2 |         1 |      2 |        1 |
| a,  |            2 |           2 |         1 |      2 |        0 |
+------+--------------+-------------+-----------+--------+----------+
2 rows in set (0.00 sec)

5)Execute search with natural language mode:

mysql> SELECT * FROM ft WHERE MATCH (text) AGAINST ('a,b');
+----+-------+
| id | text  |
+----+-------+
|  1 | a,b  |
+----+-------+
1 row in set (0.00 sec)

6) Execute search with boolean mode:

mysql> SELECT * FROM ft WHERE MATCH (text) AGAINST ('a,b' IN BOOLEAN MODE);
Empty set (0.01 sec)
[18 Nov 2024 14:15] MySQL Verification Team
Hi Mr. sheng,

Thank you for your bug report.

However, this is not a bug.

Namely, MATCH IN BOOLEAN MODE works with entirely different syntax. 

An example is the one from our Reference Manual:

SELECT * FROM articles WHERE MATCH (title,body)  AGAINST ('+MySQL -YourSQL' IN BOOLEAN MODE);

You simply did not follow the syntax.

Not a bug.
[19 Nov 2024 4:23] jianghong sheng
Thanks for your response. But boolean mode supports queries without explicit usage of these "+" "-" operators. 

As the reference manual says: 

1. By default (when neither + nor - is specified), the word is optional, but the rows that contain it are rated higher.

2. For boolean mode search, the search term is converted to an ngram phrase search.

Here is a test:

mysql> INSERT INTO `ft` (`id`, `text`) values (NULL, '你好吗');
Query OK, 1 row affected (0.00 sec)

mysql> SELECT * FROM ft WHERE MATCH (text) AGAINST ('你好吗' in boolean mode);
+----+-----------+
| id | text      |
+----+-----------+
|  4 | 你好吗    |
+----+-----------+
1 row in set (0.00 sec)

Boolean Mode without operators, get 1 row result.
[19 Nov 2024 4:30] jianghong sheng
SELECT * FROM ft WHERE MATCH (text) AGAINST ('a,b' IN BOOLEAN MODE);

These query is expected to return 1 row.
+----+-------+
| id | text  |
+----+-------+
|  1 | a,b  |
+----+-------+
[19 Nov 2024 10:08] MySQL Verification Team
Hi Mr. sheng,

We tried to reproduce your test case, but without success.

We used our official binaries for MacOS with M3, found on https://dev.mysql.com.

Here are the results that we get and you have same results with both full-text search options:

mysql> set global innodb_ft_enable_stopword=OFF;
Query OK, 0 rows affected (0.00 sec)

mysql>
mysql> set session innodb_ft_enable_stopword=OFF;
Query OK, 0 rows affected (0.00 sec)

mysql>
mysql> show variables like '%ngram_token_size%';
+------------------+-------+
| Variable_name    | Value |
+------------------+-------+
| ngram_token_size | 2     |
+------------------+-------+
1 row in set (0.00 sec)

mysql>
mysql> SET NAMES utf8mb4;
Query OK, 0 rows affected (0.00 sec)

mysql>
mysql> CREATE TABLE `ft` (
    ->          `id` int unsigned NOT NULL AUTO_INCREMENT,
    ->           `text` text,
    ->           PRIMARY KEY (`id`),
    ->           FULLTEXT KEY `ftx_index_text` (`text`)  WITH PARSER `ngram`
    ->         ) ENGINE=InnoDB  DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;
Query OK, 0 rows affected (0.06 sec)

mysql>
mysql> INSERT INTO `ft` (`id`, `text`) VALUES (NULL, 'a,b');
Query OK, 1 row affected (0.00 sec)

mysql>
mysql> SELECT * FROM INFORMATION_SCHEMA.INNODB_FT_INDEX_CACHE;
Empty set (0.00 sec)

mysql>
mysql> SELECT * FROM ft WHERE MATCH (text) AGAINST ('a,b');
Empty set (0.00 sec)

mysql>
mysql> SELECT * FROM ft WHERE MATCH (text) AGAINST ('a,b' IN BOOLEAN MODE);
Empty set (0.00 sec)

mysql>
mysql> select * from ft;
+----+-------+
| id | text  |
+----+-------+
|  1 | a,b  |
+----+-------+
1 row in set (0.00 sec)

mysql>
mysql> drop table ft;
Query OK, 0 rows affected (0.03 sec)

Can't repeat.
[21 Nov 2024 2:56] jianghong sheng
This issue is reproducible on my environment(M3, MacOS 14). After reinstall the latest MySQL(9.1.0) with the official binary. https://dev.mysql.com/get/Downloads/MySQL-9.1/mysql-9.1.0-macos14-arm64.dmg
[21 Nov 2024 10:12] MySQL Verification Team
Hi Mr. Sheng,

May be you have not noticed, but we did correct your test case, when we were reproducing it. Try to reproduce it with our steps , which we have provided .......
[21 Nov 2024 13:35] jianghong sheng
I followed the steps you provided completely. There is indeed a uncorrect result.

mysql> select version();
+-----------+
| version() |
+-----------+
| 9.1.0     |
+-----------+
1 row in set (0.00 sec)

mysql> set global innodb_ft_enable_stopword=OFF;
Query OK, 0 rows affected (0.00 sec)

mysql> set session innodb_ft_enable_stopword=OFF;
Query OK, 0 rows affected (0.00 sec)

mysql> SET NAMES utf8mb4;
Query OK, 0 rows affected (0.00 sec)

mysql> CREATE TABLE `ft` (
    ->              `id` int unsigned NOT NULL AUTO_INCREMENT,
    ->               `text` text,
    ->               PRIMARY KEY (`id`),
    ->              FULLTEXT KEY `ftx_index_text` (`text`)  WITH PARSER `ngram`
    ->      ) ENGINE=InnoDB  DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci; 
Query OK, 0 rows affected (0.03 sec)

mysql> INSERT INTO `ft` (`id`, `text`) VALUES (NULL, 'a,b');
Query OK, 1 row affected (0.01 sec)

mysql> SELECT * FROM INFORMATION_SCHEMA.INNODB_FT_INDEX_CACHE;
Empty set (0.00 sec)

Here is no rows. Because the INNODB_SFTINDEX_CACHE table only has a value after innodb_ft_aux_table is set. So run

SET GLOBAL innodb_ft_aux_table="test/ft";

Will get 2 rows as expected.

mysql> SELECT * FROM INFORMATION_SCHEMA.INNODB_FT_INDEX_CACHE;
+------+--------------+-------------+-----------+--------+----------+
| WORD | FIRST_DOC_ID | LAST_DOC_ID | DOC_COUNT | DOC_ID | POSITION |
+------+--------------+-------------+-----------+--------+----------+
| ,b  |            2 |           2 |         1 |      2 |        1 |
| a,  |            2 |           2 |         1 |      2 |        0 |
+------+--------------+-------------+-----------+--------+----------+
2 rows in set (0.00 sec)

mysql> SELECT * FROM ft WHERE MATCH (text) AGAINST ('a,b');
+----+-------+
| id | text  |
+----+-------+
|  1 | a,b  |
+----+-------+
1 row in set (0.00 sec)

mysql> SELECT * FROM ft WHERE MATCH (text) AGAINST ('a,b' IN BOOLEAN MODE);
Empty set (0.00 sec)

I don't have any other special settings for the database.

If there is any other helpful information I can provide, please let me know.
[21 Nov 2024 13:52] MySQL Verification Team
Hi Mr. sheng,

This is what we get with our official 9.1.0 binary.

Let us know if we have missed anything:

mysql> select version();
+------------------+
| version()        |
+------------------+
| 9.1.0 |
+------------------+
1 row in set (0.00 sec)

mysql>
mysql> set global innodb_ft_enable_stopword=OFF;
Query OK, 0 rows affected (0.00 sec)

mysql>
mysql> set session innodb_ft_enable_stopword=OFF;
Query OK, 0 rows affected (0.00 sec)

mysql>
mysql> SET NAMES utf8mb4;
Query OK, 0 rows affected (0.00 sec)

mysql>
mysql> CREATE TABLE `ft` ( `id` int unsigned NOT NULL AUTO_INCREMENT, `text` text, PRIMARY KEY (`id`),  FULLTEXT KEY `ftx_index_text` (`text`)  WITH PARSER `ngram`  ) ENGINE=InnoDB  DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;
Query OK, 0 rows affected (0.04 sec)

mysql>
mysql> INSERT INTO `ft` (`id`, `text`) VALUES (NULL, 'a,b');
Query OK, 1 row affected (0.00 sec)

mysql>
mysql> SELECT * FROM INFORMATION_SCHEMA.INNODB_FT_INDEX_CACHE;
Empty set (0.00 sec)

mysql>
mysql> SET GLOBAL innodb_ft_aux_table="test/ft";
Query OK, 0 rows affected (0.00 sec)

mysql>
mysql>
mysql> SELECT * FROM INFORMATION_SCHEMA.INNODB_FT_INDEX_CACHE;
Empty set (0.00 sec)

mysql>
mysql> SELECT * FROM ft WHERE MATCH (text) AGAINST ('a,b');
Empty set (0.00 sec)

mysql>
mysql> SELECT * FROM ft WHERE MATCH (text) AGAINST ('a,b' IN BOOLEAN MODE);
Empty set (0.00 sec)
[21 Nov 2024 13:55] MySQL Verification Team
Hi Mr. sheng,

We repeated your results with 8.4.3, though:

mysql> select version();
+------------------+
| version()        |
+------------------+
| 8.4.3|
+------------------+
1 row in set (0.00 sec)

mysql>
mysql> set global innodb_ft_enable_stopword=OFF;
Query OK, 0 rows affected (0.00 sec)

mysql>
mysql> set session innodb_ft_enable_stopword=OFF;
Query OK, 0 rows affected (0.00 sec)

mysql>
mysql> SET NAMES utf8mb4;
Query OK, 0 rows affected (0.00 sec)

mysql>
mysql> CREATE TABLE `ft` ( `id` int unsigned NOT NULL AUTO_INCREMENT, `text` text, PRIMARY KEY (`id`),  FULLTEXT KEY `ftx_index_text` (`text`)  WITH PARSER `ngram`  ) ENGINE=InnoDB  DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;
Query OK, 0 rows affected (0.04 sec)

mysql>
mysql> INSERT INTO `ft` (`id`, `text`) VALUES (NULL, 'a,b');
Query OK, 1 row affected (0.00 sec)

mysql>
mysql> SELECT * FROM INFORMATION_SCHEMA.INNODB_FT_INDEX_CACHE;
+------+--------------+-------------+-----------+--------+----------+
| WORD | FIRST_DOC_ID | LAST_DOC_ID | DOC_COUNT | DOC_ID | POSITION |
+------+--------------+-------------+-----------+--------+----------+
| ,b  |            2 |           2 |         1 |      2 |        1 |
| a,  |            2 |           2 |         1 |      2 |        0 |
+------+--------------+-------------+-----------+--------+----------+
2 rows in set (0.00 sec)

mysql>
mysql> SET GLOBAL innodb_ft_aux_table="test/ft";
Query OK, 0 rows affected (0.00 sec)

mysql>
mysql>
mysql> SELECT * FROM INFORMATION_SCHEMA.INNODB_FT_INDEX_CACHE;
+------+--------------+-------------+-----------+--------+----------+
| WORD | FIRST_DOC_ID | LAST_DOC_ID | DOC_COUNT | DOC_ID | POSITION |
+------+--------------+-------------+-----------+--------+----------+
| ,b  |            2 |           2 |         1 |      2 |        1 |
| a,  |            2 |           2 |         1 |      2 |        0 |
+------+--------------+-------------+-----------+--------+----------+
2 rows in set (0.00 sec)

mysql>
mysql> SELECT * FROM ft WHERE MATCH (text) AGAINST ('a,b');
+----+-------+
| id | text  |
+----+-------+
|  1 | a,b  |
+----+-------+
1 row in set (0.00 sec)

mysql>
mysql> SELECT * FROM ft WHERE MATCH (text) AGAINST ('a,b' IN BOOLEAN MODE);
Empty set (0.00 sec)
[21 Nov 2024 13:59] MySQL Verification Team
Hi Mr. sheng,

We could not repeat your test case with 8.0.40 either ......

Hence, this is now a verified bug that affects latest 8.4 and 9.0 ONLY !!!!!!

Verified as reported.
[21 Nov 2024 14:13] jianghong sheng
It's strange, our steps are all the same. I'm not sure what happened.