Bug #85876 Fulltext search can not find word which contains "," or "."
Submitted: 10 Apr 2017 8:46 Modified: 24 May 10:31
Reporter: Seunguck Lee Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: FULLTEXT search Severity:S3 (Non-critical)
Version:5.7.17, 5.7.18 OS:Any
Assigned to: CPU Architecture:Any
Tags: boolean, fulltext, NGRAM

[10 Apr 2017 8:46] Seunguck Lee
Description:
Fulltext search can not find word which contains "," or "." in boolean mode BUT natural language mode.

Looks like "." and "," are not a special operator character in boolean mode according to the manual (https://dev.mysql.com/doc/refman/5.7/en/fulltext-boolean.html).
And also ngram parser add words which contains "," and "." to fulltext index.

What is different between natural language and boolean mode ?

And one more thing is that I don't think indexing "," and "." (and including all punctuation marks) are useful. On current implementation, fts index have these punctuation marks but can't search. I think this implementation makes fts index bigger but not useful.

How to repeat:
mysql> show variables like 'ngram_token_size';
+------------------+-------+
| Variable_name    | Value |
+------------------+-------+
| ngram_token_size | 2     |
+------------------+-------+

mysql> set global innodb_ft_enable_stopword=off;
Query OK, 0 rows affected (0.00 sec)

mysql> set innodb_ft_enable_stopword=off;
Query OK, 0 rows affected (0.00 sec)

mysql> CREATE TABLE ft_test3(
    ->   id int not null,
    ->   contents text,
    ->   primary key(id),
    ->   fulltext index fx_contents(contents) with parser ngram
    -> ) engine=innodb;
Query OK, 0 rows affected (0.01 sec)

mysql> set global innodb_ft_aux_table='test/ft_test3';
Query OK, 0 rows affected (0.00 sec)

mysql> insert into ft_test3 values (1, 'xy,yz');
Query OK, 1 row affected (0.00 sec)

mysql> SELECT * FROM INFORMATION_SCHEMA.INNODB_FT_INDEX_CACHE order by position;
+------+--------------+-------------+-----------+--------+----------+
| WORD | FIRST_DOC_ID | LAST_DOC_ID | DOC_COUNT | DOC_ID | POSITION |
+------+--------------+-------------+-----------+--------+----------+
| xy   |            2 |           2 |         1 |      2 |        0 |
| y,   |            2 |           2 |         1 |      2 |        1 |
| ,y   |            2 |           2 |         1 |      2 |        2 |
| yz   |            2 |           2 |         1 |      2 |        3 |
+------+--------------+-------------+-----------+--------+----------+
4 rows in set (0.00 sec)

mysql> select * from ft_test3 where match(contents) against('y,' in boolean mode);
Empty set (0.00 sec)

mysql> select * from ft_test3 where match(contents) against(',y' in boolean mode);
Empty set (0.00 sec)

mysql> select * from ft_test3 where match(contents) against('y,' in natural language mode);
+----+----------+
| id | contents |
+----+----------+
|  1 | xy,yz    |
+----+----------+
1 row in set (0.00 sec)

mysql> select * from ft_test3 where match(contents) against(',y' in natural language mode);
+----+----------+
| id | contents |
+----+----------+
|  1 | xy,yz    |
+----+----------+
1 row in set (0.00 sec)

Suggested fix:
Prevent indexing word which has punctuation marks from fts index.
[11 Apr 2017 6:04] Umesh Shastry
Hello Seunguck Lee,

Thank you for the report and test case.

Thanks,
Umesh
[11 Apr 2017 6:05] Umesh Shastry
##

Master>show variables like 'ngram_token_size';
+------------------+-------+
| Variable_name    | Value |
+------------------+-------+
| ngram_token_size | 2     |
+------------------+-------+
1 row in set (0.01 sec)

Master>set global innodb_ft_enable_stopword=off;
Query OK, 0 rows affected (0.00 sec)

Master>set innodb_ft_enable_stopword=off;
Query OK, 0 rows affected (0.00 sec)

Master>CREATE TABLE ft_test3(
    ->   id int not null,
    ->    contents text,
    ->    primary key(id),fulltext index fx_contents(contents) with parser ngram
    ->  ) engine=innodb;
Query OK, 0 rows affected (0.01 sec)

Master>set global innodb_ft_aux_table='test/ft_test3';
Query OK, 0 rows affected (0.00 sec)

Master> insert into ft_test3 values (1, 'xy,yz');
Query OK, 1 row affected (0.00 sec)

Master>SELECT * FROM INFORMATION_SCHEMA.INNODB_FT_INDEX_CACHE order by position;
+------+--------------+-------------+-----------+--------+----------+
| WORD | FIRST_DOC_ID | LAST_DOC_ID | DOC_COUNT | DOC_ID | POSITION |
+------+--------------+-------------+-----------+--------+----------+
| xy   |            2 |           2 |         1 |      2 |        0 |
| y,   |            2 |           2 |         1 |      2 |        1 |
| ,y   |            2 |           2 |         1 |      2 |        2 |
| yz   |            2 |           2 |         1 |      2 |        3 |
+------+--------------+-------------+-----------+--------+----------+
4 rows in set (0.00 sec)

Master>select * from ft_test3 where match(contents) against('y,' in boolean mode);
Empty set (0.00 sec)

Master>select * from ft_test3 where match(contents) against(',y' in boolean mode);
Empty set (0.00 sec)

Master> select * from ft_test3 where match(contents) against('y,' in natural language mode);
+----+----------+
| id | contents |
+----+----------+
|  1 | xy,yz    |
+----+----------+
1 row in set (0.00 sec)

Master>select * from ft_test3 where match(contents) against(',y' in natural language mode);
+----+----------+
| id | contents |
+----+----------+
|  1 | xy,yz    |
+----+----------+
1 row in set (0.00 sec)

Master>show variables like '%version%';
+-------------------------+------------------------------+
| Variable_name           | Value                        |
+-------------------------+------------------------------+
| innodb_version          | 5.7.18                       |
| protocol_version        | 10                           |
| slave_type_conversions  |                              |
| tls_version             | TLSv1,TLSv1.1                |
| version                 | 5.7.18-log                   |
| version_comment         | MySQL Community Server (GPL) |
| version_compile_machine | x86_64                       |
| version_compile_os      | linux-glibc2.5               |
+-------------------------+------------------------------+
8 rows in set (0.00 sec)

Master>
[24 May 10:31] Daniel Price
Posted by developer:
 
Fixed as of the upcoming 5.7.23, 8.0.12 release, and here's the changelog entry:

The ngram full-text search parser permitted comma and period characters
to be tokenized as words, which caused an inconsistency between boolean
and natural language mode search results. Comma and period characters are
no longer tokenized.