Bug #91437 FULLTEXT index with ngram parser returns unexpected reults
Submitted: 27 Jun 2018 12:04 Modified: 28 Jun 2018 9:12
Reporter: Sylvain Estevez Email Updates:
Status: Duplicate Impact on me:
None 
Category:MySQL Server: FULLTEXT search Severity:S2 (Serious)
Version:5.7.19, 5.7.22 OS:MacOS
Assigned to: CPU Architecture:Any
Tags: fulltext, INDEX, NGRAM, search

[27 Jun 2018 12:04] Sylvain Estevez
Description:
We use a fulltext index with ngram parser to serach for partial matches on user emails. While the featire works very well, we recently ran into a case where the search returns unexpected results.

With the following emails in the DB:
- 'nicolas.lich@toto.com'
- 'Katlynn.Connelly@gmail.com'
- 'nicolasch@foo.com'
- 'alexbraico@me.com'
- 'sylvain+old@flux.com'
- '6@flux.com'

A query like `SELECT `email` FROM `ngram_test` WHERE ((MATCH(`email`) AGAINST('+nico' IN BOOLEAN MODE)));` returns all the users, which is incorrect.

A query like `SELECT `email` FROM `ngram_test` WHERE ((MATCH(`email`) AGAINST('+sylv' IN BOOLEAN MODE)));` returns only one user, which is correct.

See below for more details on reproducing.

How to repeat:
```
CREATE TABLE `ngram_test` (
  email VARCHAR(150) NOT NULL,
  PRIMARY KEY (`email`),
  FULLTEXT KEY `email_ngram` (`email`) /*!50100 WITH PARSER `ngram` */
);

INSERT INTO `ngram_test` (`email`) VALUES ('nicolas.lich@toto.com'), ('Katlynn.Connelly@gmail.com'), ('nicolasch@foo.com'), ('alexbraico@me.com'), ('sylvain+old@flux.com'), ('6@flux.com');

/* This should match values that contain 'ni' AND 'ic' AND 'co' */
SELECT `email`, (MATCH(`email`) AGAINST('+nico' IN BOOLEAN MODE)) as `score` FROM `ngram_test` WHERE ((MATCH(`email`) AGAINST('+nico' IN BOOLEAN MODE)));
```

Additionally:
> SHOW VARIABLES WHERE Variable_name LIKE '%ngram%';

{
	"data":
	[
		{
			"Variable_name": "ngram_token_size",
			"Value": "2"
		}
	]
}

> SHOW VARIABLES WHERE Variable_name LIKE '%ft%';

{
	"data":
	[
		{
			"Variable_name": "ft_boolean_syntax",
			"Value": "+ -><()~*:\"\"&|"
		},
		{
			"Variable_name": "ft_max_word_len",
			"Value": "84"
		},
		{
			"Variable_name": "ft_min_word_len",
			"Value": "4"
		},
		{
			"Variable_name": "ft_query_expansion_limit",
			"Value": "20"
		},
		{
			"Variable_name": "ft_stopword_file",
			"Value": "(built-in)"
		},
		{
			"Variable_name": "innodb_ft_aux_table",
			"Value": "auth_service/user"
		},
		{
			"Variable_name": "innodb_ft_cache_size",
			"Value": "8000000"
		},
		{
			"Variable_name": "innodb_ft_enable_diag_print",
			"Value": "OFF"
		},
		{
			"Variable_name": "innodb_ft_enable_stopword",
			"Value": "ON"
		},
		{
			"Variable_name": "innodb_ft_max_token_size",
			"Value": "84"
		},
		{
			"Variable_name": "innodb_ft_min_token_size",
			"Value": "3"
		},
		{
			"Variable_name": "innodb_ft_num_word_optimize",
			"Value": "2000"
		},
		{
			"Variable_name": "innodb_ft_result_cache_limit",
			"Value": "2000000000"
		},
		{
			"Variable_name": "innodb_ft_server_stopword_table",
			"Value": ""
		},
		{
			"Variable_name": "innodb_ft_sort_pll_degree",
			"Value": "2"
		},
		{
			"Variable_name": "innodb_ft_total_cache_size",
			"Value": "640000000"
		},
		{
			"Variable_name": "innodb_ft_user_stopword_table",
			"Value": ""
		}
	]
}
[27 Jun 2018 13:34] Sylvain Estevez
Given the table in the reproduction steps, and adding a row with `david` as email value: searching for `david` does not return any results.
[28 Jun 2018 1:24] Tsubasa Tanaka
I think this relates Bug#82330.
Workaround is setting empty stopword table.

mysql57> CREATE TABLE stopwords (value varchar(255) NOT NULL PRIMARY KEY);
Query OK, 0 rows affected (0.16 sec)

mysql57> SET GLOBAL innodb_ft_server_stopword_table = 'd1/stopwords';
Query OK, 0 rows affected (0.02 sec)

https://bugs.mysql.com/bug.php?id=82330

(From Japan MySQL User Group)
[28 Jun 2018 9:12] MySQL Verification Team
Hello Sylvain,

Thank you for the report and test case.
This is most likely duplicate of Bug #82330, please see Bug #82330
As Tanaka-San suggested, work around is

#####
mysql> CREATE TABLE stopwords (value varchar(255) NOT NULL PRIMARY KEY);
Query OK, 0 rows affected (0.01 sec)

mysql> SET GLOBAL innodb_ft_server_stopword_table = 'd1/stopwords';
Query OK, 0 rows affected (0.00 sec)

mysql> CREATE TABLE `ngram_test` (
    ->   email VARCHAR(150) NOT NULL,
    ->   PRIMARY KEY (`email`),
    ->   FULLTEXT KEY `email_ngram` (`email`) /*!50100 WITH PARSER `ngram` */
    -> );
Query OK, 0 rows affected (0.04 sec)

mysql> INSERT INTO `ngram_test` (`email`) VALUES ('nicolas.lich@toto.com'), ('Katlynn.Connelly@gmail.com'), ('nicolasch@foo.com'), ('alexbraico@me.com'), ('sylvain+old@flux.com'), ('6@flux.com');
Query OK, 6 rows affected (0.01 sec)
Records: 6  Duplicates: 0  Warnings: 0

mysql> SELECT `email` FROM `ngram_test` WHERE ((MATCH(`email`) AGAINST('+nico' IN BOOLEAN MODE)));
+-----------------------+
| email                 |
+-----------------------+
| nicolas.lich@toto.com |
| nicolasch@foo.com     |
+-----------------------+
2 rows in set (0.00 sec)

mysql> SELECT `email` FROM `ngram_test` WHERE ((MATCH(`email`) AGAINST('+sylv' IN BOOLEAN MODE)));
+----------------------+
| email                |
+----------------------+
| sylvain+old@flux.com |
+----------------------+
1 row in set (0.00 sec)

Thanks,
Umesh