Bug #92705 ngram full text search - can't find words smaller than ngram_token_size
Submitted: 8 Oct 2018 10:49 Modified: 8 Oct 2018 17:17
Reporter: Tomek Marecki Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: FULLTEXT search Severity:S2 (Serious)
Version:5.7.22/8.0 OS:Ubuntu
Assigned to: CPU Architecture:x86
Tags: boolean, full text search, NGRAM

[8 Oct 2018 10:49] Tomek Marecki
Description:
Using the ngram FTS index, it's not possible to find rows, which are shorter than the ngram_token_size. Please find the below: "How to repeat" SQL queries to reproduce this problem.

Background:
I've a need to implement the: LIKE '%phrase%' search in much more efficient way.
I've decided to go try out ngram FTS index. 
I've set the `ngram_token_size` to maximum possible value (=10), so my index contains all of the ngrams.

How to repeat:
CREATE TABLE `fts` (
  `name` varchar(255) NOT NULL,
  FULLTEXT KEY `fullTextName` (`name`) WITH PARSER `ngram`
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

INSERT INTO `fts` (name) VALUES ('LONG');
INSERT INTO `fts` (name) VALUES ('LONGA');
INSERT INTO `fts` (name) VALUES ('LONGAB');
INSERT INTO `fts` (name) VALUES ('LONGABC');
INSERT INTO `fts` (name) VALUES ('LONGABCD');
INSERT INTO `fts` (name) VALUES ('LONGABCDE');
INSERT INTO `fts` (name) VALUES ('LONGABCDEF');
INSERT INTO `fts` (name) VALUES ('LONGABCDEFG');

SELECT name FROM fts WHERE MATCH (name) AGAINST ('+long*' IN boolean mode);

# This only returns:
LONGABCDEF
LONGABCDEFG

# I would expect this to returns all of the rows, since all of them have `long` inside. `long` is an ngram inside of the word in each row.
[8 Oct 2018 10:59] MySQL Verification Team
Thank you for the bug report. I couldn't repeat with current source server, so assuming it was fixed some what:

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 2
Server version: 5.7.25 Source distribution BUILD: 2018-OCT-04

Copyright (c) 2000, 2018, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql 5.7 > CREATE DATABASE dft;
Query OK, 1 row affected (0,02 sec)

mysql 5.7 > USE dft
Database changed
mysql 5.7 > CREATE TABLE `fts` (
    ->   `name` varchar(255) NOT NULL,
    ->   FULLTEXT KEY `fullTextName` (`name`) WITH PARSER `ngram`
    -> ) ENGINE=MyISAM DEFAULT CHARSET=utf8;

INSERT INTO `fts` (name) VALUES ('LONG');
INSERT INTO `fts` (name) VALUES ('LONGA');
INSERT INTO `fts` (name) VALUES ('LONGAB');
INSERT INTO `fts` (name) VALUES ('LONGABC');
INSERT INTO `fts` (name) VALUES ('LONGABCD');
INSERT INTO `fts` (name) VALUES ('LONGABCDE');
INSERT INTO `fts` (name) VALUES ('LONGABCDEF');
Query OK, 0 rows affected (0,13 sec)

mysql 5.7 >
mysql 5.7 > INSERT INTO `fts` (name) VALUES ('LONG');
Query OK, 1 row affected (0,01 sec)

mysql 5.7 > INSERT INTO `fts` (name) VALUES ('LONGA');
Query OK, 1 row affected (0,00 sec)

mysql 5.7 > INSERT INTO `fts` (name) VALUES ('LONGAB');
Query OK, 1 row affected (0,00 sec)

mysql 5.7 > INSERT INTO `fts` (name) VALUES ('LONGABC');
Query OK, 1 row affected (0,00 sec)

mysql 5.7 > INSERT INTO `fts` (name) VALUES ('LONGABCD');
Query OK, 1 row affected (0,00 sec)

mysql 5.7 > INSERT INTO `fts` (name) VALUES ('LONGABCDE');
Query OK, 1 row affected (0,00 sec)

mysql 5.7 > INSERT INTO `fts` (name) VALUES ('LONGABCDEF');
Query OK, 1 row affected (0,00 sec)

mysql 5.7 > INSERT INTO `fts` (name) VALUES ('LONGABCDEFG');
Query OK, 1 row affected (0,00 sec)

mysql 5.7 >
mysql 5.7 > SELECT name FROM fts WHERE MATCH (name) AGAINST ('+long*' IN boolean mode);
+-------------+
| name        |
+-------------+
| LONG        |
| LONGA       |
| LONGAB      |
| LONGABC     |
| LONGABCD    |
| LONGABCDE   |
| LONGABCDEF  |
| LONGABCDEFG |
+-------------+
8 rows in set (0,00 sec)

mysql 5.7 >
[8 Oct 2018 12:03] Tomek Marecki
I can't see easy way to check it on 5.7.25 version that you're running.

I haven't added (in the: `How to repeat:` section) that you also need to set the: `ngram_token_size` to 10 in the MySQL .cnf configuration.

Can you please try that?
[8 Oct 2018 13:01] MySQL Verification Team
Thank you for the feedback.

miguel@luz:~/dbs $ ./57c
Enter password:
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 3
Server version: 5.7.25 Source distribution BUILD: 2018-OCT-04

Copyright (c) 2000, 2018, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql 5.7 > USE dft
Database changed
mysql 5.7 > SHOW VARIABLES LIKE "%ngr%";
+------------------+-------+
| Variable_name    | Value |
+------------------+-------+
| ngram_token_size | 10    |
+------------------+-------+
1 row in set (0,00 sec)

mysql 5.7 > CREATE TABLE `fts` (
    ->   `name` varchar(255) NOT NULL,
    ->   FULLTEXT KEY `fullTextName` (`name`) WITH PARSER `ngram`
    -> ) ENGINE=MyISAM DEFAULT CHARSET=utf8;

INSERT INTO `fts` (name) VALUES ('LONG');
INSERT INTO `fts` (name) VALUES ('LONGA');
INSERT INTO `fts` (name) VALUES ('LONGAB');
INSERT INTO `fts` (name) VALUES ('LONGABC');
INSERT INTO `fts` (name) VALUES ('LONGABCD');
INSERT INTO `fts` (name) VALUES ('LONGABCDE');
INSERT INTO `fts` (name) VALUES ('LONGABCDEF');
INSERT INTO `fts` (name) VALUES ('LONGABCDEFG');
Query OK, 0 rows affected (0,12 sec)

mysql 5.7 >
mysql 5.7 > INSERT INTO `fts` (name) VALUES ('LONG');
Query OK, 1 row affected (0,01 sec)

mysql 5.7 > INSERT INTO `fts` (name) VALUES ('LONGA');
Query OK, 1 row affected (0,00 sec)

mysql 5.7 > INSERT INTO `fts` (name) VALUES ('LONGAB');
Query OK, 1 row affected (0,00 sec)

mysql 5.7 > INSERT INTO `fts` (name) VALUES ('LONGABC');
Query OK, 1 row affected (0,00 sec)

mysql 5.7 > INSERT INTO `fts` (name) VALUES ('LONGABCD');
Query OK, 1 row affected (0,00 sec)

mysql 5.7 > INSERT INTO `fts` (name) VALUES ('LONGABCDE');
Query OK, 1 row affected (0,00 sec)

mysql 5.7 > INSERT INTO `fts` (name) VALUES ('LONGABCDEF');
Query OK, 1 row affected (0,00 sec)

mysql 5.7 > INSERT INTO `fts` (name) VALUES ('LONGABCDEFG');
Query OK, 1 row affected (0,00 sec)

mysql 5.7 >
mysql 5.7 > SELECT name FROM fts WHERE MATCH (name) AGAINST ('+long*' IN boolean mode);
+-------------+
| name        |
+-------------+
| LONGABCDEF  |
| LONGABCDEFG |
+-------------+
2 rows in set (0,00 sec)

mysql 5.7 >
[8 Oct 2018 17:17] Tomek Marecki
I've also tested this on MySQL 8.0.12, this bug is reproducible.