| Bug #92705 | ngram full text search - can't find words smaller than ngram_token_size | ||
|---|---|---|---|
| Submitted: | 8 Oct 2018 10:49 | Modified: | 8 Oct 2018 17:17 |
| Reporter: | Tomek Marecki | Email Updates: | |
| Status: | Verified | Impact on me: | |
| Category: | MySQL Server: FULLTEXT search | Severity: | S2 (Serious) |
| Version: | 5.7.22/8.0 | OS: | Ubuntu |
| Assigned to: | CPU Architecture: | x86 | |
| Tags: | boolean, full text search, NGRAM | ||
[8 Oct 2018 10:59]
MySQL Verification Team
Thank you for the bug report. I couldn't repeat with current source server, so assuming it was fixed some what:
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 2
Server version: 5.7.25 Source distribution BUILD: 2018-OCT-04
Copyright (c) 2000, 2018, Oracle and/or its affiliates. All rights reserved.
Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql 5.7 > CREATE DATABASE dft;
Query OK, 1 row affected (0,02 sec)
mysql 5.7 > USE dft
Database changed
mysql 5.7 > CREATE TABLE `fts` (
-> `name` varchar(255) NOT NULL,
-> FULLTEXT KEY `fullTextName` (`name`) WITH PARSER `ngram`
-> ) ENGINE=MyISAM DEFAULT CHARSET=utf8;
INSERT INTO `fts` (name) VALUES ('LONG');
INSERT INTO `fts` (name) VALUES ('LONGA');
INSERT INTO `fts` (name) VALUES ('LONGAB');
INSERT INTO `fts` (name) VALUES ('LONGABC');
INSERT INTO `fts` (name) VALUES ('LONGABCD');
INSERT INTO `fts` (name) VALUES ('LONGABCDE');
INSERT INTO `fts` (name) VALUES ('LONGABCDEF');
Query OK, 0 rows affected (0,13 sec)
mysql 5.7 >
mysql 5.7 > INSERT INTO `fts` (name) VALUES ('LONG');
Query OK, 1 row affected (0,01 sec)
mysql 5.7 > INSERT INTO `fts` (name) VALUES ('LONGA');
Query OK, 1 row affected (0,00 sec)
mysql 5.7 > INSERT INTO `fts` (name) VALUES ('LONGAB');
Query OK, 1 row affected (0,00 sec)
mysql 5.7 > INSERT INTO `fts` (name) VALUES ('LONGABC');
Query OK, 1 row affected (0,00 sec)
mysql 5.7 > INSERT INTO `fts` (name) VALUES ('LONGABCD');
Query OK, 1 row affected (0,00 sec)
mysql 5.7 > INSERT INTO `fts` (name) VALUES ('LONGABCDE');
Query OK, 1 row affected (0,00 sec)
mysql 5.7 > INSERT INTO `fts` (name) VALUES ('LONGABCDEF');
Query OK, 1 row affected (0,00 sec)
mysql 5.7 > INSERT INTO `fts` (name) VALUES ('LONGABCDEFG');
Query OK, 1 row affected (0,00 sec)
mysql 5.7 >
mysql 5.7 > SELECT name FROM fts WHERE MATCH (name) AGAINST ('+long*' IN boolean mode);
+-------------+
| name |
+-------------+
| LONG |
| LONGA |
| LONGAB |
| LONGABC |
| LONGABCD |
| LONGABCDE |
| LONGABCDEF |
| LONGABCDEFG |
+-------------+
8 rows in set (0,00 sec)
mysql 5.7 >
[8 Oct 2018 12:03]
Tomek Marecki
I can't see easy way to check it on 5.7.25 version that you're running. I haven't added (in the: `How to repeat:` section) that you also need to set the: `ngram_token_size` to 10 in the MySQL .cnf configuration. Can you please try that?
[8 Oct 2018 13:01]
MySQL Verification Team
Thank you for the feedback.
miguel@luz:~/dbs $ ./57c
Enter password:
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 3
Server version: 5.7.25 Source distribution BUILD: 2018-OCT-04
Copyright (c) 2000, 2018, Oracle and/or its affiliates. All rights reserved.
Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql 5.7 > USE dft
Database changed
mysql 5.7 > SHOW VARIABLES LIKE "%ngr%";
+------------------+-------+
| Variable_name | Value |
+------------------+-------+
| ngram_token_size | 10 |
+------------------+-------+
1 row in set (0,00 sec)
mysql 5.7 > CREATE TABLE `fts` (
-> `name` varchar(255) NOT NULL,
-> FULLTEXT KEY `fullTextName` (`name`) WITH PARSER `ngram`
-> ) ENGINE=MyISAM DEFAULT CHARSET=utf8;
INSERT INTO `fts` (name) VALUES ('LONG');
INSERT INTO `fts` (name) VALUES ('LONGA');
INSERT INTO `fts` (name) VALUES ('LONGAB');
INSERT INTO `fts` (name) VALUES ('LONGABC');
INSERT INTO `fts` (name) VALUES ('LONGABCD');
INSERT INTO `fts` (name) VALUES ('LONGABCDE');
INSERT INTO `fts` (name) VALUES ('LONGABCDEF');
INSERT INTO `fts` (name) VALUES ('LONGABCDEFG');
Query OK, 0 rows affected (0,12 sec)
mysql 5.7 >
mysql 5.7 > INSERT INTO `fts` (name) VALUES ('LONG');
Query OK, 1 row affected (0,01 sec)
mysql 5.7 > INSERT INTO `fts` (name) VALUES ('LONGA');
Query OK, 1 row affected (0,00 sec)
mysql 5.7 > INSERT INTO `fts` (name) VALUES ('LONGAB');
Query OK, 1 row affected (0,00 sec)
mysql 5.7 > INSERT INTO `fts` (name) VALUES ('LONGABC');
Query OK, 1 row affected (0,00 sec)
mysql 5.7 > INSERT INTO `fts` (name) VALUES ('LONGABCD');
Query OK, 1 row affected (0,00 sec)
mysql 5.7 > INSERT INTO `fts` (name) VALUES ('LONGABCDE');
Query OK, 1 row affected (0,00 sec)
mysql 5.7 > INSERT INTO `fts` (name) VALUES ('LONGABCDEF');
Query OK, 1 row affected (0,00 sec)
mysql 5.7 > INSERT INTO `fts` (name) VALUES ('LONGABCDEFG');
Query OK, 1 row affected (0,00 sec)
mysql 5.7 >
mysql 5.7 > SELECT name FROM fts WHERE MATCH (name) AGAINST ('+long*' IN boolean mode);
+-------------+
| name |
+-------------+
| LONGABCDEF |
| LONGABCDEFG |
+-------------+
2 rows in set (0,00 sec)
mysql 5.7 >
[8 Oct 2018 17:17]
Tomek Marecki
I've also tested this on MySQL 8.0.12, this bug is reproducible.

Description: Using the ngram FTS index, it's not possible to find rows, which are shorter than the ngram_token_size. Please find the below: "How to repeat" SQL queries to reproduce this problem. Background: I've a need to implement the: LIKE '%phrase%' search in much more efficient way. I've decided to go try out ngram FTS index. I've set the `ngram_token_size` to maximum possible value (=10), so my index contains all of the ngrams. How to repeat: CREATE TABLE `fts` ( `name` varchar(255) NOT NULL, FULLTEXT KEY `fullTextName` (`name`) WITH PARSER `ngram` ) ENGINE=MyISAM DEFAULT CHARSET=utf8; INSERT INTO `fts` (name) VALUES ('LONG'); INSERT INTO `fts` (name) VALUES ('LONGA'); INSERT INTO `fts` (name) VALUES ('LONGAB'); INSERT INTO `fts` (name) VALUES ('LONGABC'); INSERT INTO `fts` (name) VALUES ('LONGABCD'); INSERT INTO `fts` (name) VALUES ('LONGABCDE'); INSERT INTO `fts` (name) VALUES ('LONGABCDEF'); INSERT INTO `fts` (name) VALUES ('LONGABCDEFG'); SELECT name FROM fts WHERE MATCH (name) AGAINST ('+long*' IN boolean mode); # This only returns: LONGABCDEF LONGABCDEFG # I would expect this to returns all of the rows, since all of them have `long` inside. `long` is an ngram inside of the word in each row.