Bug #86164 Fulltext search can not find word which contains punctuation marks
Submitted: 3 May 2017 3:43 Modified: 4 May 2017 12:24
Reporter: Charly Batista Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: FULLTEXT search Severity:S2 (Serious)
Version:5.7.17, 5.7.18 OS:Any
Assigned to: CPU Architecture:Any

[3 May 2017 3:43] Charly Batista
Description:
Fulltext search can not find word which contains punctuation marks like "!?,.{}/" in boolean mode BUT it does in natural language mode.

There is one case reported for "." and "," but according our tests it includes much more punctuation marks.

The reported case is https://bugs.mysql.com/bug.php?id=85876 

How to repeat:
The problem can be reproduced as follows:

================================= 
CREATE TABLE `test` ( 
`seq` int(11) NOT NULL AUTO_INCREMENT, 
`cont` text COLLATE utf8mb4_bin, 
PRIMARY KEY (`seq`), 
FULLTEXT KEY `fx_txts` (`cont`) /*!50100 WITH PARSER `ngram` */ 
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin;

insert into test (cont) values ('aBc!efg'); 
insert into test (cont) values ('aBc@efg'); 
insert into test (cont) values ('aBc#efg'); 
insert into test (cont) values ('aBc$efg'); 
insert into test (cont) values ('aBc%efg'); 
insert into test (cont) values ('abc^efg'); 
insert into test (cont) values ('abc&efg'); 
insert into test (cont) values ('abc*efg'); 
insert into test (cont) values ('abc(efg'); 
insert into test (cont) values ('abc)efg');

Then try to select whatever value using boolean mode:
mysql [localhost] {msandbox} (test) > select * from test where match(cont) against ('"c!e"' in boolean mode); 
Empty set (0.00 sec)

mysql [localhost] {msandbox} (test) > select * from test where match(cont) against ('"c#e"' in boolean mode); 
Empty set (0.00 sec)

mysql [localhost] {msandbox} (test) > select * from test where match(cont) against ('"c$e"' in boolean mode); 
Empty set (0.00 sec)

mysql [localhost] {msandbox} (test) > select * from test where match(cont) against ('"c!e"' in boolean mode);
Empty set (0.00 sec)

If change to natural language mode it works:
mysql [localhost] {msandbox} (test) > select * from test where match(cont) against ('"c!e"' in natural language mode);
+-----+---------+
| seq | cont    |
+-----+---------+
|   1 | aBc!efg |
|  11 | abc!efg |
+-----+---------+
2 rows in set (0.00 sec)

The index entries are created as can be seen in INFORMATION_SCHEMA.INNODB_FT_INDEX_CACHE:

mysql [localhost] {msandbox} (test) > SELECT * FROM INFORMATION_SCHEMA.INNODB_FT_INDEX_CACHE;                                                                                                                                                 +------+--------------+-------------+-----------+--------+----------+
| WORD | FIRST_DOC_ID | LAST_DOC_ID | DOC_COUNT | DOC_ID | POSITION |
+------+--------------+-------------+-----------+--------+----------+
| !e   |            2 |          12 |         2 |      2 |        3 |
| !e   |            2 |          12 |         2 |     12 |        3 |
| #e   |            4 |           4 |         1 |      4 |        3 |
| $e   |            5 |           5 |         1 |      5 |        3 |
| %e   |            6 |           6 |         1 |      6 |        3 |
| &e   |            8 |           8 |         1 |      8 |        3 |
| (e   |           10 |          10 |         1 |     10 |        3 |
| )e   |           11 |          11 |         1 |     11 |        3 |
| *e   |            9 |           9 |         1 |      9 |        3 |
| @e   |            3 |           3 |         1 |      3 |        3 |
| ^e   |            7 |           7 |         1 |      7 |        3 |
<...>
| c!   |            2 |          12 |         2 |      2 |        2 |
| c!   |            2 |          12 |         2 |     12 |        2 |
| c#   |            4 |           4 |         1 |      4 |        2 |
| c$   |            5 |           5 |         1 |      5 |        2 |
| c%   |            6 |           6 |         1 |      6 |        2 |
| c&   |            8 |           8 |         1 |      8 |        2 |
| c(   |           10 |          10 |         1 |     10 |        2 |
| c)   |           11 |          11 |         1 |     11 |        2 |
| c*   |            9 |           9 |         1 |      9 |        2 |
| c@   |            3 |           3 |         1 |      3 |        2 |
| c^   |            7 |           7 |         1 |      7 |        2 |
<...>
+------+--------------+-------------+-----------+--------+----------+
55 rows in set (0.00 sec)
[4 May 2017 12:24] Umesh Shastry
Hello Charly Batista,

Thank you for the report and test case.

Thanks,
Umesh
[8 May 2017 11:49] Sveta Smirnova
ngram parser is not required to repeat this bug. It is repeatable with MyISAM as well.