Bug #76210 InnoDB FULLTEXT index returns wrong results for key/value pair documents
Submitted: 7 Mar 2015 21:58 Modified: 9 Mar 2015 9:00
Reporter: Justin Swanhart Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: FULLTEXT search Severity:S2 (Serious)
Version:5.6.22, 5.6.25, 5.7.7 OS:Any
Assigned to: CPU Architecture:Any
Tags: fulltext, innodb, innodb fulltext, wrong results

[7 Mar 2015 21:58] Justin Swanhart
Description:
I'm working on methods for indexing JSON for MySQL databases.  I have a UDF that converts JSON documents into key/value pair documents (as a prototype for a parser plugin).  Flatting the JSON to key/value pairs makes it possible to search in nested JSON documents.  Unfortunately, InnoDB FULLTEXT indexes are not handling the flattened documents properly, and a simple search returns tens of thousands of results when only three results should be returned.

How to repeat:
mysql> create table flat(flat mediumtext) engine=innodb;
Query OK, 0 rows affected (0.02 sec)

mysql> load data infile 'flat.txt' into table flat fields enclosed by '"';
Query OK, 18801 rows affected (5.94 sec)
Records: 18801  Deleted: 0  Skipped: 0  Warnings: 0

mysql> alter table flat add fulltext(flat);
Query OK, 0 rows affected, 1 warning (31.17 sec)
Records: 0  Duplicates: 0  Warnings: 1

mysql> select count(*) from flat 
where match(flat) against ('last_name=Vembu') ;
+----------+
| count(*) |
+----------+
|    13625 |
+----------+
1 row in set (0.01 sec)

mysql> alter table flat engine=myisam;
Query OK, 18801 rows affected (9.28 sec)
Records: 18801  Duplicates: 0  Warnings: 0

mysql> select count(*) from flat where match(flat) against ('last_name=Vembu') ;
+----------+
| count(*) |
+----------+
|        3 |
+----------+
1 row in set (0.16 sec)
[7 Mar 2015 22:03] Justin Swanhart
Uploaded to sftp/incoming: mysql-bug-data-76210.flat.txt.xz
[8 Mar 2015 8:09] Justin Swanhart
Moving to S2 because I found a workaround:
match(flat) against('+last_name=Vembu' in boolean mode) works properly
[9 Mar 2015 3:58] Justin Swanhart
It is interesting to note that MyISAM operates the opposite way that InnoDB does - when using binary mode it returns lots of results.

It seems that = is treated as a word separator unless you make a custom collation.

Regardless behavior should, however, be consistent between MyISAM and InnoDB for the same data and same queries so this is still a bug.
[9 Mar 2015 9:00] Umesh Shastry
Hello Justin Swanhart,

Thank you for the report and test case.
Observed reported behavior with 5.6.25 build.

Thanks,
Umesh
[9 Mar 2015 9:01] Umesh Shastry
// 5.6.25

 bin/mysql -uroot -p -S /tmp/mysql_ushastry.sock
Enter password:
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 1
Server version: 5.6.25-enterprise-commercial-advanced MySQL Enterprise Server - Advanced Edition (Commercial)

Copyright (c) 2000, 2015, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> use test
Database changed
mysql> create table flat(flat mediumtext) engine=innodb;
Query OK, 0 rows affected (0.01 sec)

mysql> load data infile '/export/umesh/mysql-5.6.25/flat.txt' into table flat fields enclosed by '"';
Query OK, 18801 rows affected (2.09 sec)
Records: 18801  Deleted: 0  Skipped: 0  Warnings: 0

mysql> alter table flat add fulltext(flat);
Query OK, 0 rows affected, 1 warning (17.70 sec)
Records: 0  Duplicates: 0  Warnings: 1

mysql> show warnings;
+---------+------+--------------------------------------------------+
| Level   | Code | Message                                          |
+---------+------+--------------------------------------------------+
| Warning |  124 | InnoDB rebuilding table to add column FTS_DOC_ID |
+---------+------+--------------------------------------------------+
1 row in set (0.00 sec)

mysql> select count(*) from flat
    -> where match(flat) against ('last_name=Vembu') ;
+----------+
| count(*) |
+----------+
|    13625 |
+----------+
1 row in set (0.01 sec)

mysql> alter table flat engine=myisam;
Query OK, 18801 rows affected (9.62 sec)
Records: 18801  Duplicates: 0  Warnings: 0

mysql> select count(*) from flat where match(flat) against ('last_name=Vembu') ;
+----------+
| count(*) |
+----------+
|        3 |
+----------+
1 row in set (0.03 sec)
[9 Mar 2015 9:07] Umesh Shastry
// 5.7.7

mysql> create table flat(flat mediumtext) engine=innodb;
Query OK, 0 rows affected (0.01 sec)

mysql> load data infile '/export/umesh/mysql-5.6.25/flat.txt' into table flat fields enclosed by '"';
Query OK, 18801 rows affected (16.90 sec)
Records: 18801  Deleted: 0  Skipped: 0  Warnings: 0

mysql> alter table flat add fulltext(flat);
Query OK, 0 rows affected, 1 warning (16.84 sec)
Records: 0  Duplicates: 0  Warnings: 1

mysql> select count(*) from flat
    -> where match(flat) against ('last_name=Vembu') ;
+----------+
| count(*) |
+----------+
|    13625 |
+----------+
1 row in set (0.02 sec)

mysql> alter table flat engine=myisam;
Query OK, 18801 rows affected (9.14 sec)
Records: 18801  Duplicates: 0  Warnings: 0

mysql> select count(*) from flat where match(flat) against ('last_name=Vembu') ;
+----------+
| count(*) |
+----------+
|        3 |
+----------+
1 row in set (0.03 sec)

mysql> show variables like '%version%';
+-------------------------+---------------------------------------------------------+
| Variable_name           | Value                                                   |
+-------------------------+---------------------------------------------------------+
| innodb_version          | 5.7.7                                                   |
| protocol_version        | 10                                                      |
| slave_type_conversions  |                                                         |
| version                 | 5.7.7-rc-enterprise-commercial-advanced                 |
| version_comment         | MySQL Enterprise Server - Advanced Edition (Commercial) |
| version_compile_machine | x86_64                                                  |
| version_compile_os      | Linux                                                   |
+-------------------------+---------------------------------------------------------+
7 rows in set (0.00 sec)