Bug #72548 Deleted documents not getting skipped when doing IDF calculations
Submitted: 6 May 2014 12:42 Modified: 11 Jun 2014 12:10
Reporter: Matthew Lord Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: InnoDB storage engine Severity:S2 (Serious)
Version:mysql-5.6.18 OS:Linux (OL 6.5 x86_64)
Assigned to: Jimmy Yang CPU Architecture:Any
Tags: fts, full-text, IDF, ranking, rbt, red-black

[6 May 2014 12:42] Matthew Lord
Description:
When calculating the IDF (inverse document frequency) value used in the relevancy rankings, deleted documents are not getting properly skipped.

I first verified the issue in 5.6.15-Enterprise, and examined the results with gdb. 

I then verified with 5.6.18-Enterprise, and I got the same results.

I used OL 6.5 on x86_64 for my tests.

How to repeat:
CREATE TABLE articles (
       id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
       title VARCHAR(200),
       body TEXT,
       FULLTEXT (title,body)
  ) ENGINE=InnoDB;
 
INSERT INTO articles (title,body) VALUES ('Test Article','blah blah blah'),("Matt's Noise",'this is noisy'),('February Weather','It was terrible this year.'),('Peter Pan','Tis a kids story.'),('Test1','nada'),('Database database database','foo database database database'),('Database article title','body with lots of words.'),('myfulltext database', 'my test fulltext database');                    

SELECT id, title, body, MATCH (title,body)  AGAINST ('database' IN BOOLEAN MODE) AS score FROM articles ORDER BY score DESC;
DELETE from articles;
COMMIT;

INSERT INTO articles (title,body) VALUES ('Test Article','blah blah blah'),("Matt's Noise",'this is noisy'),('February Weather','It was terrible this year.'),('Peter Pan','Tis a kids story.'),('Test1','nada'),('Database database database','foo database database database'),('Database article title','body with lots of words.'),('myfulltext database', 'my test fulltext database');   
SELECT id, title, body, MATCH (title,body)  AGAINST ('database' IN BOOLEAN MODE) AS score FROM articles ORDER BY score DESC;
DELETE from articles;
COMMIT;

INSERT INTO articles (title,body) VALUES ('Test Article','blah blah blah'),("Matt's Noise",'this is noisy'),('February Weather','It was terrible this year.'),('Peter Pan','Tis a kids story.'),('Test1','nada'),('Database database database','foo database database database'),('Database article title','body with lots of words.'),('myfulltext database', 'my test fulltext database');   
SELECT id, title, body, MATCH (title,body)  AGAINST ('database' IN BOOLEAN MODE) AS score FROM articles ORDER BY score DESC;

We have 8 records in total, with 3 of them matching the search term “database”, and the first record  (the one with the highest relative relevancy ranking) contains the search term 6 times. The relevancy ranking should be consistent at 1.0886961221694946, but it's not. The deleted documents are factoring into the rankings as they're not properly getting skipped when we do the relevancy ranking IDF calculations. 

Suggested fix:
We should skip deleted documents when calculating the IDF value. I see in the code that this is the intention, so we've just got a bug somewhere.
[11 Jun 2014 12:10] Daniel Price
Fixed as of the upcoming 5.6.20, 5.7.5 release, and here's the changelog entry:

For single item full-text searches, deleted documents were included
inverse document frequency (IDF) calculations. 

Thank you for the bug report.
[7 Aug 2014 5:46] Laurynas Biveinis
$ bzr log -n0 -r 5985
------------------------------------------------------------
revno: 5985
committer: Jimmy Yang <jimmy.yang@oracle.com>
branch nick: mysql-5.6
timestamp: Tue 2014-06-10 18:50:10 +0800
message:
  Fix Bug #18711306 - DELETED DOCUMENTS NOT GETTING SKIPPED WHEN DOING IDF
  CALCULATIONS 
  
  rb://5426 approved by Sunny Bains
[7 Aug 2014 5:48] Laurynas Biveinis
$ bzr log -n0 -r 5988
------------------------------------------------------------
revno: 5988
committer: Jimmy Yang <jimmy.yang@oracle.com>
branch nick: mysql-5.6
timestamp: Thu 2014-06-12 12:05:19 +0800
message:
  Fix testcase for Bug #18711306. The table row count is statistics value,
  so it could vary, result in indeterministic results.
[7 Aug 2014 6:54] Laurynas Biveinis
$ bzr log -n0 -r 6037
------------------------------------------------------------
revno: 6037
committer: Jimmy Yang <jimmy.yang@oracle.com>
branch nick: mysql-5.6
timestamp: Fri 2014-06-27 19:07:21 +0800
message:
  Back port fixes for innodb_fts_misc.test from trunk, issue introduced by
  bug #18711306 checkin