| Bug #72548 | Deleted documents not getting skipped when doing IDF calculations | ||
|---|---|---|---|
| Submitted: | 6 May 2014 12:42 | Modified: | 11 Jun 2014 12:10 |
| Reporter: | Matthew Lord | Email Updates: | |
| Status: | Closed | Impact on me: | |
| Category: | MySQL Server: InnoDB storage engine | Severity: | S2 (Serious) |
| Version: | mysql-5.6.18 | OS: | Linux (OL 6.5 x86_64) |
| Assigned to: | Jimmy Yang | CPU Architecture: | Any |
| Tags: | fts, full-text, IDF, ranking, rbt, red-black | ||
[11 Jun 2014 12:10]
Daniel Price
Fixed as of the upcoming 5.6.20, 5.7.5 release, and here's the changelog entry: For single item full-text searches, deleted documents were included inverse document frequency (IDF) calculations. Thank you for the bug report.
[7 Aug 2014 5:46]
Laurynas Biveinis
$ bzr log -n0 -r 5985 ------------------------------------------------------------ revno: 5985 committer: Jimmy Yang <jimmy.yang@oracle.com> branch nick: mysql-5.6 timestamp: Tue 2014-06-10 18:50:10 +0800 message: Fix Bug #18711306 - DELETED DOCUMENTS NOT GETTING SKIPPED WHEN DOING IDF CALCULATIONS rb://5426 approved by Sunny Bains
[7 Aug 2014 5:48]
Laurynas Biveinis
$ bzr log -n0 -r 5988 ------------------------------------------------------------ revno: 5988 committer: Jimmy Yang <jimmy.yang@oracle.com> branch nick: mysql-5.6 timestamp: Thu 2014-06-12 12:05:19 +0800 message: Fix testcase for Bug #18711306. The table row count is statistics value, so it could vary, result in indeterministic results.
[7 Aug 2014 6:54]
Laurynas Biveinis
$ bzr log -n0 -r 6037 ------------------------------------------------------------ revno: 6037 committer: Jimmy Yang <jimmy.yang@oracle.com> branch nick: mysql-5.6 timestamp: Fri 2014-06-27 19:07:21 +0800 message: Back port fixes for innodb_fts_misc.test from trunk, issue introduced by bug #18711306 checkin

Description: When calculating the IDF (inverse document frequency) value used in the relevancy rankings, deleted documents are not getting properly skipped. I first verified the issue in 5.6.15-Enterprise, and examined the results with gdb. I then verified with 5.6.18-Enterprise, and I got the same results. I used OL 6.5 on x86_64 for my tests. How to repeat: CREATE TABLE articles ( id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY, title VARCHAR(200), body TEXT, FULLTEXT (title,body) ) ENGINE=InnoDB; INSERT INTO articles (title,body) VALUES ('Test Article','blah blah blah'),("Matt's Noise",'this is noisy'),('February Weather','It was terrible this year.'),('Peter Pan','Tis a kids story.'),('Test1','nada'),('Database database database','foo database database database'),('Database article title','body with lots of words.'),('myfulltext database', 'my test fulltext database'); SELECT id, title, body, MATCH (title,body) AGAINST ('database' IN BOOLEAN MODE) AS score FROM articles ORDER BY score DESC; DELETE from articles; COMMIT; INSERT INTO articles (title,body) VALUES ('Test Article','blah blah blah'),("Matt's Noise",'this is noisy'),('February Weather','It was terrible this year.'),('Peter Pan','Tis a kids story.'),('Test1','nada'),('Database database database','foo database database database'),('Database article title','body with lots of words.'),('myfulltext database', 'my test fulltext database'); SELECT id, title, body, MATCH (title,body) AGAINST ('database' IN BOOLEAN MODE) AS score FROM articles ORDER BY score DESC; DELETE from articles; COMMIT; INSERT INTO articles (title,body) VALUES ('Test Article','blah blah blah'),("Matt's Noise",'this is noisy'),('February Weather','It was terrible this year.'),('Peter Pan','Tis a kids story.'),('Test1','nada'),('Database database database','foo database database database'),('Database article title','body with lots of words.'),('myfulltext database', 'my test fulltext database'); SELECT id, title, body, MATCH (title,body) AGAINST ('database' IN BOOLEAN MODE) AS score FROM articles ORDER BY score DESC; We have 8 records in total, with 3 of them matching the search term “database”, and the first record (the one with the highest relative relevancy ranking) contains the search term 6 times. The relevancy ranking should be consistent at 1.0886961221694946, but it's not. The deleted documents are factoring into the rankings as they're not properly getting skipped when we do the relevancy ranking IDF calculations. Suggested fix: We should skip deleted documents when calculating the IDF value. I see in the code that this is the intention, so we've just got a bug somewhere.