Bug #21459 myisam_ftdump gives bad counts for common words
Submitted: 5 Aug 2006 16:18 Modified: 4 Oct 2006 20:29
Reporter: Paul Cosway Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: FULLTEXT search Severity:S3 (Non-critical)
Version:5.0.21 OS:Linux (linux)
Assigned to: Sergey Vojtovich CPU Architecture:Any

[5 Aug 2006 16:18] Paul Cosway
Description:
The "per word" loop in myisam_ftdump does not correctly process words (subkeys < 0).   Using the source for the 5.0.9 beta as reference, when the "count" or "stats" options are true, the loop always increments doc_cnt  (on line 00129), rather than add in the subkey count.

Example:  When I run myisam_ftdump with the -d and the -c options, I see very different results:

-c suggests the word rock appears only once:    1           13.6263311 rock
-d tells the true story, that it appears 13,707 times:         9cdb000 =>             13707 rock 

How to repeat:
Build a fulltext index with some frequenly appearing words (more than 200 occurences?).   Run myisam_ftdump with -c and -d options and compare the output for the most frequently appearing words.
[10 Aug 2006 7:48] Sveta Smirnova
Thank you for the report. But 5.0.21 is old enough. Could you please check with current 5.0.24 version?
[18 Aug 2006 11:34] Paul Cosway
As far as I can tell the source for 5.0.24 is identical to that for 5.0.9 beta (which I used to determine there might be a problem).  When was the code for myisam_ftdump.c last changed?
[23 Aug 2006 18:14] Sveta Smirnova
Paul, you are right: mysqm_ftdump.c last changed in April. But I can not catch the problem using my test data and last BK sources:

CREATE TABLE `bug21459` (
  `foo` text NOT NULL,
  FULLTEXT KEY `foo` (`foo`)
) ENGINE=MyISAM;

$myisam_ftdump data/test/bug21459 0
Total rows: 500
Total words: 920
Unique words: 4
Longest word: 7 chars (chicken)
Median length: 4
Average global weight: 0.200815
Most common word: 309 times, weight: -0.481068 (chicken)

$myisam_ftdump -c data/test/bug21459 0
      283           -0.2655495 bird
      309           -0.4810678 chicken
      225            0.2006707 pups
      103            1.3492073 tiger

$myisam_ftdump -d data/test/bug21459 0
...
     3480            0.9775171 bird
     34a8            0.9775171 bird
     34bc            0.9775171 bird
        0            0.9666505 chicken
       28            0.9886308 chicken
       58            0.9886308 chicken
... repeats 309 times ...
     342c            0.9666505 chicken
     3450            0.9886308 chicken
     34e4            0.9775171 chicken
        0            0.9666505 pups
       74            0.9666505 pups
       98            0.9666505 pups

Test data has filled with help of PHP script:
<?php

$link = mysql_connect('localhost:3306', 'root');

mysql_select_db('test');

$data = array('tiger', 'cat', 'dog', 'egg', 'chicken', 'cow', 'bird', 'pups',);

$quan = count($data) - 1;

for ($i =0; $i < 500; $i ++) {
	$p1 = rand(0, $quan);
	$p2 = rand(0, $quan);
	$temp = array_slice($data, $p1 <= $p2 ? $p1 : $p2, $p1 <= $p2 ? $p2 : $p1);
	mysql_query("insert into bug21459 values('" . mysql_escape_string(implode(' ', $temp)) . "')");
}
?>
[23 Aug 2006 18:16] Sveta Smirnova
Paul, if you can to provide data I can repeat the issue, please provide it and change status of bug to "Open".
[24 Aug 2006 6:59] Sergey Vojtovich
A test case for this bug:
CREATE TABLE t1 (a TEXT, FULLTEXT(a));
INSERT INTO t1 VALUES
('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),
('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),
('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),
('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),
('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),
('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),
('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),
('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),
('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),
('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),
('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),
('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),
('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),
('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),
('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),
('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),
('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),
('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),
('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),
('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),('w200'),
('w003'),('w003'),('w003');
[24 Aug 2006 7:08] Sveta Smirnova
Verified on Linux using test case provided by Sergey and last BK sources:

ssmirnova@shella ~/mysql5.0b
$bin/myisam_ftdump data/test/t1 0
Total rows: 203
Total words: 4
Unique words: 2
Longest word: 4 chars (w003)
Median length: 4
Average global weight: 4.753986
Most common word: 3 times, weight: 4.199705 (w003)

ssmirnova@shella ~/mysql5.0b
$bin/myisam_ftdump -c data/test/t1 0
        3            4.1997051 w003
        1            5.3082677 w200

ssmirnova@shella ~/mysql5.0b
$bin/myisam_ftdump -d  data/test/t1 0
      fa0            0.9886308 w003
      fb4            0.9886308 w003
      fc8            0.9886308 w003
      c00 =>               200 w200
[20 Sep 2006 19:17] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/12270

ChangeSet@1.2277, 2006-09-20 17:12:37+05:00, svoj@mysql.com +1 -0
  BUG#21459 - myisam_ftdump gives bad counts for common words
  
  This problem affects myisam_ftdump tool only.
  
  For fulltext index positive subkeys means word weight, negative subkeys
  means number of documents in level 2 fulltext index.
  
  Fixed that document counter was not properly updated for keys having
  level 2 fulltext index.
  
  No test case for this bug.
[3 Oct 2006 8:50] Sergey Vojtovich
Fixed in 5.0.26, 5.1.12.
[4 Oct 2006 20:29] Paul DuBois
Noted in 5.0.26, 5.1.12 changelogs.

mysql_ftdump produced bad counts for common words.