Bug #49347 myisam_ftdump repeating rows in -c mode
Submitted: 2 Dec 2009 13:06 Modified: 2 Jan 2010 13:44
Reporter: Michael Baldwin Email Updates:
Status: No Feedback Impact on me:
None 
Category:MySQL Server: FULLTEXT search Severity:S2 (Serious)
Version:5.1.34-community OS:Windows
Assigned to: CPU Architecture:Any
Tags: myisam_ftdump counts

[2 Dec 2009 13:06] Michael Baldwin
Description:
I have a table with 10+ million rows of short English sentences in a single column.

I then create a FULLTEXT index on them and want to use myisam_ftdump with the -c option to get a count for each word. I am using no stopwords and a minimum word length of 1. The text field is in UTF-8 format.

Words with a few instances seem to be reported correctly, but very common words are repeated with different counts hundreds of times. For example, the section of output around the word "a" is:

        1           16.2598874 9x19
        1           16.2598874 9x20
        1           16.2598874 9x21
        1           16.2598874 9x22
        2           15.5667401 9âº
        2           15.5667401 9â½
      710            9.6945609 a
        1           16.2598874 ã¡
   108283            4.6579427 a
        1           16.2598874 ã¡
    30214            5.9412014 a
        1           16.2598874 ã€
    36404            5.7542894 a
        1           16.2598874 ã
    58446            5.2789441 a
        1           16.2598874 ã¤
     7196            7.3779822 a
        1           16.2598874 ã
      173           11.1065808 a
        1           16.2598874 ã
      685            9.7304092 a
        1           16.2598874 ã¡
    17009            6.5169124 a
        1           16.2598874 ãƒ
     1123            9.2360310 a
        1           16.2598874 ãƒ
(............ 330 lines removed ............)
       91           11.7490200 a
        1           16.2598874 ã€
    19206            6.3952413 a
        1           16.2598874 ã€
    89494            4.8501640 a
        2           15.5667401 a0
       30           12.8586875 a1
        1           16.2598874 a10
        1           16.2598874 a108
        1           16.2598874 a10s
        2           15.5667401 a11
        2           15.5667401 a1213669b1
        1           16.2598874 a13
        1           16.2598874 a14
        2           15.5667401 a15
        1           16.2598874 a17
        1           16.2598874 a18o
        3           15.1612749 a1c
        1           16.2598874 a1most
        1           16.2598874 a1o
       14           13.6208289 a2
        1           16.2598874 a21
        1           16.2598874 a217
        1           16.2598874 a220
        1           16.2598874 a25s
        1           16.2598874 a27390938662
        1           16.2598874 a29
        1           16.2598874 a2c
        1           16.2598874 a2e
        1           16.2598874 a2o
        2           15.5667401 a2ound
        1           16.2598874 a2s
        2           15.5667401 a2team
       14           13.6208289 a3
        1           16.2598874 a300
        1           16.2598874 a303
        1           16.2598874 a316
        1           16.2598874 a320
        1           16.2598874 a33
        2           15.5667401 a34
        1           16.2598874 a35

Of course, there should only be one line for "a", not multiple ones.

The word "you" is not as extreme, but also gets repeated twice:

        1           16.2598874 yotoknow
        1           16.2598874 yotsukaido
        1           16.2598874 yotsuya
        1           16.2598874 yotwo
       59           12.1823449 yotz
        1           16.2598874 yotzah
  1245944            2.1100561 you
        1           16.2598874 yã³u
  1794728            1.6902205 you
        1           16.2598874 you150
        1           16.2598874 you20
       61           12.1490083 youa
        1           16.2598874 youab

Likewise with "the":

        1           16.2598874 thdog
        1           16.2598874 thdrawals
        1           16.2598874 thds
  1002054            2.3513511 the
        1           16.2598874 thã¨
  1152892            2.1966893 the
        4           14.8735927 the1
        1           16.2598874 the18th
        1           16.2598874 the27th

In addition, the result when summing the overall counts is wrong. When I sum the instances of "a" above, the resulting number is 710. Independently I calculated the true number of word "a"s in my table, and it is roughly 1,659,895 (which makes sense, given the overall count of "you").

I suspect the bug could possibly be related to these two:

http://bugs.mysql.com/bug.php?id=21459
http://bugs.mysql.com/bug.php?id=38356

How to repeat:
Unfortunately my files are so large I'm not sure it would be helpful to upload, but if it's absolutely necessary, I can try to produce a small test case -- let me know.

Suggested fix:
I suspect it has something to do with UTF-8 encoded strings, because in every case, a word like "a" alternates with a (bad) UTF-8 encoding like "ã¢" (see bug http://bugs.mysql.com/bug.php?id=38356).
[2 Dec 2009 13:13] Michael Baldwin
Sorry, please ignore the bit about undercounting the "a" -- I'd had a mistake in my query that was not summing them, the 710 refers to the first instance only, obviously.

But the problem with repeating rows continues to clearly be a bug...
[2 Dec 2009 13:15] Michael Baldwin
(editing title of bug)
[2 Dec 2009 13:44] MySQL Verification Team
Please try with last version released. Thanks in advance.
[3 Jan 2010 0:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".