Description:
I have a table with 10+ million rows of short English sentences in a single column.
I then create a FULLTEXT index on that column and want to use myisam_ftdump with the -c option to get a count for each word. I am using no stopwords and a minimum word length of 1. The text column uses the utf8 character set.
Words with a few instances seem to be reported correctly, but very common words are repeated with different counts hundreds of times. For example, the section of output around the word "a" is:
1 16.2598874 9x19
1 16.2598874 9x20
1 16.2598874 9x21
1 16.2598874 9x22
2 15.5667401 9âº
2 15.5667401 9â½
710 9.6945609 a
1 16.2598874 ã¡
108283 4.6579427 a
1 16.2598874 ã¡
30214 5.9412014 a
1 16.2598874 ã€
36404 5.7542894 a
1 16.2598874 ã
58446 5.2789441 a
1 16.2598874 ã¤
7196 7.3779822 a
1 16.2598874 ã
173 11.1065808 a
1 16.2598874 ã
685 9.7304092 a
1 16.2598874 ã¡
17009 6.5169124 a
1 16.2598874 ãƒ
1123 9.2360310 a
1 16.2598874 ãƒ
(............ 330 lines removed ............)
91 11.7490200 a
1 16.2598874 ã€
19206 6.3952413 a
1 16.2598874 ã€
89494 4.8501640 a
2 15.5667401 a0
30 12.8586875 a1
1 16.2598874 a10
1 16.2598874 a108
1 16.2598874 a10s
2 15.5667401 a11
2 15.5667401 a1213669b1
1 16.2598874 a13
1 16.2598874 a14
2 15.5667401 a15
1 16.2598874 a17
1 16.2598874 a18o
3 15.1612749 a1c
1 16.2598874 a1most
1 16.2598874 a1o
14 13.6208289 a2
1 16.2598874 a21
1 16.2598874 a217
1 16.2598874 a220
1 16.2598874 a25s
1 16.2598874 a27390938662
1 16.2598874 a29
1 16.2598874 a2c
1 16.2598874 a2e
1 16.2598874 a2o
2 15.5667401 a2ound
1 16.2598874 a2s
2 15.5667401 a2team
14 13.6208289 a3
1 16.2598874 a300
1 16.2598874 a303
1 16.2598874 a316
1 16.2598874 a320
1 16.2598874 a33
2 15.5667401 a34
1 16.2598874 a35
Of course, there should be only one line for "a", not hundreds of them.
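A small script can check whether a dump is affected by flagging words that appear on more than one line of the -c output. This is only a sketch, assuming the output format shown above (count, weight, word, separated by whitespace):

```python
from collections import Counter

def find_duplicated_words(lines):
    """Given lines of `myisam_ftdump -c` output (count, weight, word),
    return the words that appear on more than one line."""
    seen = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip malformed or blank lines
        seen[parts[2]] += 1
    return sorted(word for word, n in seen.items() if n > 1)

# A few lines shaped like the output above:
sample = [
    "710 9.6945609 a",
    "108283 4.6579427 a",
    "2 15.5667401 a0",
]
print(find_duplicated_words(sample))  # prints ['a']; a healthy dump would yield []
```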
The word "you" is not as extreme, but also gets repeated twice:
1 16.2598874 yotoknow
1 16.2598874 yotsukaido
1 16.2598874 yotsuya
1 16.2598874 yotwo
59 12.1823449 yotz
1 16.2598874 yotzah
1245944 2.1100561 you
1 16.2598874 yã³u
1794728 1.6902205 you
1 16.2598874 you150
1 16.2598874 you20
61 12.1490083 youa
1 16.2598874 youab
Likewise with "the":
1 16.2598874 thdog
1 16.2598874 thdrawals
1 16.2598874 thds
1002054 2.3513511 the
1 16.2598874 thã¨
1152892 2.1966893 the
4 14.8735927 the1
1 16.2598874 the18th
1 16.2598874 the27th
In addition, summing the per-word counts gives a wrong overall total. When I sum the counts listed for "a" above, the resulting number is 710. Independently I counted the actual occurrences of the word "a" in my table, and it is roughly 1,659,895 (which makes sense, given the overall count for "you").
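Until the bug is fixed, the duplicated lines can at least be merged after the fact by summing the counts per word. A sketch, again assuming the three-column -c output format, with the count in the first field and the word in the third:

```python
from collections import defaultdict

def total_counts(lines):
    """Sum the first column of `myisam_ftdump -c` lines per word,
    merging duplicate entries for the same word."""
    totals = defaultdict(int)
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip malformed or blank lines
        count, _weight, word = parts
        totals[word] += int(count)
    return dict(totals)

# The two "you" lines from the output above:
sample = [
    "1245944 2.1100561 you",
    "1 16.2598874 y\u00e3\u00b3u",
    "1794728 1.6902205 you",
]
print(total_counts(sample)["you"])  # prints 3040672
```

Note that this only repairs the per-word totals; the weights in the second column cannot be meaningfully combined this way.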
I suspect this bug may be related to these two:
http://bugs.mysql.com/bug.php?id=21459
http://bugs.mysql.com/bug.php?id=38356
How to repeat:
Unfortunately, my files are so large that I'm not sure uploading them would be helpful, but if it's absolutely necessary, I can try to produce a small test case -- let me know.
Suggested fix:
I suspect it has something to do with UTF-8-encoded strings, because in every case a word like "a" alternates with a badly decoded UTF-8 sequence like "ã¢" (see bug http://bugs.mysql.com/bug.php?id=38356).
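The shape of the garbage entries is consistent with UTF-8 bytes being interpreted as single-byte (latin1) characters: a two-byte UTF-8 letter then shows up as two characters starting with "Ã", which becomes "ã" after lowercasing. The following is my own illustration of that suspected failure mode, not a confirmed diagnosis of the code path involved:

```python
# "á" is encoded in UTF-8 as the two bytes 0xC3 0xA1.
utf8_bytes = "\u00e1".encode("utf-8")
print(list(utf8_bytes))  # prints [195, 161]

# Decoding those bytes as latin-1 yields "Ã¡"; after lowercasing
# (as the word entries in the dump appear to be), that is "ã¡" --
# exactly the kind of entry interleaved with "a" in the output above.
mojibake = utf8_bytes.decode("latin-1").lower()
print(mojibake)  # prints ã¡
```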