Bug #17705 FT Index corruption occurs with UTF8 data when ft_min_word_len=3
Submitted: 24 Feb 2006 17:57 Modified: 31 Mar 2006 7:03
Reporter: Matthew Lord Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: MyISAM storage engine Severity:S1 (Critical)
Version:4.1.18 OS:independent
Assigned to: Michael Widenius CPU Architecture:Any
Tags: corruption, myisam

[24 Feb 2006 17:57] Matthew Lord
Description:
The table is using the UTF8 charset and it appears that rows which have non latin characters in them 
cause index corruption when updated or deleted when ft_min_word_len is set to 3.

How to repeat:
install latest build
untar TITLE-min.tar.gz in DATADIR (file attached)
./mysqld --no-defaults --ft_min_word_len=3 --skip-innodb --skip-grant-tables --basedir=.. &
./mysql
use chancellor3;
select * from TITLE where TITLE_INTERNAL_ID = 32318; 
delete from TITLE where TITLE_INTERNAL_ID = 32318;

I left some problematic rows along with non-problematic rows in the dataset so while
it's small you can still compare the rows.  As I mentioned the problem seems to be 
with the rows that have non latin1 characters.  Here are the rows that I know are 
problematic (IDs):
32318
35425
39370
[26 Feb 2006 11:23] Sergey Vojtovich
Matthew,

I believe it is not a bug or it has been fixed already. After issuing
REPAIR TABLE statement on the dataset DELETE works just fine.
Also I've noticed (while reading CSC conversation) that myisamchk
was incorrectly used.

Could you please make sure that myisamchk was run correctly? Also it is
wise to use repair table rather than myisamchk in this regard.

Also please make sure that version you use has fixes for bugs:
bug#5686, bug#11336, bug#16489. Last one present since Jan 23.

Excerpt from manual:

If you modify full-text variables that affect indexing (ft_min_word_len, ft_max_word_len, or ft_stopword_file), or if you change the stopword file itself, you must rebuild your FULLTEXT indexes after making the changes and restarting the server. To rebuild the indexes in this case, it is sufficient to do a QUICK repair operation:

mysql> REPAIR TABLE tbl_name QUICK;

With regard specifically to using the IN BOOLEAN MODE capability, if you upgrade from MySQL 3.23 to 4.0 or later, it is necessary to replace the index header as well. To do this, perform a USE_FRM repair operation: 

mysql> REPAIR TABLE tbl_name USE_FRM;

This is necessary because boolean full-text searches require a flag in the index header that was not present in MySQL 3.23, and that is not added if you do only a QUICK repair. If you attempt a boolean full-text search without rebuilding the indexes this way, the search returns incorrect results. 

Note that if you use myisamchk to perform an operation that modifies table indexes (such as repair or analyze), the FULLTEXT indexes are rebuilt using the default full-text parameter values for minimum word length, maximum word length, and stopword file unless you specify otherwise. This can result in queries failing. 

The problem occurs because these parameters are known only by the server. They are not stored in MyISAM index files. To avoid the problem if you have modified the minimum or maximum word length or stopword file values used by the server, specify the same ft_min_word_len, ft_max_word_len, and ft_stopword_file values to myisamchk that you use for mysqld. For example, if you have set the minimum word length to 3, you can repair a table with myisamchk like this: 

shell> myisamchk --recover --ft_min_word_len=3 tbl_name.MYI
[7 Mar 2006 7:14] Sergey Vojtovich
I was able to repeat this problem using new dataset.
[15 Mar 2006 11:00] Sergey Vojtovich
This bug is not about fulltext index, but rather about REPAIR TABLE.
I was able to repeat this bug using trivial table:
CREATE TABLE `t1` (
  `TITLE_DISPLAY` varchar(255) default NULL,
  KEY `title_display_idx` (`TITLE_DISPLAY`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

As a workaround one can dump that table and load it back. It works just fine this way.
[28 Mar 2006 9:21] Sergey Vojtovich
New test case for this bug.

Attachment: t1.sql (application/octet-stream, text), 487 bytes.

[30 Mar 2006 9:38] Michael Widenius
Thank you for your bug report. This issue has been committed to our
source repository of that product and will be incorporated into the
next release.

If necessary, you can access the source repository and build the latest
available version, including the bugfix, yourself. More information 
about accessing the source trees is available at
    http://www.mysql.com/doc/en/Installing_source_tree.html

Additional info:

The problem had nothing to do with FT index, but instead with compression of keys. The bug can appear in MyISAM for any key where the first part is a CHAR/VARCHAR longer than 254 bytes

Fix will be in 4.1.19 an 5.0.20
[31 Mar 2006 7:03] Jon Stephens
Thank you for your bug report. This issue has been committed to our
source repository of that product and will be incorporated into the
next release.

If necessary, you can access the source repository and build the latest
available version, including the bugfix, yourself. More information 
about accessing the source trees is available at
    http://www.mysql.com/doc/en/Installing_source_tree.html

Additional info:

Documented bugfix in 4.1.19 and 5.0.20 changelogs. Closed.
[31 Mar 2006 7:15] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/4354