Bug #2101 fulltext index bug - ignores words like "microsoft" (honest!)
Submitted: 11 Dec 2003 19:59 Modified: 12 Dec 2003 14:39
Reporter: Martin Virtel Email Updates:
Status: Can't repeat Impact on me:
None 
Category:MySQL Server Severity:S3 (Non-critical)
Version:4.0.16 OS:Linux (linux)
Assigned to: Dean Ellis CPU Architecture:Any

[11 Dec 2003 19:59] Martin Virtel
Description:
This one is funny - there seems to be a anti-monopoly bias in the fulltext index. You are not able to search for the words "microsoft" and "cisco" - these are the only examples I found, of course I tried "intel" as well.

Below are two typical queries, one with "like", and one with "match", against a table with fulltext index on the field title. "like" returns 250 matches for "microsoft", "match" returns none. Searching with "match" for "microsof*" returns the matches though. This does not apply for "cisc*" or even "cis*".

Strange.

MySQL Version is 4.0.16 on linux. 

I use these settings:
set-variable = ft_min_word_len=2
set-variable = ft_max_word_len=80
set-variable = ft_max_word_len_for_sort=20
set-variable = ft_stopword_file=/var/lib/mysql/stopwords

(stopwords file is empty)

Index Size is less than 4 MB

mysql> select count(title) from CachedStory where title like "%microsoft%";
+--------------+
| count(title) |
+--------------+
|          257 |
+--------------+
1 row in set (0.05 sec)

mysql> select count(title) from CachedStory where match title against ("microsoft" in boolean mode);
+--------------+
| count(title) |
+--------------+
|            0 |
+--------------+
1 row in set (0.00 sec)

mysql> select count(title) from CachedStory where match title against ("microsof*" in boolean mode);
+--------------+
| count(title) |
+--------------+
|           25 |
+--------------+
1 row in set (0.00 sec)

How to repeat:
use test;

drop table if exists fttest ;

create table fttest (
   title  char(80)
   ) ;

create fulltext index title_text on fttest(title);

insert into fttest values (" microsoft cisco doesn't find microsoft ");

select "match 'microsoft' returns: ";
select * from fttest where match title against ("microsoft" in boolean mode);

select "match 'cisco' returns: ";
select * from fttest where match title against ("cisco" in boolean mode);

select "match 'find' returns: ";
select * from fttest where match title against ("find" in boolean mode);

Suggested fix:
talk to the guy who inserted this easter egg
[12 Dec 2003 7:19] Dean Ellis
I cannot repeat this behavior with 4.0.16 or the latest development sources.

Thank you.
[12 Dec 2003 12:58] Martin Virtel
I was not kidding. I re-tried and repeated the bug. Please run the following MySQL script and see if your results differ, that is if you get a count of 1 for both "match" and "like" searches.

MV

Script--------------------------------------------------------------------

use test;

drop table if exists fttest ;

create table fttest (
   title  char(80)
   ) ;

create fulltext index title_text on fttest(title);

insert into fttest values (" microsoft cisco doesn't find microsoft ");

select "match microsoft", count(title) from fttest where match title against ("microsoft" in boolean mode);
select "find microsoft", count(title) from fttest where title like " %microsoft%";

select "match cisco", count(title) from fttest where match title against ("cisco" in boolean mode);
select "find cisco ", count(title) from fttest where title like " %cisco%";

Results------------------------------------------------------------------------

match microsoft count(title)
match microsoft 0
find microsoft  count(title)
find microsoft  1

match cisco     count(title)
match cisco     0
find cisco      count(title)
find cisco      1
[12 Dec 2003 13:09] Dean Ellis
All report a count of 1,  unless I place "microsoft" in a stopwords file...  As you are using an empty stopword file, and as our internal stopword list does not contain "microsoft", this remains unrepeatable.
[12 Dec 2003 14:07] Martin Virtel
Thanks for the persistence. My last idea is to check if differences in the mysql settings can explain what is happening, especially since I use german1 as a character set so I can search for äü etc. Below is my output of "show variables".
M

Variable_name	Value
back_log	50
basedir	/usr/local/mysql-standard-4.0.16-pc-linux-i686/
binlog_cache_size	32768
bulk_insert_buffer_size	8388608
character_set	german1
character_sets	latin1 big5 czech euc_kr gb2312 gbk latin1_de sjis tis620 ujis dec8 dos german1 hp8 koi8_ru latin2 swe7 usa7 cp1251 danish hebrew win1251 estonia hungarian koi8_ukr win1251ukr greek win1250 croat cp1257 latin5
concurrent_insert	ON
connect_timeout	5
convert_character_set	
datadir	/var/lib/mysql.4.0.16/
default_week_format	0
delay_key_write	ON
delayed_insert_limit	100
delayed_insert_timeout	300
delayed_queue_size	1000
flush	OFF
flush_time	0
ft_boolean_syntax	+ -><()~*:""&|
ft_min_word_len	2
ft_max_word_len	80
ft_max_word_len_for_sort	20
ft_stopword_file	/var/lib/mysql/stopwords
have_bdb	NO
have_crypt	YES
have_innodb	YES
have_isam	YES
have_raid	NO
have_symlink	YES
have_openssl	NO
have_query_cache	YES
init_file	
innodb_additional_mem_pool_size	1048576
innodb_buffer_pool_size	8388608
innodb_data_file_path	ibdata1:10M:autoextend
innodb_data_home_dir	
innodb_file_io_threads	4
innodb_force_recovery	0
innodb_thread_concurrency	8
innodb_flush_log_at_trx_commit	1
innodb_fast_shutdown	ON
innodb_flush_method	
innodb_lock_wait_timeout	50
innodb_log_arch_dir	./
innodb_log_archive	OFF
innodb_log_buffer_size	1048576
innodb_log_file_size	5242880
innodb_log_files_in_group	2
innodb_log_group_home_dir	./
innodb_mirrored_log_groups	1
innodb_max_dirty_pages_pct	90
interactive_timeout	28800
join_buffer_size	131072
key_buffer_size	8388600
language	/usr/local/mysql-standard-4.0.16-pc-linux-i686/share/mysql/english/
large_files_support	ON
local_infile	ON
locked_in_memory	OFF
log	OFF
log_update	OFF
log_bin	OFF
log_slave_updates	OFF
log_slow_queries	ON
log_warnings	OFF
long_query_time	10
low_priority_updates	OFF
lower_case_table_names	OFF
max_allowed_packet	1048576
max_binlog_cache_size	4294967295
max_binlog_size	1073741824
max_connections	1000
max_connect_errors	10
max_delayed_threads	20
max_heap_table_size	16777216
max_join_size	4294967295
max_relay_log_size	0
max_seeks_for_key	4294967295
max_sort_length	1024
max_user_connections	0
max_tmp_tables	32
max_write_lock_count	4294967295
myisam_max_extra_sort_file_size	268435456
myisam_max_sort_file_size	2147483647
myisam_repair_threads	1
myisam_recover_options	OFF
myisam_sort_buffer_size	8388608
net_buffer_length	16384
net_read_timeout	30
net_retry_count	10
net_write_timeout	60
new	OFF
open_files_limit	5010
pid_file	/var/lib/mysql.4.0.16//xbox.newsradar.org.pid
log_error	
port	3306
protocol_version	10
query_alloc_block_size	8192
query_cache_limit	1048576
query_cache_size	0
query_cache_type	ON
query_prealloc_size	8192
range_alloc_block_size	2048
read_buffer_size	131072
read_only	OFF
read_rnd_buffer_size	262144
rpl_recovery_rank	0
server_id	0
slave_net_timeout	3600
skip_external_locking	ON
skip_networking	OFF
skip_show_database	OFF
slow_launch_time	2
socket	/var/lib/mysql/mysql.sock
sort_buffer_size	2097144
sql_mode	0
table_cache	64
table_type	MYISAM
thread_cache_size	0
thread_stack	126976
tx_isolation	REPEATABLE-READ
timezone	UTC
tmp_table_size	33554432
tmpdir	/tmp/
transaction_alloc_block_size	8192
transaction_prealloc_size	4096
version	4.0.16-standard-log
wait_timeout	28800
[12 Dec 2003 14:39] Dean Ellis
No luck.  Cannot repeat unless I explicitly place microsoft into the stopwords, even with german1 character set.
[13 Dec 2003 6:52] Martin Virtel
Thanks, Dean. In fact you were right and somebody meddled with the customized stopwords file at our site, that was the source of the bug. Sorry for not checking that earlier - MV