MySQL Bugs: #17299: Inefficient optimizer for fulltext queries.

Bug #17299	Inefficient optimizer for fulltext queries.
Submitted:	10 Feb 2006 12:35	Modified:	23 Jan 2014 9:53
Reporter:	Andre Timmer	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: Optimizer	Severity:	S3 (Non-critical)
Version:	5.0.18	OS:	Any (All)
Assigned to:		CPU Architecture:	Any

Description:
Conditions in where clause on a fulltext index have a fixed rating according to the optimizer. This results in suboptimal query execution plans.

The expected number of rows is always 1, zie "How to repeat".

When executing a query like:
- ... where match(table1.city)        against ('+amsterdam*'    in boolean mode)
      and    match(table2.company) against ('+unique_name*' in boolean mode)

the order in the execution plan is the same as for the following query

- ... where match(table1.city)        against ('+few_houses*'   in boolean mode)
      and  match(table2.company) against ('+common_name*' in boolean mode)

For the following query the tables should be accessed in a different order.

How to repeat:
==> query
    select
          aa.bkwi_id                   
    ,     aa.bkwi_volgnr               
    ,     kk.fictief_finr              
    ,     kk.inschrijfnr_kvk           
    ,     kk.volgnr_kvk volgnr
    ,     aa.type                      
    ,     aa.ind_economisch_actief     
    ,     aa.ind_nomail                
    ,     aa.klasse_werkzame_personen  
    ,     aa.werkzame_personen         
    ,     aa.werkzame_personen_fulltime
    ,     bb.naam                      
    from
          vestiging        aa
          join sleutel                kk on (kk.bkwi_id = aa.bkwi_id and kk.bkwi_volgnr = aa.bkwi_volgnr)
          join adres_googliaans       cc on (cc.bkwi_id = aa.bkwi_id and cc.bkwi_volgnr = aa.bkwi_volgnr)
          join woonplaats_googliaans  dd on (cc.woonplaats = dd.woonplaats)
          join handelsnaam_googliaans bb on (bb.bkwi_id = aa.bkwi_id and bb.bkwi_volgnr = aa.bkwi_volgnr)
    where
          1=1
    and        match(dd.woonplaats)       against ('+amsterdam*' in boolean mode)
    and        match(bb.naam)       against ('+xyz*' in boolean mode)
    limit 401

==> plan
+----+-------------+-------+----------+-----------------------------------------------------+---------------------------+---------+-----------------------------------+------+-------------+
| id | select_type | table | type     | possible_keys                                       | key                       | key_len | ref                               | rows | Extra       |
+----+-------------+-------+----------+-----------------------------------------------------+---------------------------+---------+-----------------------------------+------+-------------+
|  1 | SIMPLE      | dd    | fulltext | woonplaats_googliaans_i1                            | woonplaats_googliaans_i1  | 0       |                                   |    1 | Using where |
|  1 | SIMPLE      | bb    | fulltext | handelsnaam_googliaans_i1,handelsnaam_googliaans_i2 | handelsnaam_googliaans_i2 | 0       |                                   |    1 | Using where |
|  1 | SIMPLE      | kk    | eq_ref   | PRIMARY                                             | PRIMARY                   | 6       | bbr.bb.bkwi_id,bbr.bb.bkwi_volgnr |    2 |             |
|  1 | SIMPLE      | cc    | ref      | adres_googliaans_i1,adres_googliaans_i4             | adres_googliaans_i1       | 6       | bbr.kk.bkwi_id,bbr.kk.bkwi_volgnr |    2 | Using where |
|  1 | SIMPLE      | aa    | eq_ref   | PRIMARY                                             | PRIMARY                   | 6       | bbr.cc.bkwi_id,bbr.kk.bkwi_volgnr |    1 | Using where |
+----+-------------+-------+----------+-----------------------------------------------------+---------------------------+---------+-----------------------------------+------+-------------+

Suggested fix:
Make optimizer intelligent for fulltext indexes, let it estimate expected nr or rows.

Example:
- query:  ..  match(company) against ('+office* +comp* +a10*' in boolean mode)
- suppose table has 1.000.000 rows 
- suppose 100.000 words start with office
- suppose   50.000 words start with comp
- suppose       100 words start with a10

Then the estimated nr of rows could be:
(100.000 / 1mln + 50.000 / 1mln + 100 / 1mln) * 1mln = 5 rows

On my dataset, the converse of this is also true:  Any boolean query with disjunction (OR's) and a fulltext operation in the where clause is estimated to return all rows in the table, so no index (whether fulltext or any other applicable one) is used.  In fact, if two MATCH functions are OR'ed, the fulltext index doesn't even show up in possible_keys (although this can be solved by lumping all the MATCH'es together).  For example:

select version();
+------------+
| version()  |
+------------+
| 5.0.17-log |
+------------+
1 row in set (0.00 sec)

CREATE TABLE `bigtable` (
  `rank` int(10) unsigned NOT NULL auto_increment,
  `text` char(48) NOT NULL,
  `frequency` int(10) unsigned NOT NULL,
  `sumFreq` bigint(20) unsigned default NULL,
  PRIMARY KEY  (`rank`),
  UNIQUE KEY `textndx` (`text`),
  FULLTEXT KEY `fulltextndx` (`text`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 DATA DIRECTORY='/vol28/mysql/data/textdb/' INDEX DIRECTORY='/vol32/mysql/data/textdb/'

select count(*) from bigtable;
+----------+
| count(*) |
+----------+
|  4329556 |
+----------+
1 row in set (0.00 sec)

explain select * from bigtable where match(text) against ("test" in boolean mode) or text like "car%";
+----+-------------+-------------------------------------------------------------+------+-----------------------+------+---------+------+---------+-------------+
| id | select_type | table                                                       | type | possible_keys         | key  | key_len | ref  | rows    | Extra       |
+----+-------------+-------------------------------------------------------------+------+-----------------------+------+---------+------+---------+-------------+
|  1 | SIMPLE      | bigtable | ALL  | textndx,fulltextndx | NULL | NULL    | NULL | 4329556 | Using where |
+----+-------------+-------------------------------------------------------------+------+-----------------------+------+---------+------+---------+-------------+
1 row in set (0.00 sec)

explain select * from bigtable where match(text) against ("test" in boolean mode) or match(text) against ("car" in boolean mode);
+----+-------------+-------------------------------------------------------------+------+---------------+------+---------+------+---------+-------------+
| id | select_type | table                                                       | type | possible_keys | key  | key_len | ref  | rows    | Extra       |
+----+-------------+-------------------------------------------------------------+------+---------------+------+---------+------+---------+-------------+
|  1 | SIMPLE      | bigtable | ALL  | NULL          | NULL | NULL    | NULL | 4329556 | Using where |
+----+-------------+-------------------------------------------------------------+------+---------------+------+---------+------+---------+-------------+
1 row in set (0.00 sec)

My text values are not particularly odd, but I could not duplicate this with a small test table, except for possible_keys being NULL...but in this case it uses the normal index despite that:

mysql> create table fulltext_and_normal (text_column char(48) not null, unique key text_column_idx (text_column), fulltext key text_column_fulltext_idx (text_column));
Query OK, 0 rows affected (0.05 sec)

mysql> insert into fulltext_and_normal values("eric jensen"), ("e. jensen"), ("jensen");                                                                        
Query OK, 3 rows affected (0.01 sec)
Records: 3  Duplicates: 0  Warnings: 0

mysql> explain select * from fulltext_and_normal where match(text_column) against ("jensen" in boolean mode) or text_column like "eric%";                       +----+-------------+---------------------+-------+------------------------------------------+-----------------+---------+------+------+--------------------------+
| id | select_type | table               | type  | possible_keys                            | key             | key_len | ref  | rows | Extra                    |
+----+-------------+---------------------+-------+------------------------------------------+-----------------+---------+------+------+--------------------------+
|  1 | SIMPLE      | fulltext_and_normal | index | text_column_idx,text_column_fulltext_idx | text_column_idx | 48      | NULL |    3 | Using where; Using index |
+----+-------------+---------------------+-------+------------------------------------------+-----------------+---------+------+------+--------------------------+
1 row in set (0.00 sec)

mysql> explain select * from fulltext_and_normal where match(text_column) against ("jensen" in boolean mode) or match(text_column) against ("eric" in boolean mode);
+----+-------------+---------------------+-------+---------------+-----------------+---------+------+------+--------------------------+
| id | select_type | table               | type  | possible_keys | key             | key_len | ref  | rows | Extra                    |
+----+-------------+---------------------+-------+---------------+-----------------+---------+------+------+--------------------------+
|  1 | SIMPLE      | fulltext_and_normal | index | NULL          | text_column_idx | 48      | NULL |    3 | Using where; Using index |
+----+-------------+---------------------+-------+---------------+-----------------+---------+------+------+--------------------------+
1 row in set (0.00 sec)

The statistics on bigtable should be fine, as I did a REPAIR TABLE when upgrading from mysql 4 to 5 to fix the indicies.  The problem existed in mysql 4 too.

For what it's worth, here is the output of myisam_ftdump on that fulltext index of bigtable:

Total rows: 4329556
Total words: 14197741
Unique words: 1106119
Longest word: 48 chars (000000000000000000000000000000000000000000000000)
Median length: 5
Average global weight: 14.807066
Most common word: 425577 times, weight: 2.216306 (com)

This will be fixed not earlier than in 5.2.