Bug #107329 Fulltext search returns different results based on execution plan
Submitted: 18 May 11:59 Modified: 18 May 15:32
Reporter: Max Jenkins Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: FULLTEXT search Severity:S3 (Non-critical)
Version:8.0.29 OS:Linux
Assigned to: CPU Architecture:Any
Tags: regression

[18 May 11:59] Max Jenkins
Description:
I have a reasonably simple query involving a fulltext MATCH AGAINST clause and I appear to get different results depending on the execution plan chosen. One plan returns the correct results, however the alternative plan seems to be missing some results that should be returned.

How to repeat:
CREATE SCHEMA `bug_test` ;

CREATE TABLE `bug_test`.`companies` (
  `id` INT(10) UNSIGNED NOT NULL AUTO_INCREMENT,
  `is_active` TINYINT(1) NOT NULL DEFAULT 0,
  PRIMARY KEY (`id`),
  INDEX `is_active_idx` (`is_active` ASC) VISIBLE);

CREATE TABLE `bug_test`.`company_phrases` (
  `id` INT(10) UNSIGNED NOT NULL,
  `value` LONGTEXT NULL,
  `company_id` INT(10) NOT NULL,
  PRIMARY KEY (`id`),
  INDEX `company_id_idx` (`company_id` ASC) VISIBLE,
  FULLTEXT INDEX `value_ft_idx` (`value`) VISIBLE);

INSERT INTO `bug_test`.`companies` (`id`, `is_active`) VALUES ('1', '1');
INSERT INTO `bug_test`.`companies` (`id`, `is_active`) VALUES ('2', '1');
INSERT INTO `bug_test`.`companies` (`id`, `is_active`) VALUES ('3', '1');
INSERT INTO `bug_test`.`companies` (`id`, `is_active`) VALUES ('4', '1');
INSERT INTO `bug_test`.`companies` (`id`, `is_active`) VALUES ('5', '1');
INSERT INTO `bug_test`.`companies` (`id`, `is_active`) VALUES ('6', '1');
INSERT INTO `bug_test`.`companies` (`id`, `is_active`) VALUES ('7', '1');
INSERT INTO `bug_test`.`companies` (`id`, `is_active`) VALUES ('8', '1');
INSERT INTO `bug_test`.`companies` (`id`, `is_active`) VALUES ('9', '1');
INSERT INTO `bug_test`.`companies` (`id`, `is_active`) VALUES ('10', '1');

INSERT INTO `bug_test`.`company_phrases` (`id`, `company_id`) VALUES ('1', '1');
INSERT INTO `bug_test`.`company_phrases` (`id`, `company_id`) VALUES ('2', '2');
INSERT INTO `bug_test`.`company_phrases` (`id`, `company_id`) VALUES ('3', '3');
INSERT INTO `bug_test`.`company_phrases` (`id`, `company_id`) VALUES ('4', '4');
INSERT INTO `bug_test`.`company_phrases` (`id`, `company_id`) VALUES ('5', '5');
INSERT INTO `bug_test`.`company_phrases` (`id`, `company_id`) VALUES ('6', '6');
INSERT INTO `bug_test`.`company_phrases` (`id`, `company_id`) VALUES ('7', '7');
INSERT INTO `bug_test`.`company_phrases` (`id`, `company_id`) VALUES ('8', '8');
INSERT INTO `bug_test`.`company_phrases` (`id`, `company_id`) VALUES ('9', '9');
INSERT INTO `bug_test`.`company_phrases` (`id`, `company_id`) VALUES ('10', '10');
INSERT INTO `bug_test`.`company_phrases` (`id`, `value`, `company_id`) VALUES ('11', 'something random first phrase something else', '1');
INSERT INTO `bug_test`.`company_phrases` (`id`, `value`, `company_id`) VALUES ('12', 'something random second phrase something else', '1');
INSERT INTO `bug_test`.`company_phrases` (`id`, `value`, `company_id`) VALUES ('13', 'something random third phrase something else', '1');
INSERT INTO `bug_test`.`company_phrases` (`id`, `value`, `company_id`) VALUES ('14', 'something random second phrase something else', '1');
INSERT INTO `bug_test`.`company_phrases` (`id`, `value`, `company_id`) VALUES ('15', 'something random second phrase something else', '2');
INSERT INTO `bug_test`.`company_phrases` (`id`, `value`, `company_id`) VALUES ('16', 'something random first phrase something else', '2');
INSERT INTO `bug_test`.`company_phrases` (`id`, `value`, `company_id`) VALUES ('17', 'something random first phrase something else', '2');
INSERT INTO `bug_test`.`company_phrases` (`id`, `value`, `company_id`) VALUES ('18', 'something random first phrase something else', '10');
INSERT INTO `bug_test`.`company_phrases` (`id`, `value`, `company_id`) VALUES ('20', 'something random first phrase something else', '7');
INSERT INTO `bug_test`.`company_phrases` (`id`, `value`, `company_id`) VALUES ('21', 'something random second phrase something else', '7');
INSERT INTO `bug_test`.`company_phrases` (`id`, `value`, `company_id`) VALUES ('22', 'something random third phrase something else', '7');
INSERT INTO `bug_test`.`company_phrases` (`id`, `value`, `company_id`) VALUES ('23', 'something random first phrase something else', '5');
INSERT INTO `bug_test`.`company_phrases` (`id`, `value`, `company_id`) VALUES ('24', 'something random second phrase something else', '5');
INSERT INTO `bug_test`.`company_phrases` (`id`, `value`, `company_id`) VALUES ('25', 'something random third phrase something else', '5');
INSERT INTO `bug_test`.`company_phrases` (`id`, `value`, `company_id`) VALUES ('26', 'something random third phrase something else', '8');
INSERT INTO `bug_test`.`company_phrases` (`id`, `value`, `company_id`) VALUES ('27', 'second phrase', '4');
INSERT INTO `bug_test`.`company_phrases` (`id`, `value`, `company_id`) VALUES ('28', 'third phrase', '4');
INSERT INTO `bug_test`.`company_phrases` (`id`, `value`, `company_id`) VALUES ('29', 'second phrase', '4');
INSERT INTO `bug_test`.`company_phrases` (`id`, `value`, `company_id`) VALUES ('30', 'second phrase', '8');

USE bug_test;

Now if I run the following query:

SELECT companies.id 
FROM companies 
WHERE companies.is_active = 1
AND EXISTS (
    SELECT *
    FROM company_phrases 
    WHERE company_phrases.company_id = companies.id
    AND MATCH(company_phrases.value) AGAINST ('+second +phrase' IN BOOLEAN MODE) 
);

The explain plan for this query looks like:
+----+-------------+-----------------+------------+----------+-------------------------------+--------------+---------+-------------------------------------+------+----------+----------------------------------------------------+
| id | select_type | table           | partitions | type     | possible_keys                 | key          | key_len | ref                                 | rows | filtered | Extra                                              |
+----+-------------+-----------------+------------+----------+-------------------------------+--------------+---------+-------------------------------------+------+----------+----------------------------------------------------+
|  1 | SIMPLE      | company_phrases | NULL       | fulltext | company_id_idx,value_ft_idx   | value_ft_idx | 0       | const                               |    1 |   100.00 | Using where; Ft_hints: no_ranking; Start temporary |
|  1 | SIMPLE      | companies       | NULL       | eq_ref   | PRIMARY,is_active_idx,bw_scan | PRIMARY      | 4       | bug_test.company_phrases.company_id |    1 |   100.00 | Using where; End temporary                         |
+----+-------------+-----------------+------------+----------+-------------------------------+--------------+---------+-------------------------------------+------+----------+----------------------------------------------------+

I get the correct results:
+----+
| id |
+----+
|  1 |
|  2 |
|  7 |
|  5 |
|  4 |
|  8 |
+----+

However if I modify the query slightly by adding an ORDER BY clause onto it in order to change the execution plan to do a backward index scan instead, using the following query:

SELECT companies.id 
FROM companies 
WHERE companies.is_active = 1
AND EXISTS (
    SELECT *
    FROM company_phrases 
    WHERE company_phrases.company_id = companies.id
    AND MATCH(company_phrases.value) AGAINST ('+second +phrase' IN BOOLEAN MODE) 
) order by companies.id DESC;

The EXPLAIN plan for this query looks like:
+----+-------------+-----------------+------------+----------+-----------------------------+---------------+---------+-------+------+----------+----------------------------------------------------------+
| id | select_type | table           | partitions | type     | possible_keys               | key           | key_len | ref   | rows | filtered | Extra                                                    |
+----+-------------+-----------------+------------+----------+-----------------------------+---------------+---------+-------+------+----------+----------------------------------------------------------+
|  1 | SIMPLE      | companies       | NULL       | ref      | PRIMARY,is_active_idx       | is_active_idx | 1       | const |   10 |   100.00 | Backward index scan; Using index                         |
|  1 | SIMPLE      | company_phrases | NULL       | fulltext | company_id_idx,value_ft_idx | value_ft_idx  | 0       | const |    1 |    10.00 | Using where; Ft_hints: no_ranking; FirstMatch(companies) |
+----+-------------+-----------------+------------+----------+-----------------------------+---------------+---------+-------+------+----------+----------------------------------------------------------+

And the incorrect results I get are:
+----+
| id |
+----+
|  8 |
|  5 |
|  4 |
|  2 |
+----+

As you can see I am now missing records 1 and 7 from my results even though all I did was add an ORDER BY clause to my query.

I can re-create this issue another way if I reformat my query to use an INNER JOIN instead of an EXISTS clause like so

SELECT DISTINCT companies.id 
FROM companies 
INNER JOIN company_phrases ON company_phrases.company_id = companies.id
WHERE companies.is_active = 1
AND MATCH(company_phrases.value) AGAINST ('+second +phrase' IN BOOLEAN MODE) 
ORDER BY companies.id DESC;

The EXPLAIN plan for this query looks like:
+----+-------------+-----------------+------------+----------+-----------------------------+--------------+---------+-------------------------------------+------+----------+--------------------------------------------------------------------+
| id | select_type | table           | partitions | type     | possible_keys               | key          | key_len | ref                                 | rows | filtered | Extra                                                              |
+----+-------------+-----------------+------------+----------+-----------------------------+--------------+---------+-------------------------------------+------+----------+--------------------------------------------------------------------+
|  1 | SIMPLE      | company_phrases | NULL       | fulltext | company_id_idx,value_ft_idx | value_ft_idx | 0       | const                               |    1 |   100.00 | Using where; Ft_hints: no_ranking; Using temporary; Using filesort |
|  1 | SIMPLE      | companies       | NULL       | eq_ref   | PRIMARY,is_active_idx       | PRIMARY      | 4       | bug_test.company_phrases.company_id |    1 |   100.00 | Using where                                                        |
+----+-------------+-----------------+------------+----------+-----------------------------+--------------+---------+-------------------------------------+------+----------+--------------------------------------------------------------------+

And the correct results again look like:
+----+
| id |
+----+
|  8 |
|  7 |
|  5 |
|  4 |
|  2 |
|  1 |
+----+

However if I add an optimiser hint to this query to force it to use a backward index scan again (using STRAIGHT_JOIN rather than INNER_JOIN) I face the same issue as before. So my query is now:

SELECT DISTINCT companies.id 
FROM companies 
STRAIGHT_JOIN company_phrases ON company_phrases.company_id = companies.id
WHERE companies.is_active = 1
AND MATCH(company_phrases.value) AGAINST ('+second +phrase' IN BOOLEAN MODE) 
ORDER BY companies.id DESC;

The EXPLAIN plan for this query looks like:
+----+-------------+-----------------+------------+----------+-----------------------+---------------+---------+-------+------+----------+---------------------------------------------------+
| id | select_type | table           | partitions | type     | possible_keys         | key           | key_len | ref   | rows | filtered | Extra                                             |
+----+-------------+-----------------+------------+----------+-----------------------+---------------+---------+-------+------+----------+---------------------------------------------------+
|  1 | SIMPLE      | companies       | NULL       | ref      | PRIMARY,is_active_idx | is_active_idx | 1       | const |   10 |   100.00 | Backward index scan; Using index; Using temporary |
|  1 | SIMPLE      | company_phrases | NULL       | fulltext | company_id_idx        | value_ft_idx  | 0       | const |    1 |    10.00 | Using where; Ft_hints: no_ranking; Distinct       |
+----+-------------+-----------------+------------+----------+-----------------------+---------------+---------+-------+------+----------+---------------------------------------------------+

And my incorrect results are once again:
+----+
| id |
+----+
|  8 |
|  5 |
|  4 |
|  2 |
+----+
where records 1 and 7 are missing even though they should be returned.

I can also re-create the behavior a third way using a WHERE ID IN instead of an INNER JOIN or EXISTS clause.

Suggested fix:
Assuming the queries are equivalent I would expect the same results regardless of which execution plan was chosen but that doesn't seem to be the case with this scenario, some of my results seem to be missing if the "backwards index scan" plan is chosen.
[18 May 15:32] MySQL Verification Team
Hello Max Jenkins,

Thank you for the report and test case.
Verified as described.

regards,
Umesh