Bug #120681 ngram FULLTEXT parser fails to match certain Latin tokens (e.g. "Sarah") in any query mode
Submitted: 14 Jun 5:43
Reporter: Michael Guo Email Updates:
Status: Open Impact on me:
None 
Category:MySQL Server: FULLTEXT search Severity:S2 (Serious)
Version:9.7 OS:Debian (Debian GNU/Linux 12 (bookworm))
Assigned to: CPU Architecture:x86 ((86_64)
Tags: fulltext innodb, Ingram

[14 Jun 5:43] Michael Guo
Description:
A FULLTEXT index built with the ngram parser fails to return rows for certain
ordinary Latin search terms, even though the indexed text plainly contains the
term. The same data indexed with the **default** parser matches correctly, so
this is specific to the ngram parser, not the data or the query.

The failure is independent of:

- **query mode** — fails in both BOOLEAN MODE and NATURAL LANGUAGE MODE;
- **query formulation** — `AGAINST('Sarah')`, `AGAINST('Sarah*')`, phrase
  `AGAINST('"sarah"')`, and per-bigram `AGAINST('+sa +ar +ra +ah')` all return
  nothing;
- **table state** — reproduces on a freshly created table and a single freshly
  inserted row, and persists after `OPTIMIZE TABLE` (full rebuild). Not a
  cache/desync issue.

Affected tokens observed: `Sarah`, `Sara`, `Karah`, `Adam` (return nothing).
Unaffected tokens (same setup): `Grace`, `John`, `Bob`, `Otto`, `Anna`, `Mark`
(match correctly). `ngram_token_size=2`, `innodb_ft_min_token_size=3` (defaults).
The exact trigger condition is not fully characterized; the minimal testcase
below isolates one failing vs one passing token.

How to repeat:
```sql
-- ngram-parser table
CREATE TABLE t_ngram (
  id INT PRIMARY KEY AUTO_INCREMENT,
  name VARCHAR(100),
  FULLTEXT (name) WITH PARSER ngram
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

-- identical data, default parser, for comparison
CREATE TABLE t_default (
  id INT PRIMARY KEY AUTO_INCREMENT,
  name VARCHAR(100),
  FULLTEXT (name)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

INSERT INTO t_ngram   (name) VALUES ('Sarah'), ('Grace');
INSERT INTO t_default (name) VALUES ('Sarah'), ('Grace');

-- ngram parser:
SELECT name FROM t_ngram   WHERE MATCH(name) AGAINST('Sarah' IN BOOLEAN MODE); -- returns 0 rows  (BUG)
SELECT name FROM t_ngram   WHERE MATCH(name) AGAINST('Grace' IN BOOLEAN MODE); -- returns 'Grace' (OK)

-- default parser, same data:
SELECT name FROM t_default WHERE MATCH(name) AGAINST('Sarah' IN BOOLEAN MODE); -- returns 'Sarah' (OK)

-- also fails in natural language mode:
SELECT name FROM t_ngram   WHERE MATCH(name) AGAINST('Sarah');                 -- returns 0 rows  (BUG)

-- the row indisputably exists:
SELECT name FROM t_ngram   WHERE name = 'Sarah';                               -- returns 'Sarah'
```

## Expected results

`MATCH(name) AGAINST('Sarah' ...)` on `t_ngram` returns the `Sarah` row (as the
default-parser table does).

## Actual results

`t_ngram` returns 0 rows for `Sarah` in every query mode, while `Grace` (same
table) matches and the default parser matches `Sarah`.