MySQL Bugs: #118238: Ngram parser boolean phrase search fails with CJK punctuation

Bug #118238	Ngram parser boolean phrase search fails with CJK punctuation
Submitted:	21 May 10:46	Modified:	2 Jun 11:27
Reporter:	jianghong sheng	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: FULLTEXT search	Severity:	S3 (Non-critical)
Version:	9.3.0,8.0.40, 8.0.42	OS:	Any
Assigned to:		CPU Architecture:	Any
Tags:	boolean mode, CJK, comma, full text search, NGRAM, phrase search, punctuation

Description:
When using the ngram parser with a FULLTEXT index on a utf8mb4 column in MySQL 8.0.30, a boolean mode phrase search (e.g., AGAINST ('"你，好"' IN BOOLEAN MODE)) fails to return results if a CJK punctuation mark (like the Chinese comma '，') is included within the quoted phrase but is not the first character of the phrase.

The ngram parser appears to correctly tokenize the CJK punctuation as part of the ngrams, as evidenced by INFORMATION_SCHEMA.INNODB_FT_INDEX_CACHE (when innodb_ft_aux_table is set).
A boolean mode search without the phrase quotes (e.g., AGAINST ('你，好' IN BOOLEAN MODE)) correctly returns the row. This indicates the issue is specific to how boolean phrase searches handle ngrams containing such punctuation when the punctuation is not at the absolute beginning of the search phrase.

This behavior is similar to the issue described in Bug #116696 , which noted inconsistencies with Chinese punctuation in ngram boolean mode searches. In that bug, it was also observed that natural language mode searches worked as expected. The fix for Bug #85876, which changed how standard English commas and periods are tokenized (making them delimiters), does not seem to apply to CJK commas, which are still tokenized as part of ngrams.

How to repeat:
-- 0. Ensure ngram_token_size is 1.
--    SHOW VARIABLES LIKE 'ngram_token_size'; -- Must 1

-- 1. Create database and table
CREATE DATABASE IF NOT EXISTS t;
USE t;
SET NAMES utf8mb4;

CREATE TABLE `ft` (
  `id` int unsigned NOT NULL AUTO_INCREMENT,
  `text` text CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci,
  PRIMARY KEY (`id`),
  FULLTEXT KEY `ftx_index_text` (`text`) WITH PARSER `ngram`
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

-- 2. Insert test data
INSERT INTO `ft` (`id`, `text`) VALUES (NULL, '你，好吗');

-- 3. Check ngram tokenization
SET GLOBAL innodb_ft_aux_table="t/ft";
SELECT * FROM INFORMATION_SCHEMA.INNODB_FT_INDEX_CACHE;
+------+--------------+-------------+-----------+--------+----------+
| WORD | FIRST_DOC_ID | LAST_DOC_ID | DOC_COUNT | DOC_ID | POSITION |
+------+--------------+-------------+-----------+--------+----------+
| ，   |            2 |           2 |         1 |      2 |        3 |
| 你   |            2 |           2 |         1 |      2 |        0 |
| 吗   |            2 |           2 |         1 |      2 |        9 |
| 好   |            2 |           2 |         1 |      2 |        6 |
+------+--------------+-------------+-----------+--------+----------+

-- 4. Perform searches
-- This search works (boolean mode, no phrase)
SELECT * FROM ft WHERE MATCH (text) AGAINST ('你，好' IN BOOLEAN MODE);
-- Expected: Returns 1 row ('你，好吗')

-- This search fails (boolean mode, with phrase)
SELECT * FROM ft WHERE MATCH (text) AGAINST ('"你，好"' IN BOOLEAN MODE);
-- Expected: Should return 1 row ('你，好吗')
-- Actual: Returns Empty set

Hello jianghong sheng,

Thank you for the report and test case.

regards,
Umesh