Bug #113811 Get incorrect result when searching for consecutive uppercase letters
Submitted: 30 Jan 2024 23:42 Modified: 1 Feb 2024 3:34
Reporter: caiweiqiang cai Email Updates:
Status: Not a Bug Impact on me:
None 
Category:MySQL Server: FULLTEXT search Severity:S3 (Non-critical)
Version:8.0.27 5.7.28 OS:Any
Assigned to: CPU Architecture:Any

[30 Jan 2024 23:42] caiweiqiang cai
Description:
When a column is set to utf8mb4_bin, searching for consecutive uppercase letters reutrns nothing, but when set to utf8mb4_general_ci it works well.

How to repeat:
use config like this

[mysqld]
character-set-server=utf8mb4
collation-server=utf8mb4_bin
innodb_ft_enable_stopword=OFF
ft_min_word_len=1
innodb_ft_min_token_size=1
ngram_token_size=1

[mysql]
default-character-set=utf8mb4

prepare data

drop table if exists fulltext_test;
create table fulltext_test (
    content varchar(200)
);
insert into fulltext_test (content) values ('ABCDEFGHIJKLMNOPQRSTUVWXYZ');
insert into fulltext_test (content) values ('abcdefghijklmnopqrstuvwxyz');

test result

this sql return one row:
select a.content, match(a.content) against ('+abcd' in boolean mode) from fulltext_test a where match(a.content) against ('+abcd' in boolean mode);

but this sql return none:
select a.content, match(a.content) against ('+ABCD' in boolean mode) from fulltext_test a where match(a.content) against ('+ABCD' in boolean mode);

set column to utf8mb4_general_ci:
alter table fulltext_test modify content varchar(200) character set utf8mb4 collate utf8mb4_general_ci;

then both SQLs above return two rows.
[30 Jan 2024 23:43] caiweiqiang cai
Sorry, I loss one SQL to create fulltext index

alter table fulltext_test add fulltext content (content) with parser ngram;
[31 Jan 2024 10:40] MySQL Verification Team
Hi Mr. cai,

Thank you for your bug report.

However, this is not a bug.

When you use case sensitive collation, then you will find only rows that matches the constant fully, that means that 'a' would be different then 'A'.

However, when you use _ci collation, which means Case Insensitive, then 'a' will be equal to 'A'.

Hence, this is expected behaviour.

Not a bug.
[1 Feb 2024 3:34] caiweiqiang cai
Maybe you didn't get my point.

The main problem is that this SQL does not return any record when using utf8mb4_bin collation.

select a.content, match(a.content) against ('+ABCD' in boolean mode) from fulltext_test a where match(a.content) against ('+ABCD' in boolean mode);

This statement queries the rows containing the phrase "ABCD". I think it should return this row "ABCDEFGHIJKLMNOPQRSTUVWXYZ", but it doesn't.

It's stange behaviour.
[1 Feb 2024 11:05] MySQL Verification Team
Hi,

You are not using FTS correctly ......

Your query should be :

select a.content from fulltext_test a where match(a.content) against ('ABCD';

This should return correct result ...... 

Introducing MATCH in the SELECT list and using a single word in the boolean mode, both go against the SQL standard.
[1 Feb 2024 11:35] MySQL Verification Team
One more additional note.

Your 8.0 release is too old.

We ran your test case against 8.0.36 and got totally different results.

No rows in any possible variant. With BOOLEAN or no BOOLEAN.

This ia actually correct behaviour with all the settings that you used.