MySQL Bugs: #108274: utf8mb4_polish_ci collation ignore some characters on string compare

Bug #108274	utf8mb4_polish_ci collation ignore some characters on string compare
Submitted:	24 Aug 2022 21:01	Modified:	25 Aug 2022 19:19
Reporter:	Ian Kovalski	Email Updates:
Status:	Not a Bug	Impact on me:	None
Category:	MySQL Server	Severity:	S3 (Non-critical)
Version:	5.7.34	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
I have the following test table:
CREATE TABLE `TmpTest` (
  `Id` int(11) NOT NULL AUTO_INCREMENT,
  `TekstBin` varchar(45) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL,
  `TekstUtf8PL` varchar(45) COLLATE utf8mb4_polish_ci NOT NULL,
  PRIMARY KEY (`Id`)
) ENGINE=InnoDB AUTO_INCREMENT=8 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_polish_ci;

with value "test" inserted into both text columns.
SELECT * FROM TmpTest WHERE TekstUtf8PL = 'te\0st';
this statement returns 1 record, so \0 appears to be ignored when comparing strings with utf8mb4_polish_ci. On utf8mb4_bin this works fine and returns no results. I also tried 0x01 byte instead of \0 with the same behavior. Is this a bug? I expected comparing strings for equality would not have holes like this. If this should be handled by application's input sanitization, then documentation should say so, but I couldn't find anything about this.

How to repeat:
CREATE TABLE `TmpTest` (
  `Id` int(11) NOT NULL AUTO_INCREMENT,
  `TekstBin` varchar(45) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL,
  `TekstUtf8PL` varchar(45) COLLATE utf8mb4_polish_ci NOT NULL,
  PRIMARY KEY (`Id`)
) ENGINE=InnoDB AUTO_INCREMENT=8 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_polish_ci;

INSERT INTO TmpTest(TekstBin, TekstUtf8PL) VALUES ('test', 'test');
SELECT * FROM TmpTest WHERE TekstBin = 'te\0st';
SELECT * FROM TmpTest WHERE TekstUtf8PL = 'te\0st';

Hi Mr. Kovalski,

Thank you for your bug report.

You have chosen case insensitive collation for that column. There is a possibility that in UTFmb4, Polish language is represented by multi-byte characters. In that case, the result might be correct. 

If you know more about Polish Unicode grapheme cluster scalars, then you can tell what is case sensitive sorting like in Polish Unicode page.

We have repeated your test case on the latest 8.0.30 and the results are the same as in 5.7.

On our side, we shall have to consult our Unicode expert, which might take some time.

We are waiting on your feedback.

There are multibyte characters in Polish language, but I don't see how it relates to the issue. My test case uses valid UTF-8 with only single byte characters. Byte 0 always represent U+0000, and should be always distinct from anything else.

ASCII 0 is ignorable in utf8mb4_polish_ci so it SHOULD be ignored. Not a bug

For future reference, same applies to all ASCII control codes, and some other codepoints
https://unicode.org/reports/tr10/#Ignorables_Defn
https://www.unicode.org/Public/UCA/latest/allkeys.txt
Thanks for clearing this up.

Thank you, very much, Bernt Johnsen.