Bug #108274 utf8mb4_polish_ci collation ignore some characters on string compare
Submitted: 24 Aug 2022 21:01 Modified: 25 Aug 2022 19:19
Reporter: Ian Kovalski Email Updates:
Status: Not a Bug Impact on me:
None 
Category:MySQL Server Severity:S3 (Non-critical)
Version:5.7.34 OS:Any
Assigned to: CPU Architecture:Any

[24 Aug 2022 21:01] Ian Kovalski
Description:
I have the following test table:
CREATE TABLE `TmpTest` (
  `Id` int(11) NOT NULL AUTO_INCREMENT,
  `TekstBin` varchar(45) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL,
  `TekstUtf8PL` varchar(45) COLLATE utf8mb4_polish_ci NOT NULL,
  PRIMARY KEY (`Id`)
) ENGINE=InnoDB AUTO_INCREMENT=8 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_polish_ci;

with value "test" inserted into both text columns.
SELECT * FROM TmpTest WHERE TekstUtf8PL = 'te\0st';
this statement returns 1 record, so \0 appears to be ignored when comparing strings with utf8mb4_polish_ci. On utf8mb4_bin this works fine and returns no results. I also tried 0x01 byte instead of \0 with the same behavior. Is this a bug? I expected comparing strings for equality would not have holes like this. If this should be handled by application's input sanitization, then documentation should say so, but I couldn't find anything about this.

How to repeat:
CREATE TABLE `TmpTest` (
  `Id` int(11) NOT NULL AUTO_INCREMENT,
  `TekstBin` varchar(45) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL,
  `TekstUtf8PL` varchar(45) COLLATE utf8mb4_polish_ci NOT NULL,
  PRIMARY KEY (`Id`)
) ENGINE=InnoDB AUTO_INCREMENT=8 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_polish_ci;

INSERT INTO TmpTest(TekstBin, TekstUtf8PL) VALUES ('test', 'test');
SELECT * FROM TmpTest WHERE TekstBin = 'te\0st';
SELECT * FROM TmpTest WHERE TekstUtf8PL = 'te\0st';
[25 Aug 2022 11:30] MySQL Verification Team
Hi Mr. Kovalski,

Thank you for your bug report.

You have chosen case insensitive collation for that column. There is a possibility that in UTFmb4, Polish language is represented by multi-byte characters. In that case, the result might be correct. 

If you know more about Polish Unicode grapheme cluster scalars, then you can tell what is case sensitive sorting like in Polish Unicode page.

We have repeated your test case on the latest 8.0.30 and the results are the same as in 5.7.

On our side, we shall have to consult our Unicode expert, which might take some time.

We are waiting on your feedback.
[25 Aug 2022 14:44] Ian Kovalski
There are multibyte characters in Polish language, but I don't see how it relates to the issue. My test case uses valid UTF-8 with only single byte characters. Byte 0 always represent U+0000, and should be always distinct from anything else.
[25 Aug 2022 14:48] Bernt Marius Johnsen
ASCII 0 is ignorable in utf8mb4_polish_ci so it SHOULD be ignored. Not a bug
[25 Aug 2022 19:19] Ian Kovalski
For future reference, same applies to all ASCII control codes, and some other codepoints
https://unicode.org/reports/tr10/#Ignorables_Defn
https://www.unicode.org/Public/UCA/latest/allkeys.txt
Thanks for clearing this up.
[26 Aug 2022 11:39] MySQL Verification Team
Thank you, very much, Bernt Johnsen.