Bug #108274 | utf8mb4_polish_ci collation ignore some characters on string compare | ||
---|---|---|---|
Submitted: | 24 Aug 2022 21:01 | Modified: | 25 Aug 2022 19:19 |
Reporter: | Ian Kovalski | Email Updates: | |
Status: | Not a Bug | Impact on me: | |
Category: | MySQL Server | Severity: | S3 (Non-critical) |
Version: | 5.7.34 | OS: | Any |
Assigned to: | CPU Architecture: | Any |
[24 Aug 2022 21:01]
Ian Kovalski
[25 Aug 2022 11:30]
MySQL Verification Team
Hi Mr. Kovalski, Thank you for your bug report. You have chosen case insensitive collation for that column. There is a possibility that in UTFmb4, Polish language is represented by multi-byte characters. In that case, the result might be correct. If you know more about Polish Unicode grapheme cluster scalars, then you can tell what is case sensitive sorting like in Polish Unicode page. We have repeated your test case on the latest 8.0.30 and the results are the same as in 5.7. On our side, we shall have to consult our Unicode expert, which might take some time. We are waiting on your feedback.
[25 Aug 2022 14:44]
Ian Kovalski
There are multibyte characters in Polish language, but I don't see how it relates to the issue. My test case uses valid UTF-8 with only single byte characters. Byte 0 always represent U+0000, and should be always distinct from anything else.
[25 Aug 2022 14:48]
Bernt Marius Johnsen
ASCII 0 is ignorable in utf8mb4_polish_ci so it SHOULD be ignored. Not a bug
[25 Aug 2022 19:19]
Ian Kovalski
For future reference, same applies to all ASCII control codes, and some other codepoints https://unicode.org/reports/tr10/#Ignorables_Defn https://www.unicode.org/Public/UCA/latest/allkeys.txt Thanks for clearing this up.
[26 Aug 2022 11:39]
MySQL Verification Team
Thank you, very much, Bernt Johnsen.