Bug #111828 Collation utf8mb4_0900_ai_ci doesn't contain the default contraction in DUCET
Submitted: 20 Jul 2023 12:20 Modified: 20 Jul 2023 14:08
Reporter: Yang Keao
Status: Verified
Category: MySQL Server: Charsets Severity: S4 (Feature request)
Version: 8.1 OS: Any
Assigned to: CPU Architecture: Any

[20 Jul 2023 12:20] Yang Keao
Description:
Collation `utf8mb4_0900_ai_ci` declares that it's based on "UCA 9.0.0 weight keys". However, the default behavior is not compatible with UCA and DUCET.

The DUCET contains several contractions (e.g. for some Thai characters), and some of these contractions are not included in any of the supported language-specific collations.

For example, `select _utf8mb4 x'E1A6B5E1A681' > _utf8mb4 x'E1A6B6E1A680';` and `select 'ᦵᦁ' >= 'ᦺᦀ'` both return 0, but they should return 1 under UCA 9.0.0. For reference, MariaDB 10.10, which supports UCA 14.0.0, returns 1 for both of them (with `collate uca1400_ai_ci`).

(You can search for '19B5 1981' and '19B6 1980' in https://www.unicode.org/Public/UCA/9.0.0/allkeys.txt)
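
To connect the hex literals in the query with the code point sequences to search for in allkeys.txt, here is a minimal Python sketch. The helper name `code_points` is made up for this illustration; it only decodes the UTF-8 bytes and prints each code point in the four-digit hex form used by allkeys.txt.

```python
# Decode the hex literals from the query into Unicode code points.
# code_points() is a hypothetical helper, not part of MySQL or Unicode tooling.
def code_points(hex_literal):
    text = bytes.fromhex(hex_literal).decode("utf-8")
    # allkeys.txt writes code points as upper-case hex, at least 4 digits
    return [f"{ord(ch):04X}" for ch in text]

left = code_points("E1A6B5E1A681")
right = code_points("E1A6B6E1A680")
print(" ".join(left))   # 19B5 1981
print(" ".join(right))  # 19B6 1980
```

Each printed sequence can be searched for as-is at the start of a line in allkeys.txt.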

How to repeat:
Run `select _utf8mb4 x'E1A6B5E1A681' > _utf8mb4 x'E1A6B6E1A680' collate utf8mb4_0900_ai_ci;`, and it returns 0.

Or run `select 'ᦵᦁ' >= 'ᦺᦀ'`, which also returns 0.

Suggested fix:
Support the contractions defined in the DUCET.
[20 Jul 2023 13:56] MySQL Verification Team
Hi Mr. Keao,

Thank you for your bug report.

However, we do not consider it a bug.

But, it would make a very nice feature request.

If you agree, we shall verify it as a feature request and send it to our Development team.

Waiting on your feedback.
[20 Jul 2023 13:56] MySQL Verification Team
Entering the correct version.
[20 Jul 2023 14:02] Yang Keao
> If you agree, we shall verify it as a feature request and send it to our Development team.

Good. Thank you! I have modified this submission to S4 (Feature request).
[20 Jul 2023 14:08] MySQL Verification Team
Hi Mr. Keao,

This is now a verified feature request.

It was copied into our internal bugs database, so that the Dev team in charge can consider it.

Verified.
[21 Jul 2023 9:28] Bernt Marius Johnsen
Posted by developer:
 
"For non-language-specific collations, characters in contraction sequences are treated as separate characters. For language-specific collations, contractions might change character sorting order." (https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-sets.html)
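
The difference between treating the characters separately and honoring a contraction can be sketched with a toy model in Python. This is not MySQL's implementation and does not use real DUCET weights: code point values stand in for weights, and the contraction table assumes that the DUCET entries weight a prefix vowel plus consonant as consonant first, then vowel. The names `CONTRACTIONS` and `sort_key` are hypothetical.

```python
# Toy model only: code point values stand in for collation weights.
# Real DUCET weights are multi-level and differ from these numbers.

# Hypothetical contraction table. Assumption: the DUCET entry for a
# prefix vowel + consonant pair weights the consonant before the vowel.
CONTRACTIONS = {
    "\u19b5\u1981": [0x1981, 0x19B5],
    "\u19b6\u1980": [0x1980, 0x19B6],
}

def sort_key(text, use_contractions):
    key = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if use_contractions and pair in CONTRACTIONS:
            # The two characters are weighted as one unit.
            key.extend(CONTRACTIONS[pair])
            i += 2
        else:
            # Each character is weighted on its own.
            key.append(ord(text[i]))
            i += 1
    return key

a = "\u19b5\u1981"  # x'E1A6B5E1A681'
b = "\u19b6\u1980"  # x'E1A6B6E1A680'
separate = sort_key(a, False) > sort_key(b, False)
contracted = sort_key(a, True) > sort_key(b, True)
print(separate)    # False
print(contracted)  # True
```

Under these assumptions, comparing the characters separately gives the 0 reported for `utf8mb4_0900_ai_ci`, while applying the contractions gives the 1 that the reporter expects.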
[21 Jul 2023 12:22] MySQL Verification Team
Thank you, Bernt.