Description:
This bug is found in the following collations:
utf8mb4_et_800_ci_ai Estonian
utf8mb4_is_800_ci_ai Icelandic
utf8mb4_pl_800_ci_ai Polish
utf8mb4_ro_800_ci_ai Romanian
utf8mb4_sk_800_ci_ai Slovakian
utf8mb4_sl_800_ci_ai Slovenian
utf8mb4_sv_800_ci_ai Swedish
utf8mb4_vi_800_ci_ai Vietnamese
To illustrate the bug we chose Estonian
(utf8mb4_et_800_ci_ai), where it only affects the letters
U+1EE0;LATIN CAPITAL LETTER O WITH HORN AND TILDE
U+1EE1;LATIN SMALL LETTER O WITH HORN AND TILDE
(hence the title of this bug report).
Estonian has the rule (among others)
X<0x00F5<<<0x00D5 (Hex codes for O WITH TILDE in small and capital).
Given the following entries in UnicodeData.txt:
00D5;LATIN CAPITAL LETTER O WITH TILDE;Lu;0;L;004F 0303;;;;N;LATIN CAPITAL LETTER O TILDE;;;00F5;
01A0;LATIN CAPITAL LETTER O WITH HORN;Lu;0;L;004F 031B;;;;N;LATIN CAPITAL LETTER O HORN;;;01A1;
0303;COMBINING TILDE;Mn;230;NSM;;;;;N;NON-SPACING TILDE;;;;
031B;COMBINING HORN;Mn;216;NSM;;;;;N;NON-SPACING HORN;;;;
1EE0;LATIN CAPITAL LETTER O WITH HORN AND TILDE;Lu;0;L;01A0 0303;;;;N;;;;1EE1;
We find that
U+1EE0 may be (recursively) decomposed to U+004F U+031B U+0303
and
U+00D5 may be decomposed to U+004F U+0303
Since the combining marks U+0303 and U+031B have different
canonical combining classes (230 and 216 respectively),
U+004F U+031B U+0303 is canonically equivalent to U+004F U+0303 U+031B, which in turn may be composed to
U+00D5 U+031B
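The decomposition argument above can be checked outside MySQL. The following is a minimal sketch using Python's standard unicodedata module; the helper name codepoints is ours and only serves to print the code points.

```python
import unicodedata

def codepoints(s):
    """Render a string as a list of U+XXXX code points (helper for display only)."""
    return " ".join(f"U+{ord(ch):04X}" for ch in s)

o_horn_tilde = "\u1EE0"   # LATIN CAPITAL LETTER O WITH HORN AND TILDE
o_tilde = "\u00D5"        # LATIN CAPITAL LETTER O WITH TILDE

# Full canonical decomposition (NFD) of both letters.
nfd_horn_tilde = unicodedata.normalize("NFD", o_horn_tilde)
nfd_tilde = unicodedata.normalize("NFD", o_tilde)
print(codepoints(nfd_horn_tilde))   # U+004F U+031B U+0303
print(codepoints(nfd_tilde))        # U+004F U+0303

# Canonical combining classes of the two marks (tilde, horn).
ccc_tilde = unicodedata.combining("\u0303")
ccc_horn = unicodedata.combining("\u031B")
print(ccc_tilde, ccc_horn)          # 230 216

# Because the classes differ, the marks may be reordered, so
# "O WITH TILDE" followed by a combining horn is canonically
# equivalent to U+1EE0.
tilde_plus_horn = o_tilde + "\u031B"
equivalent = unicodedata.normalize("NFD", tilde_plus_horn) == nfd_horn_tilde
print(equivalent)                   # True
```

The last comparison shows that U+00D5 U+031B and U+1EE0 normalize to the same sequence, which is the equivalence the argument relies on.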
We see that "O WITH HORN AND TILDE" must be interpreted as an "O WITH
TILDE" with an additional horn. In a *_ci_ai collation this additional
accent must be ignored, and therefore
0x1EE0 should collate as 0x00D5 and not as 0x004F ('O').
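The expected behaviour can be illustrated with a toy model. The sketch below is not MySQL's implementation; TAILORED and toy_primary_key are hypothetical names, and the model assumes only a single tailored letter (O plus combining tilde treated as its own letter, as in the Estonian rule) with all other accents and case ignored.

```python
import unicodedata

# Hypothetical toy tailoring: O followed by COMBINING TILDE is its own letter.
TAILORED = {("O", "\u0303"): "\u00D5"}

def toy_primary_key(text):
    """Toy case- and accent-insensitive key; an assumption-laden sketch only."""
    nfd = unicodedata.normalize("NFD", text.upper())
    key = []
    i = 0
    while i < len(nfd):
        base = nfd[i]
        i += 1
        # Collect the combining marks that follow this base character.
        marks = []
        while i < len(nfd) and unicodedata.combining(nfd[i]):
            marks.append(nfd[i])
            i += 1
        letter = base
        highest_skipped = 0
        for mark in marks:
            ccc = unicodedata.combining(mark)
            # A tailored mark counts only if no skipped mark has an equal
            # or higher combining class (i.e. it is not blocked).
            if (base, mark) in TAILORED and ccc > highest_skipped:
                letter = TAILORED[(base, mark)]
                break
            highest_skipped = max(highest_skipped, ccc)
        key.append(letter)   # all remaining marks are ignored (accent-insensitive)
    return key

print(toy_primary_key("\u1EE0") == toy_primary_key("\u00D5"))  # True
print(toy_primary_key("\u1EE0") == toy_primary_key("O"))       # False
```

In this model the horn (class 216) does not block the tilde (class 230), so U+1EE0 and U+1EE1 receive the same key as U+00D5, while a plain O WITH HORN (U+01A0) still keys as 'O'.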
The following is therefore wrong:
mysql> select convert(_utf16 0x1ee0 using utf8mb4) = convert(_utf16 0x00D5 using utf8mb4) collate utf8mb4_et_800_ci_ai;
+----------------------------------------------------------------------------------------------------------+
| convert(_utf16 0x1ee0 using utf8mb4) = convert(_utf16 0x00D5 using utf8mb4) collate utf8mb4_et_800_ci_ai |
+----------------------------------------------------------------------------------------------------------+
| 0 |
+----------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
How to repeat:
select convert(_utf16 0x1ee0 using utf8mb4) = convert(_utf16 0x00D5 using utf8mb4) collate utf8mb4_et_800_ci_ai;