Bug #81575 WL#9108: O with horn and tilde
Submitted: 24 May 2016 13:55 Modified: 16 Jun 2016 5:56
Reporter: Bernt Marius Johnsen
Status: Closed
Category: MySQL Server: Charsets
Severity: S3 (Non-critical)
Version: 8.0
OS: Any
Assigned to:
CPU Architecture: Any

[24 May 2016 13:55] Bernt Marius Johnsen
Description:
This bug is found in the following collations:

utf8mb4_et_800_ci_ai Estonian
utf8mb4_is_800_ci_ai Icelandic
utf8mb4_pl_800_ci_ai Polish
utf8mb4_ro_800_ci_ai Romanian
utf8mb4_sk_800_ci_ai Slovakian
utf8mb4_sl_800_ci_ai Slovenian
utf8mb4_sv_800_ci_ai Swedish
utf8mb4_vi_800_ci_ai Vietnamese

To illustrate the bug we chose Estonian (utf8mb4_et_800_ci_ai),
where it only affects the letters

U+1EE0;LATIN CAPITAL LETTER O WITH HORN AND TILDE
U+1EE1;LATIN SMALL LETTER O WITH HORN AND TILDE

(Hence the title of the bug report.)

Estonian has the rule (among others)

X<0x00F5<<<0x00D5 (Hex codes for O WITH TILDE in small and capital).

Given the following entries in UnicodeData.txt:

00D5;LATIN CAPITAL LETTER O WITH TILDE;Lu;0;L;004F 0303;;;;N;LATIN CAPITAL LETTER O TILDE;;;00F5;
01A0;LATIN CAPITAL LETTER O WITH HORN;Lu;0;L;004F 031B;;;;N;LATIN CAPITAL LETTER O HORN;;;01A1;
0303;COMBINING TILDE;Mn;230;NSM;;;;;N;NON-SPACING TILDE;;;;
031B;COMBINING HORN;Mn;216;NSM;;;;;N;NON-SPACING HORN;;;;
1EE0;LATIN CAPITAL LETTER O WITH HORN AND TILDE;Lu;0;L;01A0 0303;;;;N;;;;1EE1;

We find that

U+1EE0 may be (recursively) decomposed to U+004F U+031B U+0303
and
U+00D5 may be decomposed to U+004F U+0303

Since the combining characters U+0303 and U+031B have different
canonical combining classes (230 and 216, respectively),
U+004F U+031B U+0303 is canonically equivalent to U+004F U+0303 U+031B,
which in turn may be composed to U+00D5 U+031B
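
The equivalence above can be checked outside the server. The following is a minimal sketch using Python's standard unicodedata module (not MySQL code); the helper name code_points is only illustrative. It normalizes both spellings to NFD and compares them.

```python
import unicodedata

O_HORN_TILDE = "\u1EE0"  # LATIN CAPITAL LETTER O WITH HORN AND TILDE
O_TILDE = "\u00D5"       # LATIN CAPITAL LETTER O WITH TILDE
TILDE = "\u0303"         # COMBINING TILDE
HORN = "\u031B"          # COMBINING HORN


def code_points(s):
    """Render a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in s)


# Full canonical decomposition of the precomposed letter.
nfd_precomposed = unicodedata.normalize("NFD", O_HORN_TILDE)
# O WITH TILDE followed by a combining horn, also fully decomposed.
nfd_tilde_then_horn = unicodedata.normalize("NFD", O_TILDE + HORN)

print(unicodedata.combining(HORN), unicodedata.combining(TILDE))  # 216 230
print(code_points(nfd_precomposed))      # U+004F U+031B U+0303
print(code_points(nfd_tilde_then_horn))  # U+004F U+031B U+0303
print(nfd_precomposed == nfd_tilde_then_horn)  # True
```

Because the two marks have different non-zero combining classes, canonical ordering puts the horn (216) before the tilde (230) in both cases, so the two spellings normalize to the same sequence.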

We see that "O WITH HORN AND TILDE" must be interpreted as an "O WITH
TILDE" with an additional horn. In a *_ci_ai (case- and accent-insensitive)
collation the additional accent must be ignored, and therefore

0x1EE0 should collate as 0x00D5 and not as 0x004F ('O').
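
The expectation above can be modelled with a small sketch. This is a toy model in Python, not the MySQL implementation: primary_key and tilde_unblocked are hypothetical helpers, and the only tailoring modelled is "O WITH TILDE is its own letter". All other accents and case are ignored. The tilde still counts when another mark with a lower combining class (such as the horn) sits between it and the base letter.

```python
import unicodedata

TILDE = "\u0303"  # COMBINING TILDE


def tilde_unblocked(marks):
    """True if a tilde occurs in marks with no blocking mark before it.

    A mark blocks the tilde if its combining class is >= the tilde's (230).
    """
    tilde_ccc = unicodedata.combining(TILDE)
    for mark in marks:
        if mark == TILDE:
            return True
        if unicodedata.combining(mark) >= tilde_ccc:
            return False
    return False


def primary_key(text):
    """Toy case- and accent-insensitive key (hypothetical model)."""
    nfd = unicodedata.normalize("NFD", text)
    key = []
    i = 0
    while i < len(nfd):
        base = nfd[i]
        # Collect the combining marks attached to this base character.
        j = i + 1
        marks = []
        while j < len(nfd) and unicodedata.combining(nfd[j]) != 0:
            marks.append(nfd[j])
            j += 1
        letter = base.lower()
        # Tailoring: o + tilde is a separate letter, even with a horn between.
        if letter == "o" and tilde_unblocked(marks):
            letter = "\u00f5"
        key.append(letter)  # all other marks are dropped (accent-insensitive)
        i = j
    return "".join(key)


print(primary_key("\u1EE0") == primary_key("\u00D5"))  # True
print(primary_key("\u1EE0") == primary_key("O"))       # False
```

In this model U+1EE0 and U+00D5 get the same key, while plain O and O WITH HORN (U+01A0) both map to "o".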

The following is therefore wrong:

mysql> select convert(_utf16 0x1ee0 using utf8mb4) = convert(_utf16 0x00D5 using utf8mb4) collate utf8mb4_et_800_ci_ai;
+----------------------------------------------------------------------------------------------------------+
| convert(_utf16 0x1ee0 using utf8mb4) = convert(_utf16 0x00D5 using utf8mb4) collate utf8mb4_et_800_ci_ai |
+----------------------------------------------------------------------------------------------------------+
|                                                                                                        0 |
+----------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

How to repeat:
select convert(_utf16 0x1ee0 using utf8mb4) = convert(_utf16 0x00D5 using utf8mb4) collate utf8mb4_et_800_ci_ai;
[18 Jun 2016 21:38] Omer Barnir
Posted by developer:
 
Reported version value updated to reflect release name change from 5.8 to 8.0
[23 Nov 2016 14:41] Paul DuBois
Fixed in 8.0.0.

Bug affects no released version. No changelog entry needed.