Bug #76553 Sushi-Beer issue of MySQL with utf8mb4
Submitted: 1 Apr 2015 6:15 Modified: 19 Sep 2017 12:34
Reporter: Ryuta Kamizono Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Charsets Severity:S2 (Serious)
Version:5.6.23, 5.7.6, 5.6.24 OS:Any
Assigned to: CPU Architecture:Any

[1 Apr 2015 6:15] Ryuta Kamizono
Description:
utf8mb4 character set treats Sushi Emoji (U+1F363) and Beer Emoji
(U+1F37A) as same characters, when using utf8mb4_general_ci or
utf8mb4_unicode_ci. Because both collations are treating same weight
0xfffd for Emoji. This issue is not limited to Emoji, but possible to
all SMP characters.

To treat these Emoji as different characters, either utf8mb4_bin or
utf8mb4_unicode_520_ci should be used. However, utf8mb4_unicode_520_ci
has another issue, so called Haha-Papa issue means Mother-Father issue
in Japanese. "ハ" (U+30CF KATAKANA LETTER HA), "パ" (U+30D1 KATAKANA
LETTER PA), and "バ" (U+30D0 KATAKANA LETTER BA) can not be recognized
different characters.

==UCA 5.2.0==
30CF  ; [.2B2E.0020.0011.30CF] # KATAKANA LETTER HA
30D0  ; [.2B2E.0020.0011.30CF][.0000.0148.0002.3099] # KATAKANA LETTER BA; QQCM
30D1  ; [.2B2E.0020.0011.30CF][.0000.0149.0002.309A] # KATAKANA LETTER PA; QQCM
====

Related info:
http://blog.kamipo.net/entry/2015/03/23/093052
http://www.unicode.org/Public/UCA/5.2.0/allkeys.txt
http://en.wikipedia.org/wiki/Dakuten
http://en.wikipedia.org/wiki/Miscellaneous_Symbols_and_Pictographs

How to repeat:
root@localhost [sushi] > SET NAMES utf8mb4;
Query OK, 0 rows affected (0.00 sec)

root@localhost [sushi] > CREATE TABLE wishlist_general (
    ->   item varchar(1)
    -> ) COLLATE utf8mb4_general_ci;
Query OK, 0 rows affected (0.02 sec)

root@localhost [sushi] > CREATE TABLE wishlist_unicode (
    ->   item varchar(1)
    -> ) COLLATE utf8mb4_unicode_ci;
Query OK, 0 rows affected (0.01 sec)

root@localhost [sushi] > INSERT INTO wishlist_general VALUES ('

Suggested fix:
- utf8mb4_general_ci should use code point for SMP characters
- utf8mb4_unicode_ci should use weight calculation for SMP characters
- utf8mb4_japanese_ci is needed, based on utf8mb4_unicode_520_ci,
  which recognized difference characters, with Dakuten aka voicing mark,
  Handakuten aka semi-voicing mark, and characters without these.
  e.g.
    "ハ" (U+30CF) KATAKANA LETTER HA
    "バ" (U+30D0) KATAKANA LETTER BA, HA + Dakuten
    "パ" (U+30D1) KATAKANA LETTER PA, HA + Handakuten
[1 Apr 2015 6:28] Ryuta Kamizono
The posted contents after Emoji had been truncated...

Please refer to the link for how to reproduce:

https://gist.github.com/kamipo/37576ce436c564d8cc28
[22 Apr 2015 9:55] Ryuta Kamizono
What do you think about this issue?
[22 Apr 2015 11:17] Umesh Shastry
Hello Ryuta Kamizono,

Thank you for the bug report and test case.

Thanks,
Umesh
[20 Sep 2015 14:12] Daniël van Eeden
The query against the different tables. One row is expected.
utf8mb_bin = OK
utf8mb4_unicode_520_ci = OK
utf8mb4_unicode_ci = NOT OK
utf8mb4_general_ci = NOT OK

mysql [localhost] {msandbox} (sushibeer) > SELECT HEX(item), HEX(WEIGHT_STRING(item)) FROM wishlist_unicode_520 WHERE item = X'F09F8DA3';
+-----------+--------------------------+
| HEX(item) | HEX(WEIGHT_STRING(item)) |
+-----------+--------------------------+
| F09F8DA3  | FBC3F363                 |
+-----------+--------------------------+
1 row in set (0.00 sec)

mysql [localhost] {msandbox} (sushibeer) > SELECT HEX(item), HEX(WEIGHT_STRING(item)) FROM wishlist_unicode WHERE item = X'F09F8DA3';
+-----------+--------------------------+
| HEX(item) | HEX(WEIGHT_STRING(item)) |
+-----------+--------------------------+
| F09F8DA3  | FFFD                     |
| F09F8DBA  | FFFD                     |
+-----------+--------------------------+
2 rows in set (0.00 sec)

mysql [localhost] {msandbox} (sushibeer) > SELECT HEX(item), HEX(WEIGHT_STRING(item)) FROM wishlist_general WHERE item = X'F09F8DA3';
+-----------+--------------------------+
| HEX(item) | HEX(WEIGHT_STRING(item)) |
+-----------+--------------------------+
| F09F8DA3  | FFFD                     |
| F09F8DBA  | FFFD                     |
+-----------+--------------------------+
2 rows in set (0.00 sec)

mysql [localhost] {msandbox} (sushibeer) > SELECT HEX(item), HEX(WEIGHT_STRING(item)) FROM wishlist_bin WHERE item = X'F09F8DA3';
+-----------+--------------------------+
| HEX(item) | HEX(WEIGHT_STRING(item)) |
+-----------+--------------------------+
| F09F8DA3  | 01F363                   |
+-----------+--------------------------+
1 row in set (0.00 sec)
[15 Jan 2017 21:59] Manyi Lu
See also:
http://mysqlserverteam.com/sushi-beer-an-introduction-of-utf8-support-in-mysql-8-0/
[9 Sep 2017 20:47] Miguel Solorzano
https://bugs.mysql.com/bug.php?id=87700 marked as duplicate of this one.
[19 Sep 2017 12:34] Manyi Lu
Posted by developer:
 
New utf8mb4 collations in 8.0 solved the problems listed in this bug report.