Bug #69679 utf8mb4_general_ci considers from u+20000 characters as identical
Submitted: 5 Jul 2013 17:23 Modified: 19 Jun 2014 5:57
Reporter: Arnaud Adant Email Updates:
Status: Not a Bug Impact on me:
None 
Category:MySQL Server: Charsets Severity:S3 (Non-critical)
Version:5.5.32 OS:Any
Assigned to: CPU Architecture:Any
Tags: utf8mb4 charset collation

[5 Jul 2013 17:23] Arnaud Adant
Description:
Chinese characters from U+20000 (CJK extension B http://graphemica.com/blocks/cjk-unified-ideographs-extension-b) are considered identical by the default utf8mb4 collation, utf8mb4_general_ci.

mysql> insert into test(c) select  convert(unhex('F0A08080') using utf8mb4);
Query OK, 1 row affected (0.03 sec)
Records: 1  Duplicates: 0  Warnings: 0

mysql> insert into test(c) select  convert(unhex('F0A08081') using utf8mb4);
Query OK, 1 row affected (0.04 sec)
Records: 1  Duplicates: 0  Warnings: 0

mysql> insert into test(c) select  convert(unhex('F0A08082') using utf8mb4);
Query OK, 1 row affected (0.04 sec)
Records: 1  Duplicates: 0  Warnings: 0

mysql> insert into test(c) select  convert(unhex('F0A08083') using utf8mb4);
Query OK, 1 row affected (0.04 sec)
Records: 1  Duplicates: 0  Warnings: 0

mysql> insert into test(c) select  convert(unhex('F0A08084') using utf8mb4);
Query OK, 1 row affected (0.05 sec)
Records: 1  Duplicates: 0  Warnings: 0

mysql> insert into test(c) select  convert(unhex('F0ABA09D') using utf8mb4);
Query OK, 1 row affected (0.04 sec)
Records: 1  Duplicates: 0  Warnings: 0

mysql>
mysql> set names utf8mb4;
Query OK, 0 rows affected (0.00 sec)

mysql> show variables like 'version';
+---------------+--------+
| Variable_name | Value  |
+---------------+--------+
| version       | 5.5.32 |
+---------------+--------+
1 row in set (0.00 sec)

mysql> select count(distinct c collate utf8mb4_general_ci), count(distinct c collate utf8mb4_bin), count(*) from test\G
*************************** 1. row ***************************
count(distinct c collate utf8mb4_general_ci): 1    <======= should be 6
       count(distinct c collate utf8mb4_bin): 6
                                    count(*): 6
1 row in set (0.00 sec)

How to repeat:
drop table if exists test;
create table test(c varchar(1)) engine=InnoDB charset=utf8mb4;

insert into test(c) select  convert(unhex('F0A08080') using utf8mb4);
insert into test(c) select  convert(unhex('F0A08081') using utf8mb4);
insert into test(c) select  convert(unhex('F0A08082') using utf8mb4);
insert into test(c) select  convert(unhex('F0A08083') using utf8mb4);
insert into test(c) select  convert(unhex('F0A08084') using utf8mb4);
insert into test(c) select  convert(unhex('F0ABA09D') using utf8mb4);

set names utf8mb4;
show variables like 'version';
select count(distinct c collate utf8mb4_general_ci), count(distinct c collate utf8mb4_bin), count(*) from test\G

Suggested fix:
Fix the bug. count(distinct c collate utf8mb4_general_ci) should 6 in this case
because the 6 characters are different.
[27 Feb 2014 22:51] Thaddeus Billman
Thanks for this bug report!  I couldn't figure why mysql was treating my 4-byte UTF8 characters as indentical (causing all sorts of problems with my db); now I know.  

Problem persists in version 5.6.16.
[28 Feb 2014 15:33] Arnaud Adant
Hi Thaddeus,

If you are an Oracle MySQL customer, please open a support request.

Best regards,

Arnaud
[25 May 2014 6:13] Meiji Kimura
Please use utf8mb4_unicode_520_ci instead of utf8mb4_general_ci.

mysql > select count(distinct c collate utf8mb4_unicode_520_ci), count(distinct c collate utf8mb4_bin), count(*) from test\G
*************************** 1. row ***************************
count(distinct c collate utf8mb4_unicode_520_ci): 6
           count(distinct c collate utf8mb4_bin): 6
                                        count(*): 6
1 row in set (0.00 sec)
[25 May 2014 6:31] Meiji Kimura
See this bug for UCA(Unicode Colletion Algorithm).

http://bugs.mysql.com/bug.php?id=9604

UCA is described in this page.

http://www.unicode.org/reports/tr10/
[19 Jun 2014 5:57] Erlend Dahl
[17 Jun 2014 4:44] Mithun C Y

This is not a bug. Its already stated in the documentation that supplementary characters are all equal to each other in general collations. In the given test case characters are from the supplementary plane. Hence distinct c collate utf8mb4_general_ci will produce 1.

Please find the reference link to manual.
Link: http://dev.mysql.com/doc/refman/5.6/en/charset-unicode-sets.html