MySQL Bugs: #69679: utf8mb4_general_ci considers from u+20000 characters as identical

Bug #69679	utf8mb4_general_ci considers from u+20000 characters as identical
Submitted:	5 Jul 2013 17:23	Modified:	19 Jun 2014 5:57
Reporter:	Arnaud Adant	Email Updates:
Status:	Not a Bug	Impact on me:	None
Category:	MySQL Server: Charsets	Severity:	S3 (Non-critical)
Version:	5.5.32	OS:	Any
Assigned to:		CPU Architecture:	Any
Tags:	utf8mb4 charset collation

Description:
Chinese characters from U+20000 (CJK extension B http://graphemica.com/blocks/cjk-unified-ideographs-extension-b) are considered identical by the default utf8mb4 collation, utf8mb4_general_ci.

mysql> insert into test(c) select  convert(unhex('F0A08080') using utf8mb4);
Query OK, 1 row affected (0.03 sec)
Records: 1  Duplicates: 0  Warnings: 0

mysql> insert into test(c) select  convert(unhex('F0A08081') using utf8mb4);
Query OK, 1 row affected (0.04 sec)
Records: 1  Duplicates: 0  Warnings: 0

mysql> insert into test(c) select  convert(unhex('F0A08082') using utf8mb4);
Query OK, 1 row affected (0.04 sec)
Records: 1  Duplicates: 0  Warnings: 0

mysql> insert into test(c) select  convert(unhex('F0A08083') using utf8mb4);
Query OK, 1 row affected (0.04 sec)
Records: 1  Duplicates: 0  Warnings: 0

mysql> insert into test(c) select  convert(unhex('F0A08084') using utf8mb4);
Query OK, 1 row affected (0.05 sec)
Records: 1  Duplicates: 0  Warnings: 0

mysql> insert into test(c) select  convert(unhex('F0ABA09D') using utf8mb4);
Query OK, 1 row affected (0.04 sec)
Records: 1  Duplicates: 0  Warnings: 0

mysql>
mysql> set names utf8mb4;
Query OK, 0 rows affected (0.00 sec)

mysql> show variables like 'version';
+---------------+--------+
| Variable_name | Value  |
+---------------+--------+
| version       | 5.5.32 |
+---------------+--------+
1 row in set (0.00 sec)

mysql> select count(distinct c collate utf8mb4_general_ci), count(distinct c collate utf8mb4_bin), count(*) from test\G
*************************** 1. row ***************************
count(distinct c collate utf8mb4_general_ci): 1    <======= should be 6
       count(distinct c collate utf8mb4_bin): 6
                                    count(*): 6
1 row in set (0.00 sec)

How to repeat:
drop table if exists test;
create table test(c varchar(1)) engine=InnoDB charset=utf8mb4;

insert into test(c) select  convert(unhex('F0A08080') using utf8mb4);
insert into test(c) select  convert(unhex('F0A08081') using utf8mb4);
insert into test(c) select  convert(unhex('F0A08082') using utf8mb4);
insert into test(c) select  convert(unhex('F0A08083') using utf8mb4);
insert into test(c) select  convert(unhex('F0A08084') using utf8mb4);
insert into test(c) select  convert(unhex('F0ABA09D') using utf8mb4);

set names utf8mb4;
show variables like 'version';
select count(distinct c collate utf8mb4_general_ci), count(distinct c collate utf8mb4_bin), count(*) from test\G

Suggested fix:
Fix the bug. count(distinct c collate utf8mb4_general_ci) should 6 in this case
because the 6 characters are different.

Thanks for this bug report!  I couldn't figure why mysql was treating my 4-byte UTF8 characters as indentical (causing all sorts of problems with my db); now I know.  

Problem persists in version 5.6.16.

Hi Thaddeus,

If you are an Oracle MySQL customer, please open a support request.

Best regards,

Arnaud

Please use utf8mb4_unicode_520_ci instead of utf8mb4_general_ci.

mysql > select count(distinct c collate utf8mb4_unicode_520_ci), count(distinct c collate utf8mb4_bin), count(*) from test\G
*************************** 1. row ***************************
count(distinct c collate utf8mb4_unicode_520_ci): 6
           count(distinct c collate utf8mb4_bin): 6
                                        count(*): 6
1 row in set (0.00 sec)

See this bug for UCA(Unicode Colletion Algorithm).

http://bugs.mysql.com/bug.php?id=9604

UCA is described in this page.

http://www.unicode.org/reports/tr10/

[17 Jun 2014 4:44] Mithun C Y

This is not a bug. Its already stated in the documentation that supplementary characters are all equal to each other in general collations. In the given test case characters are from the supplementary plane. Hence distinct c collate utf8mb4_general_ci will produce 1.

Please find the reference link to manual.
Link: http://dev.mysql.com/doc/refman/5.6/en/charset-unicode-sets.html