Bug #69679 | utf8mb4_general_ci considers from u+20000 characters as identical | ||
---|---|---|---|
Submitted: | 5 Jul 2013 17:23 | Modified: | 19 Jun 2014 5:57 |
Reporter: | Arnaud Adant | Email Updates: | |
Status: | Not a Bug | Impact on me: | |
Category: | MySQL Server: Charsets | Severity: | S3 (Non-critical) |
Version: | 5.5.32 | OS: | Any |
Assigned to: | CPU Architecture: | Any | |
Tags: | utf8mb4 charset collation |
[5 Jul 2013 17:23]
Arnaud Adant
[27 Feb 2014 22:51]
Thaddeus Billman
Thanks for this bug report! I couldn't figure why mysql was treating my 4-byte UTF8 characters as indentical (causing all sorts of problems with my db); now I know. Problem persists in version 5.6.16.
[28 Feb 2014 15:33]
Arnaud Adant
Hi Thaddeus, If you are an Oracle MySQL customer, please open a support request. Best regards, Arnaud
[25 May 2014 6:13]
MySQL Verification Team
Please use utf8mb4_unicode_520_ci instead of utf8mb4_general_ci. mysql > select count(distinct c collate utf8mb4_unicode_520_ci), count(distinct c collate utf8mb4_bin), count(*) from test\G *************************** 1. row *************************** count(distinct c collate utf8mb4_unicode_520_ci): 6 count(distinct c collate utf8mb4_bin): 6 count(*): 6 1 row in set (0.00 sec)
[25 May 2014 6:31]
MySQL Verification Team
See this bug for UCA(Unicode Colletion Algorithm). http://bugs.mysql.com/bug.php?id=9604 UCA is described in this page. http://www.unicode.org/reports/tr10/
[19 Jun 2014 5:57]
Erlend Dahl
[17 Jun 2014 4:44] Mithun C Y This is not a bug. Its already stated in the documentation that supplementary characters are all equal to each other in general collations. In the given test case characters are from the supplementary plane. Hence distinct c collate utf8mb4_general_ci will produce 1. Please find the reference link to manual. Link: http://dev.mysql.com/doc/refman/5.6/en/charset-unicode-sets.html