Bug #101311 utf8mb4_unicode_ci issue
Submitted: 26 Oct 2020 6:15 Modified: 27 Oct 2020 13:39
Reporter: kim kyoung youn Email Updates:
Status: Won't fix Impact on me:
None 
Category:MySQL Server Severity:S3 (Non-critical)
Version:8.0.22 OS:Any
Assigned to: MySQL Verification Team CPU Architecture:Any
Tags: #collation #characterset

[26 Oct 2020 6:15] kim kyoung youn
Description:
We have confirmed that there is a problem with the collation process of utf8mb4_unicode_ci.
In general, we have seen that MariaDB manages the values of empty space('') and char(0) differently.
After that, as a result of performing the character set/collation change work, in utf8mb4_unicode_ci, the above acronyms were duplicated.
It was confirmed that a duplicate entry'' for key'PRIMARY' error occurred.
Please refer to the test results below.

How to repeat:
mysql> select @@version;
+-----------+
| @@version |
+-----------+
| 8.0.22    |
+-----------+

mysql> create database test;
Query OK, 1 row affected (0.01 sec)

mysql> use test;
Database changed
mysql> 
mysql> 
mysql> 
mysql> create table t1( c1 varchar(100) character set utf8 not null, nm varchar(100) , primary key(c1)) engine=innodb character set utf8;
Query OK, 0 rows affected, 2 warnings (0.01 sec)

mysql> insert into t1 values('','');
Query OK, 1 row affected (0.01 sec)

mysql> insert into t1 values(char(0),'');
Query OK, 1 row affected (0.01 sec)

mysql> select * from t1 where c1='';
+----+------+
| c1 | nm   |
+----+------+
|    |      |
+----+------+
1 row in set (0.00 sec)

mysql> select * from t1 where c1=char(0);
+----+------+
| c1 | nm   |
+----+------+
|    |      |
+----+------+
1 row in set (0.00 sec)

mysql> alter table t1 character set utf8mb4;
Query OK, 0 rows affected (0.00 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> alter table t1 modify c1 varchar(100) character set utf8mb4 collate utf8mb4_unicode_ci not null, modify nm varchar(100) character set utf8mb4 collate utf8mb4_unicode_ci ;
ERROR 1062 (23000): Duplicate entry '' for key 't1.PRIMARY'
mysql> alter table t1 modify c1 varchar(100) character set utf8mb4 collate utf8mb4_general_ci not null, modify nm varchar(100) character set utf8mb4 collate utf8mb4_general_ci ;
Query OK, 2 rows affected (0.03 sec)
Records: 2  Duplicates: 0  Warnings: 0

mysql> ^^;;;;;;;;;;;;;;;;;;;;;

Suggested fix:
Please fix characterset issue  in utf8mb4_unicode_ci.
[26 Oct 2020 12:36] MySQL Verification Team
Hi Mr. youn,

You are using a product that we do not make nor do we support it.

Can you please try your test case with our server.
[26 Oct 2020 14:06] MySQL Verification Team
Hi,

Thanks for the report. While you reported the bug using wrong RDBMS, we too have this issue.

all best
Bogdan
[26 Oct 2020 19:17] Perlover Perlovovich
I found strange collate problem after upgrade mysql-cluster 8.0.21 -> 8.0.22 (ubuntu 18.04)

The priblem was described here:

https://github.com/sidorares/node-mysql2/pull/1235

I cannot reproduce through mysql cli but i can reproduce through mysql2 node package through API

May be this problem was related with this bug
[26 Oct 2020 19:23] Perlover Perlovovich
If to be brief i do through mysql2 API a following statements:

DROP TABLE IF EXISTS asciitest
CREATE TABLE IF NOT EXISTS asciitest (a char (64) ascii not null, b int unsigned not null) character set utf8;
INSERT INTO asciitest (a, b) VALUES ('test', 1);
INSERT INTO asciitest (a, b) VALUES ('super', 2);
SELECT a FROM ${table} WHERE a='test' AND b=2;

And i got an error:

Illegal mix of collations (latin1_swedish_ci,IMPLICIT) and (utf8mb4_unicode_ci,COERCIBLE) for operation '='

It should not happen because ascii field should be coercion to unicode characters before a matching through repertoire mechanism i think (https://dev.mysql.com/doc/refman/8.0/en/charset-repertoire.html)
[27 Oct 2020 13:07] Bernt Marius Johnsen
char(0) is an "ignorable" and should not be taken into consideration when two strings are compared.

 MySQL> select char(0) = '' collate utf8mb4_general_ci;
+-----------------------------------------+
| char(0) = '' collate utf8mb4_general_ci |
+-----------------------------------------+
|                                       0 |
+-----------------------------------------+
1 row in set (0.0009 sec)

 MySQL> select concat(char(0),'a') = 'a' collate utf8mb4_unicode_ci;
+------------------------------------------------------+
| concat(char(0),'a') = 'a' collate utf8mb4_unicode_ci |
+------------------------------------------------------+
|                                                    1 |
+------------------------------------------------------+
1 row in set (0.0009 sec)

 MySQL> select char(0) = '' collate utf8mb4_0900_ai_ci;
+-----------------------------------------+
| char(0) = '' collate utf8mb4_0900_ai_ci |
+-----------------------------------------+
|                                       1 |
+-----------------------------------------+
1 row in set (0.0008 sec) 

So this is a bug in utf8mb4_general_ci. utf8mb4_unicode_ci behaves correctly (and so does utf8mb4_0900_ai_ci).

utf8mb4_unicode_ci was implemented as a more correct alternative to utf8mb4_general_ci which has several flaws.
But it does not correct all the flaws, so we recommend users to use utf8mb4_0900_ai_ci which is believed to be quite correct.
[27 Oct 2020 13:39] Norvald Ryeng
Hi,

Thanks for the report! As mentioned by others, this is a bug in utf8mb4_general_ci.

Unfortunately, we can't fix bugs in collations. The reason is that indexes stored on disk are sorted according to collations, so any change we make to a collation will invalidate all indexes using that collation. If we discover a bug in a collation, the only viable solution is to create a new collation that corrects that bug, and then users have to change the collation of their columns and indexes to the new one, if the bug affects them. In this case, the correct and bug free collation already exists: utf8mb4_0900_ai_ci.

Therefore, I'm closing this as "won't fix". The utf8mb4_general_ci collation is known to have bugs, but we can't really do anything about it except ask people not to use it.

In general, we recommend all users to use utf8mb4_0900_ai_ci or another utf8mb4_0900_* collation.