Bug #118813 utf8mb4_0900_ai_ci check different string mark as duplicate
Submitted: 12 Aug 3:20 Modified: 21 Aug 12:49
Reporter: peng gao Email Updates:
Status: Not a Bug Impact on me:
None 
Category:MySQL Server: DML Severity:S3 (Non-critical)
Version:8.0.41, 8.0.43, 8.4.6 OS:Any
Assigned to: CPU Architecture:Any

[12 Aug 3:20] peng gao
Description:
Hi team:
    
    collation when use utf8mb4_0900_ai_ci, two sting "Ⅰ" and "I"  check dup.
    when dup check,"Ⅰ" length is 3 , "I" length is 1,as follows
    
    cmp_data (mtype=12, prtype=16711695, is_asc=true, data1=0x7fff5c1fef1a "Ⅰ", len1=3, data2=0x7fffcda3807f "I", len2=1)
   
    Is this normal?  thanks

How to repeat:
create table mytest(a varchar(20),unique key(a));

insert into mytest values(0x49);
insert into mytest values(0xE285A0);
ERROR 1062 (23000): Duplicate entry 'Ⅰ' for key 'testi.a'
[12 Aug 3:22] peng gao
mysql> show create table mytest \G
*************************** 1. row ***************************
       Table: mytest
Create Table: CREATE TABLE `mytest` (
  `a` varchar(20) DEFAULT NULL,
  UNIQUE KEY `a` (`a`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
[12 Aug 5:00] MySQL Verification Team
Hello peng gao,

Thank you for the report and feedback.

regards,
Umesh
[21 Aug 12:49] Bernt Marius Johnsen
Posted by developer:
 
The letter 'Ⅰ' is "U+2160 ROMAN NUMERAL ONE" and defined to be equal to 'I' (U+0049 LATIN CAPITAL LETTER I) in the Unicode standard.

Similarly, 'Ⅱ' ("U+2161 ROMAN NUMERAL TOW") is defined to be equal to 'II':

mysql> select 'Ⅰ' = 'I';
+-------------+
| 'Ⅰ' = 'I'   |
+-------------+
|           1 |
+-------------+
1 row in set (0.000 sec)

mysql> select 'Ⅱ' = 'II';
+--------------+
| 'Ⅱ' = 'II'   |
+--------------+
|            1 |
+--------------+
1 row in set (0.000 sec)

So this is correct behavior according to the Unicode standard, and not a bug.