Bug #92899 Incorrect uniqueness determination with equality symbols
Submitted: 23 Oct 2018 13:26 Modified: 25 Oct 2018 14:33
Reporter: Huw Williams Email Updates:
Status: Not a Bug Impact on me:
None 
Category:MySQL Server: DML Severity:S3 (Non-critical)
Version:5.7/8.0 OS:Ubuntu
Assigned to: CPU Architecture:Any
Tags: uniqueness

[23 Oct 2018 13:26] Huw Williams
Description:
The symbol "=" (equal sign) and the symbol "≠" (not equal sign) are incorrectly considered to be the same character when determining uniqueness.

How to repeat:
1. Create a table with a varchar column.
2. Enforce uniqueness on that column.
3. Insert the character "=" into the table.
4. Insert the character "≠" into the table.

Step four fails, with a message that the symbol already exists in the table.
[23 Oct 2018 20:38] Miguel Solorzano
Thank you for the bug report. Repeatable on Linux 5.7 and 8.0 on Windows only 8.0.
[24 Oct 2018 4:22] Shane Bester
imho we need to see the precise character set and collation in use here.
collation is what is used to compare characters afterall...
[24 Oct 2018 4:23] Shane Bester
since the default charset and collation changed you're testing different things on 5.7 vs 8.0 then.
[24 Oct 2018 7:13] tsubasa tanaka
This caused UCA.
At least, I can reproduce it since MySQL 5.0.96 with utf8_unicode_ci.

mysql50> SELECT @@version;
+------------+
| @@version  |
+------------+
| 5.0.96-log |
+------------+
1 row in set (0.01 sec)

mysql50> SELECT '=' = '≠' COLLATE utf8_general_ci;
+-------------------------------------+
| '=' = '≠' COLLATE utf8_general_ci   |
+-------------------------------------+
|                                   0 |
+-------------------------------------+
1 row in set (0.00 sec)

mysql50> SELECT '=' = '≠' COLLATE utf8_unicode_ci;
+-------------------------------------+
| '=' = '≠' COLLATE utf8_unicode_ci   |
+-------------------------------------+
|                                   1 |
+-------------------------------------+
1 row in set (0.00 sec)

From Japanese User Group.
[24 Oct 2018 7:18] tsubasa tanaka
Reference to UserGroup's research "MySQL 5.5.11 unicode_ci" at 2011/04.

http://tmtm.org/tmp/mysql_unicode_collation.html
[24 Oct 2018 11:53] Peter Laursen
Tsubasa's test case reproduces for me on Windows with MySQL 5.7.24 (what is current current according to https://dev.mysql.com/downloads/mysql/)

SELECT '=' = '≠' COLLATE utf8_general_ci; -- 0
SELECT '=' = '≠' COLLATE utf8_unicode_ci; -- 1
SELECT VERSION(); -- 5.7.24-log  

-- Peter
-- not a MySQL/Oracle person
[25 Oct 2018 10:15] Xing Zhang
Posted by developer:
 
Hi Miguel,

What collation you are using to compare the characters for these versions?

I guess that you are comparing the character '=' (U+003D) and the character U+2260 (which I decoded from the hex code in the error message, \xe2\x89\xa0). The collation utf8_general_ci actually compares their code points, so U+003D != U+2260. But the collation utf8_unicode_ci compares them in the UCA way. These two characters have same primary weight, so that utf8_unicode_ci treats them equal, and you got the 'duplicate key' error.

Xing
[25 Oct 2018 11:10] Umesh Shastry
Thank you Xing, for the details. It makes sense now. 

HI Huw,

Please see detailed explanation from Xing in his previous note, it looks like !bg to me.  If you think your issue is different then may I request you to please provide full repeatable test case along with charset/collation details from your environment? Feel free to add it to this bug and change the status back to 'Open'.

Thank you for your interest in MySQL.

regards,
Umesh
[25 Oct 2018 14:33] Huw Williams
Fascinating — thank you for the investigation.  The observed behavior does indeed seem to be consistent with the relevant Unicode collation algorithm.

(That Unicode assigns equivalency to two distinct characters with directly opposite meanings seems like a problem, but one that can be pursued elsewhere.)