Bug #87700 Strict comparisons of UTF8MB4 Values Fail and resolve to the same 'placeholder'
Submitted: 8 Sep 2017 5:17 Modified: 9 Sep 2017 20:47
Reporter: Shawn Grigson Email Updates:
Status: Duplicate Impact on me:
None 
Category:MySQL Server: Charsets Severity:S3 (Non-critical)
Version:5.6.37 OS:MacOS ((Running off of Docker, latest 5.6 jessie build))
Assigned to: CPU Architecture:Any

[8 Sep 2017 5:17] Shawn Grigson
Description:
Attempting to do a match between two strings with emojis using the equality operator (or even a strictly equal LIKE statement without wildcards) will fail.

What seems to be happening is that all of the 4th-byte UTF8MB4 characters are resolving to some sort of 'placeholder' behind the scenes. So every emoji essentially is being matched against as the same character.

But the rest of us know that 🍕 != 🍔.

Interestingly, doing a LIKE comparison (involving wildcards) and a REGEXP works properly. There's something afoot with the strict equality comparison logic.

No amount of COLLATE statements, or `SET NAMES utf8mb4 COLLATE utf8mb4_general_ci;` being executed will fix this. This is something wrong in MySQL.

How to repeat:
Running the following sql:

select '🍕 test' = '🍔 test' as comparison1, '🍕 test' = '🌶 test' AS comparison2, '🍕 test' = 'X test' AS comparison3;

Returns a result of 1, 1, 0.

Note that:

select '🍕 test' REGEXP '^(🍔 test)$' as result1, '🍔 test' REGEXP '^(🍔 test)$' as result2;

returns 0, 1 as we would expect.

Suggested fix:
🍕, 🍔, 🌶, ⎋, and all the other emojis and special characters should be treated by strict equality comparisons as distinct characters and not resolve all of them to the same character when doing matching.
[8 Sep 2017 5:36] Shawn Grigson
After further investigation, when I ran `SET NAMES utf8mb4 COLLATE utf8mb4_unicode_520_ci;` prior to the strict equality statement, it worked.

This came up as part of a reference thread suggesting that collation for comparing against emojis.

The problem is that a lot of libraries (mysqljs come to my) simply hardcode in the collation as the default for the charset you specify. (so utf8mb4_general_ci in that case).

I would still say that this seems more like a bug than anything that should be 'expected behavior'.
[8 Sep 2017 9:04] Ryusuke Kajiyama
Similar issue is known as "Sushi-Beer issue"
 https://bugs.mysql.com/bug.php?id=76553

Also, in MySQL Server Blog
 http://mysqlserverteam.com/sushi-beer-an-introduction-of-utf8-support-in-mysql-8-0/
[9 Sep 2017 20:47] MySQL Verification Team
Thank you for the bug report. Duplicate of https://bugs.mysql.com/bug.php?id=76553.