MySQL Bugs: #87700: Strict comparisons of UTF8MB4 Values Fail and resolve to the same 'placeholder'

Bug #87700	Strict comparisons of UTF8MB4 Values Fail and resolve to the same 'placeholder'
Submitted:	8 Sep 2017 5:17	Modified:	9 Sep 2017 20:47
Reporter:	Shawn Grigson	Email Updates:
Status:	Duplicate	Impact on me:	None
Category:	MySQL Server: Charsets	Severity:	S3 (Non-critical)
Version:	5.6.37	OS:	MacOS ((Running off of Docker, latest 5.6 jessie build))
Assigned to:		CPU Architecture:	Any

Description:
Attempting to do a match between two strings with emojis using the equality operator (or even a strictly equal LIKE statement without wildcards) will fail.

What seems to be happening is that all of the 4th-byte UTF8MB4 characters are resolving to some sort of 'placeholder' behind the scenes. So every emoji essentially is being matched against as the same character.

But the rest of us know that 🍕 != 🍔.

Interestingly, doing a LIKE comparison (involving wildcards) and a REGEXP works properly. There's something afoot with the strict equality comparison logic.

No amount of COLLATE statements, or `SET NAMES utf8mb4 COLLATE utf8mb4_general_ci;` being executed will fix this. This is something wrong in MySQL.

How to repeat:
Running the following sql:

select '🍕 test' = '🍔 test' as comparison1, '🍕 test' = '🌶 test' AS comparison2, '🍕 test' = 'X test' AS comparison3;

Returns a result of 1, 1, 0.

Note that:

select '🍕 test' REGEXP '^(🍔 test)$' as result1, '🍔 test' REGEXP '^(🍔 test)$' as result2;

returns 0, 1 as we would expect.

Suggested fix:
🍕, 🍔, 🌶, ⎋, and all the other emojis and special characters should be treated by strict equality comparisons as distinct characters and not resolve all of them to the same character when doing matching.

After further investigation, when I ran `SET NAMES utf8mb4 COLLATE utf8mb4_unicode_520_ci;` prior to the strict equality statement, it worked.

This came up as part of a reference thread suggesting that collation for comparing against emojis.

The problem is that a lot of libraries (mysqljs come to my) simply hardcode in the collation as the default for the charset you specify. (so utf8mb4_general_ci in that case).

I would still say that this seems more like a bug than anything that should be 'expected behavior'.

Similar issue is known as "Sushi-Beer issue"
 https://bugs.mysql.com/bug.php?id=76553

Also, in MySQL Server Blog
 http://mysqlserverteam.com/sushi-beer-an-introduction-of-utf8-support-in-mysql-8-0/

Thank you for the bug report. Duplicate of https://bugs.mysql.com/bug.php?id=76553.