MySQL Bugs: #83549: String hashing gives wrong result for case-insensitive Unicode collations

Bug #83549	String hashing gives wrong result for case-insensitive Unicode collations
Submitted:	26 Oct 2016 13:31	Modified:	15 Dec 2016 17:35
Reporter:	Steinar Gunderson
Status:	Closed
Category:	MySQL Server: Charsets	Severity:	S3 (Non-critical)
Version:	8.0.1	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
In case-insensitive Unicode collations, all kinds of space (SPACE, NON-BREAKING SPACE, FIGURE SPACE, EM SPACE, etc.) should compare equal. However, they currently don't hash to the same value: we strip trailing spaces before hashing, but only 0x20 spaces, not the other kinds of space.

How to repeat:
The simplest way to reproduce is to look at the MEMORY engine, since it is dependent on hashing for equality:

mysql> CREATE TABLE t1 ( c VARCHAR(255) UNIQUE ) ENGINE=MEMORY COLLATE utf8mb4_0900_ai_ci;
Query OK, 0 rows affected (0,02 sec)

mysql> INSERT INTO t1 VALUES (' ');
Query OK, 1 row affected (0,00 sec)

mysql> INSERT INTO t1 VALUES (_utf16 0x00a0);
Query OK, 1 row affected (0,00 sec)

Compare with InnoDB, which doesn't rely on hashing:

mysql> CREATE TABLE t2 ( c VARCHAR(255) UNIQUE ) ENGINE=InnoDB COLLATE utf8mb4_0900_ai_ci;
Query OK, 0 rows affected (0,01 sec)

mysql> INSERT INTO t2 VALUES (' ');
Query OK, 1 row affected (0,00 sec)

mysql> INSERT INTO t2 VALUES (_utf16 0x00a0);
ERROR 1062 (23000): Duplicate entry ' ' for key 'c'

(To be clear, this is _not_ a defect in the MEMORY engine; it's just the simplest way to demonstrate the issue.)

Suggested fix:
strip_trailing_space() needs to strip all kinds of space, not just 0x20, from the end of the string. It's probably fine to strip all 0x20 quickly first, as we already do, but after that, it needs to parse Unicode characters from the back of the string (depending on the character set), look them up in the weight table and check if they have equal weight to space in the given collation (0x0209 for UCA 9.0.0). This is probably slow, but the common case is that the very first character we look at will be a non-space (and most likely, even a non-space ASCII), so we'll get an early exit.
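
To make the suggestion concrete, here is a rough sketch of the backward scan in Python rather than the server's own code. It assumes valid UTF-8 input, and the WEIGHTS dictionary is a hypothetical stand-in for the collation's weight table: it lists only the four space characters named in the description, all given the 0x0209 weight mentioned above. The real table and the real strip_trailing_space() signature will differ.

```python
# Hypothetical excerpt of a collation weight table: code point -> primary weight.
# 0x0209 is the space weight cited above for UCA 9.0.0; the code points listed
# here are illustrative only, not the full table.
SPACE_WEIGHT = 0x0209
WEIGHTS = {
    0x0020: SPACE_WEIGHT,  # SPACE
    0x00A0: SPACE_WEIGHT,  # NO-BREAK SPACE
    0x2003: SPACE_WEIGHT,  # EM SPACE
    0x2007: SPACE_WEIGHT,  # FIGURE SPACE
}


def strip_trailing_space(data: bytes) -> bytes:
    """Return data (assumed valid UTF-8) without trailing space-weight characters."""
    end = len(data)

    # Fast path: drop plain 0x20 bytes first, as is already done today.
    while end > 0 and data[end - 1] == 0x20:
        end -= 1

    # Slow path: decode one character at a time from the back of the string.
    while end > 0:
        # Walk back over UTF-8 continuation bytes (10xxxxxx) to the lead byte.
        start = end - 1
        while start > 0 and (data[start] & 0xC0) == 0x80:
            start -= 1
        code_point = ord(data[start:end].decode("utf-8"))
        if WEIGHTS.get(code_point) != SPACE_WEIGHT:
            break  # common case: the first character inspected is a non-space
        end = start

    return data[:end]
```

With this, a lone SPACE and a lone NO-BREAK SPACE both strip down to the empty string, so any hash computed over the stripped bytes would agree for the two values. For other character sets, the step that walks back to the start of the previous character would need to be replaced accordingly.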

Posted by developer:
 
Noted in 8.0.1 changelog.

For case-insensitive Unicode collations, the various space characters
did not hash to the same value, resulting in incorrect comparisons
between them.