Description:
In case-insensitive Unicode collations, all kinds of space (SPACE, NON-BREAKING SPACE, FIGURE SPACE, EM SPACE, etc.) should compare equal, and therefore should also hash equal. Currently they do not hash equal: we strip spaces from the right before hashing, but only 0x20 spaces, not the other kinds of space.
How to repeat:
The simplest way to reproduce is to use the MEMORY engine, since it depends on hashing for equality. The second insert below should fail with a duplicate-key error, but succeeds:
mysql> CREATE TABLE t1 ( c VARCHAR(255) UNIQUE ) ENGINE=MEMORY COLLATE utf8mb4_0900_ai_ci;
Query OK, 0 rows affected (0,02 sec)
mysql> INSERT INTO t1 VALUES (' ');
Query OK, 1 row affected (0,00 sec)
mysql> INSERT INTO t1 VALUES (_utf16 0x00a0);
Query OK, 1 row affected (0,00 sec)
Compare with InnoDB, which doesn't rely on hashing:
mysql> CREATE TABLE t2 ( c VARCHAR(255) UNIQUE ) ENGINE=InnoDB COLLATE utf8mb4_0900_ai_ci;
Query OK, 0 rows affected (0,01 sec)
mysql> INSERT INTO t2 VALUES (' ');
Query OK, 1 row affected (0,00 sec)
mysql> INSERT INTO t2 VALUES (_utf16 0x00a0);
ERROR 1062 (23000): Duplicate entry ' ' for key 'c'
(To be clear, this is _not_ a defect in the MEMORY engine; it's just the simplest way to demonstrate the issue.)
Suggested fix:
strip_trailing_space() needs to strip all kinds of space, not just 0x20, from the end of the string. It's probably fine to strip all 0x20 bytes quickly first, as we already do. After that, it needs to parse Unicode characters from the back of the string (how depends on the character set), look each one up in the weight table, and check whether its weight equals that of space in the given collation (0x0209 for UCA 9.0.0). The per-character lookup is probably slow, but in the common case the very first character we look at will be a non-space (and most likely a non-space ASCII character), so we'll get an early exit.
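A minimal sketch of the approach described above, assuming UTF-8 input. It is not the server's actual code: the names strip_trailing_space_sketch(), decode_last_utf8() and has_space_weight() are hypothetical, and has_space_weight() uses a small hard-coded list of code points as a stand-in for a real lookup in the collation's weight table. The decoder does not reject overlong encodings.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical stand-in for the collation weight lookup: returns true when
// the code point is treated as having the same weight as SPACE. A real
// implementation would consult the weight table (0x0209 for UCA 9.0.0)
// rather than this hard-coded list.
static bool has_space_weight(char32_t cp) {
  switch (cp) {
    case 0x0020:  // SPACE
    case 0x00A0:  // NO-BREAK SPACE
    case 0x2003:  // EM SPACE
    case 0x2007:  // FIGURE SPACE
      return true;
    default:
      return false;
  }
}

// Decodes the UTF-8 character that ends at s[end - 1]. On success, stores the
// code point in *cp and returns its length in bytes; returns 0 if the bytes
// before `end` do not form a well-shaped UTF-8 sequence.
static size_t decode_last_utf8(std::string_view s, size_t end, char32_t *cp) {
  if (end == 0) return 0;
  size_t start = end - 1;
  // Walk back over at most three continuation bytes (10xxxxxx).
  while (start > 0 && end - start < 4 &&
         (static_cast<unsigned char>(s[start]) & 0xC0) == 0x80) {
    --start;
  }
  const unsigned char lead = static_cast<unsigned char>(s[start]);
  const size_t len = end - start;
  char32_t value;
  if (lead < 0x80 && len == 1) {
    value = lead;
  } else if ((lead & 0xE0) == 0xC0 && len == 2) {
    value = lead & 0x1F;
  } else if ((lead & 0xF0) == 0xE0 && len == 3) {
    value = lead & 0x0F;
  } else if ((lead & 0xF8) == 0xF0 && len == 4) {
    value = lead & 0x07;
  } else {
    return 0;  // Lead byte and sequence length do not match.
  }
  for (size_t i = start + 1; i < end; ++i) {
    value = (value << 6) | (static_cast<unsigned char>(s[i]) & 0x3F);
  }
  *cp = value;
  return len;
}

// Returns the length of s in bytes after removing every trailing character
// that has space weight.
size_t strip_trailing_space_sketch(std::string_view s) {
  size_t end = s.size();
  // Fast path: strip plain 0x20 bytes, as is already done today.
  while (end > 0 && s[end - 1] == ' ') --end;
  // Slow path: decode one character at a time from the back. In the common
  // case the first character is a non-space and the loop exits immediately.
  for (;;) {
    char32_t cp;
    const size_t len = decode_last_utf8(s, end, &cp);
    if (len == 0 || !has_space_weight(cp)) break;
    end -= len;
  }
  return end;
}
```

With this sketch, a value consisting of "abc" followed by SPACE, NO-BREAK SPACE and EM SPACE would be hashed over its first three bytes only, the same as plain "abc".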