MySQL Bugs: #70815: SELECT DISTINCT hashes Unicode space characters incorrectly

Bug #70815	SELECT DISTINCT hashes Unicode space characters incorrectly
Submitted:	5 Nov 2013 3:43	Modified:	21 May 2014 19:00
Reporter:	Arthur O'Dwyer	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: Charsets	Severity:	S2 (Serious)
Version:	5.5.31, 5.5.34	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
Strings which end with trailing spaces that aren't U+0020 --- for example, "U+00A0 NO-BREAK SPACE" --- compare equal with the comparison operators < = >, but hash into different buckets during a SELECT DISTINCT or COUNT DISTINCT.

This may also have ramifications for HASH indexes (e.g., maybe you could make a UNIQUE index that skipped some of the rows in the table and/or allowed duplicate rows to appear as long as they fell in different buckets), but I wasn't able to come up with any exploit in that area.

How to repeat:
create table t (v text collate utf8_unicode_ci);
insert into t values (_binary 0x41C2A0), (_binary 0x41);
select v from t where v=_binary 0x41; -- the rows are equal!
+------+
| A    |
| A    |
+------+

select distinct v from t; -- no, the rows are distinct!
+------+
| A    |
| A    |
+------+

alter table t add unique index k1(v(10));  -- no, the rows are equal!
ERROR 1062 (23000): Duplicate entry '' for key 'k1'

Suggested fix:
There are at least two places in the code where MySQL checks for trailing spaces and forgets about spaces other than U+0020. The relevant one here is:

MY_CHARSET_HANDLER my_charset_utf8_handler=
{
[...]
    my_lengthsp_8bit,   // BUG

But there's also this one:

static size_t
my_lengthsp_mb2(CHARSET_INFO *cs __attribute__((unused)),
                const char *ptr, size_t length)
{
  const char *end= ptr + length;
  while (end > ptr + 1 && end[-1] == ' ' && end[-2] == '\0')   // BUG

related: http://bugs.mysql.com/bug.php?id=69869

Thank you for the report.

Verified as described. Workaround: use CHAR:

mysql> create table t (v char(100) collate utf8_unicode_ci);
Query OK, 0 rows affected (0.33 sec)

mysql> insert into t values (_binary 0x41C2A0), (_binary 0x41);
Query OK, 2 rows affected (0.19 sec)
Records: 2  Duplicates: 0  Warnings: 0

mysql> select concat(v, 'xxx'), hex(v) from t where v=_binary 0x41;
+------------------+--------+
| concat(v, 'xxx') | hex(v) |
+------------------+--------+
| A xxx            | 41C2A0 |
| Axxx             | 41     |
+------------------+--------+
2 rows in set (0.00 sec)

mysql> select v from t where v=_binary 0x41;
+------+
| v    |
+------+
| A    |
| A    |
+------+
2 rows in set (0.00 sec)

mysql> select distinct v from t;
+------+
| v    |
+------+
| A    |
+------+
1 row in set (0.08 sec)

mysql> alter table t add unique index k1(v(10));
ERROR 1062 (23000): Duplicate entry 'A' for key 'k1'

Actually correct SELECT ... WHERE query should be: 

mysql> select v from t where v=convert(_binary 0x41  using UTF8) collate utf8_unicode_ci;
+------+
| v    |
+------+
| A    |
| A    |
+------+
2 rows in set (0.00 sec)

But result is still same.

Queries with a "GROUP BY" are also affected by this bug; we can get two distinct groups in the output corresponding to equivalent inputs.