Bug #70815 SELECT DISTINCT hashes Unicode space characters incorrectly
Submitted: 5 Nov 2013 3:43 Modified: 21 May 2014 19:00
Reporter: Arthur O'Dwyer Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: Charsets Severity:S2 (Serious)
Version:5.5.31, 5.5.34 OS:Any
Assigned to: CPU Architecture:Any

[5 Nov 2013 3:43] Arthur O'Dwyer
Description:
Strings which end with trailing spaces that aren't U+0020 --- for example, "U+00A0 NO-BREAK SPACE" --- compare equal with the comparison operators < = >, but hash into different buckets during a SELECT DISTINCT or COUNT DISTINCT.

This may also have ramifications for HASH indexes (e.g., maybe you could make a UNIQUE index that skipped some of the rows in the table and/or allowed duplicate rows to appear as long as they fell in different buckets), but I wasn't able to come up with any exploit in that area.

How to repeat:
create table t (v text collate utf8_unicode_ci);
insert into t values (_binary 0x41C2A0), (_binary 0x41);
select v from t where v=_binary 0x41; -- the rows are equal!
+------+
| A    |
| A    |
+------+

select distinct v from t; -- no, the rows are distinct!
+------+
| A    |
| A    |
+------+

alter table t add unique index k1(v(10));  -- no, the rows are equal!
ERROR 1062 (23000): Duplicate entry '' for key 'k1'

Suggested fix:
There are at least two places in the code where MySQL checks for trailing spaces and forgets about spaces other than U+0020. The relevant one here is:

MY_CHARSET_HANDLER my_charset_utf8_handler=
{
[...]
    my_lengthsp_8bit,   // BUG

But there's also this one:

static size_t
my_lengthsp_mb2(CHARSET_INFO *cs __attribute__((unused)),
                const char *ptr, size_t length)
{
  const char *end= ptr + length;
  while (end > ptr + 1 && end[-1] == ' ' && end[-2] == '\0')   // BUG
[5 Nov 2013 5:52] MySQL Verification Team
related: http://bugs.mysql.com/bug.php?id=69869
[5 Nov 2013 12:13] Sveta Smirnova
Thank you for the report.

Verified as described. Workaround: use CHAR:

mysql> create table t (v char(100) collate utf8_unicode_ci);
Query OK, 0 rows affected (0.33 sec)

mysql> insert into t values (_binary 0x41C2A0), (_binary 0x41);
Query OK, 2 rows affected (0.19 sec)
Records: 2  Duplicates: 0  Warnings: 0

mysql> select concat(v, 'xxx'), hex(v) from t where v=_binary 0x41;
+------------------+--------+
| concat(v, 'xxx') | hex(v) |
+------------------+--------+
| A xxx            | 41C2A0 |
| Axxx             | 41     |
+------------------+--------+
2 rows in set (0.00 sec)

mysql> select v from t where v=_binary 0x41;
+------+
| v    |
+------+
| A    |
| A    |
+------+
2 rows in set (0.00 sec)

mysql> select distinct v from t;
+------+
| v    |
+------+
| A    |
+------+
1 row in set (0.08 sec)

mysql> alter table t add unique index k1(v(10));
ERROR 1062 (23000): Duplicate entry 'A' for key 'k1'
[5 Nov 2013 16:03] Sveta Smirnova
Actually correct SELECT ... WHERE query should be: 

mysql> select v from t where v=convert(_binary 0x41  using UTF8) collate utf8_unicode_ci;
+------+
| v    |
+------+
| A    |
| A    |
+------+
2 rows in set (0.00 sec)

But result is still same.
[21 May 2014 19:00] Arthur O'Dwyer
Queries with a "GROUP BY" are also affected by this bug; we can get two distinct groups in the output corresponding to equivalent inputs.