| Bug #70815 | SELECT DISTINCT hashes Unicode space characters incorrectly | ||
|---|---|---|---|
| Submitted: | 5 Nov 2013 3:43 | Modified: | 21 May 2014 19:00 |
| Reporter: | Arthur O'Dwyer | Email Updates: | |
| Status: | Verified | Impact on me: | |
| Category: | MySQL Server: Charsets | Severity: | S2 (Serious) |
| Version: | 5.5.31, 5.5.34 | OS: | Any |
| Assigned to: | CPU Architecture: | Any | |
[5 Nov 2013 5:52]
MySQL Verification Team
related: http://bugs.mysql.com/bug.php?id=69869
[5 Nov 2013 12:13]
Sveta Smirnova
Thank you for the report. Verified as described. Workaround: use CHAR: mysql> create table t (v char(100) collate utf8_unicode_ci); Query OK, 0 rows affected (0.33 sec) mysql> insert into t values (_binary 0x41C2A0), (_binary 0x41); Query OK, 2 rows affected (0.19 sec) Records: 2 Duplicates: 0 Warnings: 0 mysql> select concat(v, 'xxx'), hex(v) from t where v=_binary 0x41; +------------------+--------+ | concat(v, 'xxx') | hex(v) | +------------------+--------+ | A xxx | 41C2A0 | | Axxx | 41 | +------------------+--------+ 2 rows in set (0.00 sec) mysql> select v from t where v=_binary 0x41; +------+ | v | +------+ | A | | A | +------+ 2 rows in set (0.00 sec) mysql> select distinct v from t; +------+ | v | +------+ | A | +------+ 1 row in set (0.08 sec) mysql> alter table t add unique index k1(v(10)); ERROR 1062 (23000): Duplicate entry 'A' for key 'k1'
[5 Nov 2013 16:03]
Sveta Smirnova
Actually correct SELECT ... WHERE query should be: mysql> select v from t where v=convert(_binary 0x41 using UTF8) collate utf8_unicode_ci; +------+ | v | +------+ | A | | A | +------+ 2 rows in set (0.00 sec) But result is still same.
[21 May 2014 19:00]
Arthur O'Dwyer
Queries with a "GROUP BY" are also affected by this bug; we can get two distinct groups in the output corresponding to equivalent inputs.

Description: Strings which end with trailing spaces that aren't U+0020 --- for example, "U+00A0 NO-BREAK SPACE" --- compare equal with the comparison operators < = >, but hash into different buckets during a SELECT DISTINCT or COUNT DISTINCT. This may also have ramifications for HASH indexes (e.g., maybe you could make a UNIQUE index that skipped some of the rows in the table and/or allowed duplicate rows to appear as long as they fell in different buckets), but I wasn't able to come up with any exploit in that area. How to repeat: create table t (v text collate utf8_unicode_ci); insert into t values (_binary 0x41C2A0), (_binary 0x41); select v from t where v=_binary 0x41; -- the rows are equal! +------+ | A | | A | +------+ select distinct v from t; -- no, the rows are distinct! +------+ | A | | A | +------+ alter table t add unique index k1(v(10)); -- no, the rows are equal! ERROR 1062 (23000): Duplicate entry '' for key 'k1' Suggested fix: There are at least two places in the code where MySQL checks for trailing spaces and forgets about spaces other than U+0020. The relevant one here is: MY_CHARSET_HANDLER my_charset_utf8_handler= { [...] my_lengthsp_8bit, // BUG But there's also this one: static size_t my_lengthsp_mb2(CHARSET_INFO *cs __attribute__((unused)), const char *ptr, size_t length) { const char *end= ptr + length; while (end > ptr + 1 && end[-1] == ' ' && end[-2] == '\0') // BUG