Bug #71563 | Handling of combining characters. | ||
---|---|---|---|
Submitted: | 3 Feb 2014 11:26 | Modified: | 8 Dec 2017 14:02 |
Reporter: | Daniël van Eeden (OCA) | Email Updates: | |
Status: | Verified | Impact on me: | |
Category: | MySQL Server: Charsets | Severity: | S4 (Feature request) |
Version: | 5.6.15, 5.7.9, 8.0 | OS: | Any |
Assigned to: | CPU Architecture: | Any | |
Tags: | nfc, normalize, sql:2008, STANDARDS, Unicode, utf8 |
[3 Feb 2014 11:26]
Daniël van Eeden
[4 Feb 2014 12:57]
MySQL Verification Team
Thank you for the bug report. Please provide the create table/insert data commands. Thanks.
[4 Feb 2014 13:37]
Daniël van Eeden
mysql> CREATE TABLE `t1` (`name` varchar(100)) DEFAULT CHARSET=utf8; Query OK, 0 rows affected (0.01 sec) mysql> INSERT INTO t1 VALUES(UNHEX('44616E69C3AB6C')),(UNHEX('44616E6965CC886C')); Query OK, 2 rows affected (0.00 sec) Records: 2 Duplicates: 0 Warnings: 0 mysql> SELECT name, HEX(name), LENGTH(name), CHAR_LENGTH(name) FROM t1; +----------+------------------+--------------+-------------------+ | name | HEX(name) | LENGTH(name) | CHAR_LENGTH(name) | +----------+------------------+--------------+-------------------+ | Daniël | 44616E69C3AB6C | 7 | 6 | | Daniël | 44616E6965CC886C | 8 | 7 | +----------+------------------+--------------+-------------------+ 2 rows in set (0.00 sec) mysql> SELECT name, HEX(name), LENGTH(name), CHAR_LENGTH(name) FROM t1 WHERE name='Daniël'; +---------+----------------+--------------+-------------------+ | name | HEX(name) | LENGTH(name) | CHAR_LENGTH(name) | +---------+----------------+--------------+-------------------+ | Daniël | 44616E69C3AB6C | 7 | 6 | +---------+----------------+--------------+-------------------+ 1 row in set (0.00 sec) mysql> SELECT name, HEX(name), LENGTH(name), CHAR_LENGTH(name) FROM t1 WHERE name='Daniel'; +---------+----------------+--------------+-------------------+ | name | HEX(name) | LENGTH(name) | CHAR_LENGTH(name) | +---------+----------------+--------------+-------------------+ | Daniël | 44616E69C3AB6C | 7 | 6 | +---------+----------------+--------------+-------------------+ 1 row in set (0.00 sec)
[4 Feb 2014 14:15]
Peter Laursen
Not same thing happens when storage in a table is involved and when not. See SELECT LENGTH(UNHEX('44616E69C3AB6C')) a, LENGTH(UNHEX('44616E6965CC886C')) b, CHAR_LENGTH(UNHEX('44616E69C3AB6C')) c, CHAR_LENGTH(UNHEX('44616E6965CC886C')) d; /* a b c d ------ ------ ------ -------- 7 8 7 8 */ -- But CREATE TABLE `t1` (`name` VARCHAR(100)) DEFAULT CHARSET=utf8; -- same with utf8mb4 INSERT INTO t1 VALUES(UNHEX('44616E69C3AB6C')),(UNHEX('44616E6965CC886C')); SELECT HEX(NAME), LENGTH(NAME), CHAR_LENGTH(NAME) FROM t1; /* HEX(name) LENGTH(name) CHAR_LENGTH(name) ---------------- ------------ ------------------- 44616E69C3AB6C 7 6 44616E6965CC886C 8 7 */ (and in both cases it is the same no matter if I SET NAMES latin1, utf8 or utf8mb4 - as long as I don't attempt to print the unhexed string as text, of course) I use to claim that I understand charsets/encodings in MySQL well. But this I don't understand" What is this 'combined character' with HEX-value '65CC88'?
[4 Feb 2014 14:29]
Peter Laursen
I was testing on Windows7/64, BTW.
[4 Feb 2014 14:33]
Daniël van Eeden
For some background: http://en.wikipedia.org/wiki/Combining_character 65CC88 = 65 + CC88 65 = e CC88 = similar to " so e + " = ë So both 65CC88 and C3AB are the same character, but use a different way of constructing the character. A similar issue: mysql> SELECT UNHEX('68C4B3'), UNHEX('68696A'), UNHEX('68C4B3')=UNHEX('68696A'); +-----------------+-----------------+---------------------------------+ | UNHEX('68C4B3') | UNHEX('68696A') | UNHEX('68C4B3')=UNHEX('68696A') | +-----------------+-----------------+---------------------------------+ | hij | hij | 0 | +-----------------+-----------------+---------------------------------+ 1 row in set (0.00 sec)
[4 Feb 2014 14:46]
Daniël van Eeden
Example with python: Python 3.3.2+ (default, Oct 9 2013, 14:50:09) [GCC 4.8.1] on linux Type "help", "copyright", "credits" or "license" for more information. >>> a = b'\x65\xcc\x88' >>> a.decode('utf8') 'ë' >>> b = b'\xc3\xab' >>> b.decode('utf8') 'ë' >>> a.decode('utf8') == b.decode('utf8') False >>> import unicodedata >>> unicodedata.normalize('NFC', a.decode('utf8')) == unicodedata.normalize('NFC', b.decode('utf8')) True >>>
[4 Feb 2014 14:49]
Peter Laursen
And some Wiki links and explanations: http://en.wikipedia.org/wiki/Combining_character "This leads to a requirement to perform Unicode normalization before comparing two Unicode strings and to carefully design encoding converters to correctly map all of the valid ways to represent a character in Unicode to a legacy encoding to avoid data loss." http://en.wikipedia.org/wiki/Unicode_normalization "Therefore, those sequences should be displayed in the same manner, should be treated in the same way by applications such as alphabetizing names or searching, and may be substituted for each other."
[7 Feb 2014 18:14]
Sveta Smirnova
Thank you for the report. Verified as described. Same thing happens with utf8mb4
[10 Feb 2014 12:18]
MySQL Verification Team
This seems related to my bug already filed: Bug 14271638 - CHARACTER SET DUPLICATE HANDLING SEEMS BROKEN
[6 Dec 2015 11:18]
Daniël van Eeden
This is handles in th SQL standard: http://justatheory.com/computers/databases/postgresql/unicode-normalization.html So SELECT NORMALIZE(X'44616E69C3AB6C',NFC) = NORMALIZE(X'44616E6965CC886C',NFC) should evaluate to TRUE. See also: http://www.wiscorp.com/sql20nn.zip 7IWD2-02-Foundation-2011-12.pdf 8.12 <normalized predicate> And from page 6.30 <string value function> ----- <normalize function> ::= NORMALIZE <left paren> <character value expression> [ <comma> <normal form> [ <comma> <normalize function result length> ] ] <right paren> ----- https://dveeden.github.io/modern-sql-in-mysql/
[6 Dec 2015 11:34]
Daniël van Eeden
This is directly related to WL#2048 https://dev.mysql.com/worklog/task/?id=2048
[8 Dec 2017 14:02]
Daniël van Eeden
Added 8.0