| Bug #71563 | Handling of combining characters. | ||
|---|---|---|---|
| Submitted: | 3 Feb 2014 11:26 | Modified: | 8 Dec 2017 14:02 | 
| Reporter: | Daniël van Eeden (OCA) | Email Updates: | |
| Status: | Verified | Impact on me: | |
| Category: | MySQL Server: Charsets | Severity: | S4 (Feature request) | 
| Version: | 5.6.15, 5.7.9, 8.0 | OS: | Any | 
| Assigned to: | CPU Architecture: | Any | |
| Tags: | nfc, normalize, sql:2008, STANDARDS, Unicode, utf8 | ||
   [3 Feb 2014 11:26]
   Daniël van Eeden        
  
 
   [4 Feb 2014 12:57]
   MySQL Verification Team        
  Thank you for the bug report. Please provide the create table/insert data commands. Thanks.
   [4 Feb 2014 13:37]
   Daniël van Eeden        
  mysql> CREATE TABLE `t1` (`name` varchar(100)) DEFAULT CHARSET=utf8;
Query OK, 0 rows affected (0.01 sec)
mysql> INSERT INTO t1 VALUES(UNHEX('44616E69C3AB6C')),(UNHEX('44616E6965CC886C'));
Query OK, 2 rows affected (0.00 sec)
Records: 2  Duplicates: 0  Warnings: 0
mysql> SELECT name, HEX(name), LENGTH(name), CHAR_LENGTH(name) FROM t1;
+----------+------------------+--------------+-------------------+
| name     | HEX(name)        | LENGTH(name) | CHAR_LENGTH(name) |
+----------+------------------+--------------+-------------------+
| Daniël   | 44616E69C3AB6C   |            7 |                 6 |
| Daniël  | 44616E6965CC886C |            8 |                 7 |
+----------+------------------+--------------+-------------------+
2 rows in set (0.00 sec)
mysql> SELECT name, HEX(name), LENGTH(name), CHAR_LENGTH(name) FROM t1 WHERE name='Daniël';
+---------+----------------+--------------+-------------------+
| name    | HEX(name)      | LENGTH(name) | CHAR_LENGTH(name) |
+---------+----------------+--------------+-------------------+
| Daniël  | 44616E69C3AB6C |            7 |                 6 |
+---------+----------------+--------------+-------------------+
1 row in set (0.00 sec)
mysql> SELECT name, HEX(name), LENGTH(name), CHAR_LENGTH(name) FROM t1 WHERE name='Daniel';
+---------+----------------+--------------+-------------------+
| name    | HEX(name)      | LENGTH(name) | CHAR_LENGTH(name) |
+---------+----------------+--------------+-------------------+
| Daniël  | 44616E69C3AB6C |            7 |                 6 |
+---------+----------------+--------------+-------------------+
1 row in set (0.00 sec)
 
   [4 Feb 2014 14:15]
   Peter Laursen        
  Not same thing happens when storage in a table is involved and when not. See
SELECT 
        LENGTH(UNHEX('44616E69C3AB6C')) a, 
        LENGTH(UNHEX('44616E6965CC886C')) b, 
        CHAR_LENGTH(UNHEX('44616E69C3AB6C')) c, 
        CHAR_LENGTH(UNHEX('44616E6965CC886C')) d;
/*
     a       b       c       d  
------  ------  ------  --------
     7       8       7         8
*/
-- But 
CREATE TABLE `t1` (`name` VARCHAR(100)) DEFAULT CHARSET=utf8; -- same with utf8mb4
INSERT INTO t1 VALUES(UNHEX('44616E69C3AB6C')),(UNHEX('44616E6965CC886C'));
SELECT HEX(NAME), LENGTH(NAME), CHAR_LENGTH(NAME) FROM t1;
/*
HEX(name)         LENGTH(name)  CHAR_LENGTH(name)  
----------------  ------------  -------------------
44616E69C3AB6C               7                    6
44616E6965CC886C             8                    7
*/
(and in both cases it is the same no matter if I SET NAMES latin1, utf8 or utf8mb4  - as long as I don't attempt to print the unhexed string as text, of course)
I use to claim that I understand charsets/encodings in MySQL well. But this I don't understand" What is this 'combined character' with HEX-value '65CC88'?
 
   [4 Feb 2014 14:29]
   Peter Laursen        
  I was testing on Windows7/64, BTW.
   [4 Feb 2014 14:33]
   Daniël van Eeden        
  For some background: http://en.wikipedia.org/wiki/Combining_character 65CC88 = 65 + CC88 65 = e CC88 = similar to " so e + " = ë So both 65CC88 and C3AB are the same character, but use a different way of constructing the character. A similar issue: mysql> SELECT UNHEX('68C4B3'), UNHEX('68696A'), UNHEX('68C4B3')=UNHEX('68696A'); +-----------------+-----------------+---------------------------------+ | UNHEX('68C4B3') | UNHEX('68696A') | UNHEX('68C4B3')=UNHEX('68696A') | +-----------------+-----------------+---------------------------------+ | hij | hij | 0 | +-----------------+-----------------+---------------------------------+ 1 row in set (0.00 sec)
   [4 Feb 2014 14:46]
   Daniël van Eeden        
  Example with python:
Python 3.3.2+ (default, Oct  9 2013, 14:50:09) 
[GCC 4.8.1] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> a = b'\x65\xcc\x88'
>>> a.decode('utf8')
'ë'
>>> b = b'\xc3\xab'
>>> b.decode('utf8')
'ë'
>>> a.decode('utf8') == b.decode('utf8')
False
>>> import unicodedata
>>> unicodedata.normalize('NFC', a.decode('utf8')) == unicodedata.normalize('NFC', b.decode('utf8'))
True
>>>
 
   [4 Feb 2014 14:49]
   Peter Laursen        
  And some Wiki links and explanations: http://en.wikipedia.org/wiki/Combining_character "This leads to a requirement to perform Unicode normalization before comparing two Unicode strings and to carefully design encoding converters to correctly map all of the valid ways to represent a character in Unicode to a legacy encoding to avoid data loss." http://en.wikipedia.org/wiki/Unicode_normalization "Therefore, those sequences should be displayed in the same manner, should be treated in the same way by applications such as alphabetizing names or searching, and may be substituted for each other."
   [7 Feb 2014 18:14]
   Sveta Smirnova        
  Thank you for the report. Verified as described. Same thing happens with utf8mb4
   [10 Feb 2014 12:18]
   MySQL Verification Team        
  This seems related to my bug already filed: Bug 14271638 - CHARACTER SET DUPLICATE HANDLING SEEMS BROKEN
   [6 Dec 2015 11:18]
   Daniël van Eeden        
  This is handles in th SQL standard: http://justatheory.com/computers/databases/postgresql/unicode-normalization.html So SELECT NORMALIZE(X'44616E69C3AB6C',NFC) = NORMALIZE(X'44616E6965CC886C',NFC) should evaluate to TRUE. See also: http://www.wiscorp.com/sql20nn.zip 7IWD2-02-Foundation-2011-12.pdf 8.12 <normalized predicate> And from page 6.30 <string value function> ----- <normalize function> ::= NORMALIZE <left paren> <character value expression> [ <comma> <normal form> [ <comma> <normalize function result length> ] ] <right paren> ----- https://dveeden.github.io/modern-sql-in-mysql/
   [6 Dec 2015 11:34]
   Daniël van Eeden        
  This is directly related to WL#2048 https://dev.mysql.com/worklog/task/?id=2048
   [8 Dec 2017 14:02]
   Daniël van Eeden        
  Added 8.0
