Bug #71563 Handling of combining characters.
Submitted: 3 Feb 2014 11:26 Modified: 8 Dec 2017 14:02
Reporter: Daniël van Eeden (OCA) Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: Charsets Severity:S4 (Feature request)
Version:5.6.15, 5.7.9, 8.0 OS:Any
Assigned to: CPU Architecture:Any
Tags: nfc, normalize, sql:2008, STANDARDS, Unicode, utf8

[3 Feb 2014 11:26] Daniël van Eeden
Description:
The length of a string with combining characters is not calculated correctly.

How to repeat:
mysql> SELECT name, HEX(name), LENGTH(name), CHAR_LENGTH(name) FROM t1;
+-----------+--------------------+--------------+-------------------+
| name      | HEX(name)          | LENGTH(name) | CHAR_LENGTH(name) |
+-----------+--------------------+--------------+-------------------+
| Daniël    | 44616E69C3AB6C     |            7 |                 6 |
| Daniël   | 44616E6965CC886C   |            8 |                 7 |
+-----------+--------------------+--------------+-------------------+
2 rows in set (0.00 sec)

mysql> SELECT name, HEX(name), LENGTH(name), CHAR_LENGTH(name) FROM t1 WHERE name='Daniël';
+---------+----------------+--------------+-------------------+
| name    | HEX(name)      | LENGTH(name) | CHAR_LENGTH(name) |
+---------+----------------+--------------+-------------------+
| Daniël  | 44616E69C3AB6C |            7 |                 6 |
+---------+----------------+--------------+-------------------+
1 row in set (0.00 sec)

mysql> SELECT name, HEX(name), LENGTH(name), CHAR_LENGTH(name) FROM t1 WHERE name='Daniel';
+---------+----------------+--------------+-------------------+
| name    | HEX(name)      | LENGTH(name) | CHAR_LENGTH(name) |
+---------+----------------+--------------+-------------------+
| Daniël  | 44616E69C3AB6C |            7 |                 6 |
+---------+----------------+--------------+-------------------+
1 row in set (0.00 sec)

Suggested fix:
perform unicode normalization before calculating the length of the string.
[4 Feb 2014 12:57] MySQL Verification Team
Thank you for the bug report. Please provide the create table/insert data commands. Thanks.
[4 Feb 2014 13:37] Daniël van Eeden
mysql> CREATE TABLE `t1` (`name` varchar(100)) DEFAULT CHARSET=utf8;
Query OK, 0 rows affected (0.01 sec)

mysql> INSERT INTO t1 VALUES(UNHEX('44616E69C3AB6C')),(UNHEX('44616E6965CC886C'));
Query OK, 2 rows affected (0.00 sec)
Records: 2  Duplicates: 0  Warnings: 0

mysql> SELECT name, HEX(name), LENGTH(name), CHAR_LENGTH(name) FROM t1;
+----------+------------------+--------------+-------------------+
| name     | HEX(name)        | LENGTH(name) | CHAR_LENGTH(name) |
+----------+------------------+--------------+-------------------+
| Daniël   | 44616E69C3AB6C   |            7 |                 6 |
| Daniël  | 44616E6965CC886C |            8 |                 7 |
+----------+------------------+--------------+-------------------+
2 rows in set (0.00 sec)

mysql> SELECT name, HEX(name), LENGTH(name), CHAR_LENGTH(name) FROM t1 WHERE name='Daniël';
+---------+----------------+--------------+-------------------+
| name    | HEX(name)      | LENGTH(name) | CHAR_LENGTH(name) |
+---------+----------------+--------------+-------------------+
| Daniël  | 44616E69C3AB6C |            7 |                 6 |
+---------+----------------+--------------+-------------------+
1 row in set (0.00 sec)

mysql> SELECT name, HEX(name), LENGTH(name), CHAR_LENGTH(name) FROM t1 WHERE name='Daniel';
+---------+----------------+--------------+-------------------+
| name    | HEX(name)      | LENGTH(name) | CHAR_LENGTH(name) |
+---------+----------------+--------------+-------------------+
| Daniël  | 44616E69C3AB6C |            7 |                 6 |
+---------+----------------+--------------+-------------------+
1 row in set (0.00 sec)
[4 Feb 2014 14:15] Peter Laursen
Not same thing happens when storage in a table is involved and when not. See

SELECT 
        LENGTH(UNHEX('44616E69C3AB6C')) a, 
        LENGTH(UNHEX('44616E6965CC886C')) b, 
        CHAR_LENGTH(UNHEX('44616E69C3AB6C')) c, 
        CHAR_LENGTH(UNHEX('44616E6965CC886C')) d;
/*
     a       b       c       d  
------  ------  ------  --------
     7       8       7         8
*/

-- But 

CREATE TABLE `t1` (`name` VARCHAR(100)) DEFAULT CHARSET=utf8; -- same with utf8mb4
INSERT INTO t1 VALUES(UNHEX('44616E69C3AB6C')),(UNHEX('44616E6965CC886C'));
SELECT HEX(NAME), LENGTH(NAME), CHAR_LENGTH(NAME) FROM t1;
/*
HEX(name)         LENGTH(name)  CHAR_LENGTH(name)  
----------------  ------------  -------------------
44616E69C3AB6C               7                    6
44616E6965CC886C             8                    7
*/

(and in both cases it is the same no matter if I SET NAMES latin1, utf8 or utf8mb4  - as long as I don't attempt to print the unhexed string as text, of course)

I use to claim that I understand charsets/encodings in MySQL well. But this I don't understand" What is this 'combined character' with HEX-value '65CC88'?
[4 Feb 2014 14:29] Peter Laursen
I was testing on Windows7/64, BTW.
[4 Feb 2014 14:33] Daniël van Eeden
For some background: http://en.wikipedia.org/wiki/Combining_character

65CC88 = 65 + CC88
65 = e
CC88 = similar to "
so e + " = ë

So both 65CC88 and C3AB are the same character, but use a different way of constructing the character.

A similar issue:
mysql> SELECT UNHEX('68C4B3'), UNHEX('68696A'), UNHEX('68C4B3')=UNHEX('68696A'); 
+-----------------+-----------------+---------------------------------+
| UNHEX('68C4B3') | UNHEX('68696A') | UNHEX('68C4B3')=UNHEX('68696A') |
+-----------------+-----------------+---------------------------------+
| hij              | hij             |                               0 |
+-----------------+-----------------+---------------------------------+
1 row in set (0.00 sec)
[4 Feb 2014 14:46] Daniël van Eeden
Example with python:

Python 3.3.2+ (default, Oct  9 2013, 14:50:09) 
[GCC 4.8.1] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> a = b'\x65\xcc\x88'
>>> a.decode('utf8')
'ë'
>>> b = b'\xc3\xab'
>>> b.decode('utf8')
'ë'
>>> a.decode('utf8') == b.decode('utf8')
False
>>> import unicodedata
>>> unicodedata.normalize('NFC', a.decode('utf8')) == unicodedata.normalize('NFC', b.decode('utf8'))
True
>>>
[4 Feb 2014 14:49] Peter Laursen
And some Wiki links and explanations:

http://en.wikipedia.org/wiki/Combining_character
"This leads to a requirement to perform Unicode normalization before comparing two Unicode strings and to carefully design encoding converters to correctly map all of the valid ways to represent a character in Unicode to a legacy encoding to avoid data loss."

http://en.wikipedia.org/wiki/Unicode_normalization
"Therefore, those sequences should be displayed in the same manner, should be treated in the same way by applications such as alphabetizing names or searching, and may be substituted for each other."
[7 Feb 2014 18:14] Sveta Smirnova
Thank you for the report.

Verified as described. Same thing happens with utf8mb4
[10 Feb 2014 12:18] MySQL Verification Team
This seems related to my bug already filed:
Bug 14271638 - CHARACTER SET DUPLICATE HANDLING SEEMS BROKEN
[6 Dec 2015 11:18] Daniël van Eeden
This is handles in th SQL standard:
http://justatheory.com/computers/databases/postgresql/unicode-normalization.html

So SELECT NORMALIZE(X'44616E69C3AB6C',NFC) = NORMALIZE(X'44616E6965CC886C',NFC) should evaluate to TRUE.

See also:
http://www.wiscorp.com/sql20nn.zip
7IWD2-02-Foundation-2011-12.pdf
8.12 <normalized predicate>

And from page 6.30 <string value function>
-----
<normalize function> ::=
NORMALIZE <left paren> <character value expression>
[ <comma> <normal form> [ <comma> <normalize function result length> ] ] <right paren>
-----

https://dveeden.github.io/modern-sql-in-mysql/
[6 Dec 2015 11:34] Daniël van Eeden
This is directly related to WL#2048
https://dev.mysql.com/worklog/task/?id=2048
[8 Dec 2017 14:02] Daniël van Eeden
Added 8.0