MySQL Bugs: #71563: Handling of combining characters.

Bug #71563	Handling of combining characters.
Submitted:	3 Feb 2014 11:26	Modified:	8 Dec 2017 14:02
Reporter:	Daniël van Eeden (OCA)	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: Charsets	Severity:	S4 (Feature request)
Version:	5.6.15, 5.7.9, 8.0	OS:	Any
Assigned to:		CPU Architecture:	Any
Tags:	nfc, normalize, sql:2008, STANDARDS, Unicode, utf8

Description:
The length of a string with combining characters is not calculated correctly.

How to repeat:
mysql> SELECT name, HEX(name), LENGTH(name), CHAR_LENGTH(name) FROM t1;
+-----------+--------------------+--------------+-------------------+
| name      | HEX(name)          | LENGTH(name) | CHAR_LENGTH(name) |
+-----------+--------------------+--------------+-------------------+
| Daniël    | 44616E69C3AB6C     |            7 |                 6 |
| Daniël   | 44616E6965CC886C   |            8 |                 7 |
+-----------+--------------------+--------------+-------------------+
2 rows in set (0.00 sec)

mysql> SELECT name, HEX(name), LENGTH(name), CHAR_LENGTH(name) FROM t1 WHERE name='Daniël';
+---------+----------------+--------------+-------------------+
| name    | HEX(name)      | LENGTH(name) | CHAR_LENGTH(name) |
+---------+----------------+--------------+-------------------+
| Daniël  | 44616E69C3AB6C |            7 |                 6 |
+---------+----------------+--------------+-------------------+
1 row in set (0.00 sec)

mysql> SELECT name, HEX(name), LENGTH(name), CHAR_LENGTH(name) FROM t1 WHERE name='Daniel';
+---------+----------------+--------------+-------------------+
| name    | HEX(name)      | LENGTH(name) | CHAR_LENGTH(name) |
+---------+----------------+--------------+-------------------+
| Daniël  | 44616E69C3AB6C |            7 |                 6 |
+---------+----------------+--------------+-------------------+
1 row in set (0.00 sec)

Suggested fix:
perform unicode normalization before calculating the length of the string.

Thank you for the bug report. Please provide the create table/insert data commands. Thanks.

mysql> CREATE TABLE `t1` (`name` varchar(100)) DEFAULT CHARSET=utf8;
Query OK, 0 rows affected (0.01 sec)

mysql> INSERT INTO t1 VALUES(UNHEX('44616E69C3AB6C')),(UNHEX('44616E6965CC886C'));
Query OK, 2 rows affected (0.00 sec)
Records: 2  Duplicates: 0  Warnings: 0

mysql> SELECT name, HEX(name), LENGTH(name), CHAR_LENGTH(name) FROM t1;
+----------+------------------+--------------+-------------------+
| name     | HEX(name)        | LENGTH(name) | CHAR_LENGTH(name) |
+----------+------------------+--------------+-------------------+
| Daniël   | 44616E69C3AB6C   |            7 |                 6 |
| Daniël  | 44616E6965CC886C |            8 |                 7 |
+----------+------------------+--------------+-------------------+
2 rows in set (0.00 sec)

mysql> SELECT name, HEX(name), LENGTH(name), CHAR_LENGTH(name) FROM t1 WHERE name='Daniël';
+---------+----------------+--------------+-------------------+
| name    | HEX(name)      | LENGTH(name) | CHAR_LENGTH(name) |
+---------+----------------+--------------+-------------------+
| Daniël  | 44616E69C3AB6C |            7 |                 6 |
+---------+----------------+--------------+-------------------+
1 row in set (0.00 sec)

mysql> SELECT name, HEX(name), LENGTH(name), CHAR_LENGTH(name) FROM t1 WHERE name='Daniel';
+---------+----------------+--------------+-------------------+
| name    | HEX(name)      | LENGTH(name) | CHAR_LENGTH(name) |
+---------+----------------+--------------+-------------------+
| Daniël  | 44616E69C3AB6C |            7 |                 6 |
+---------+----------------+--------------+-------------------+
1 row in set (0.00 sec)

Not same thing happens when storage in a table is involved and when not. See

SELECT 
        LENGTH(UNHEX('44616E69C3AB6C')) a, 
        LENGTH(UNHEX('44616E6965CC886C')) b, 
        CHAR_LENGTH(UNHEX('44616E69C3AB6C')) c, 
        CHAR_LENGTH(UNHEX('44616E6965CC886C')) d;
/*
     a       b       c       d  
------  ------  ------  --------
     7       8       7         8
*/

-- But 

CREATE TABLE `t1` (`name` VARCHAR(100)) DEFAULT CHARSET=utf8; -- same with utf8mb4
INSERT INTO t1 VALUES(UNHEX('44616E69C3AB6C')),(UNHEX('44616E6965CC886C'));
SELECT HEX(NAME), LENGTH(NAME), CHAR_LENGTH(NAME) FROM t1;
/*
HEX(name)         LENGTH(name)  CHAR_LENGTH(name)  
----------------  ------------  -------------------
44616E69C3AB6C               7                    6
44616E6965CC886C             8                    7
*/

(and in both cases it is the same no matter if I SET NAMES latin1, utf8 or utf8mb4  - as long as I don't attempt to print the unhexed string as text, of course)

I use to claim that I understand charsets/encodings in MySQL well. But this I don't understand" What is this 'combined character' with HEX-value '65CC88'?

I was testing on Windows7/64, BTW.

For some background: http://en.wikipedia.org/wiki/Combining_character

65CC88 = 65 + CC88
65 = e
CC88 = similar to "
so e + " = ë

So both 65CC88 and C3AB are the same character, but use a different way of constructing the character.

A similar issue:
mysql> SELECT UNHEX('68C4B3'), UNHEX('68696A'), UNHEX('68C4B3')=UNHEX('68696A'); 
+-----------------+-----------------+---------------------------------+
| UNHEX('68C4B3') | UNHEX('68696A') | UNHEX('68C4B3')=UNHEX('68696A') |
+-----------------+-----------------+---------------------------------+
| hĳ              | hij             |                               0 |
+-----------------+-----------------+---------------------------------+
1 row in set (0.00 sec)

Example with python:

Python 3.3.2+ (default, Oct  9 2013, 14:50:09) 
[GCC 4.8.1] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> a = b'\x65\xcc\x88'
>>> a.decode('utf8')
'ë'
>>> b = b'\xc3\xab'
>>> b.decode('utf8')
'ë'
>>> a.decode('utf8') == b.decode('utf8')
False
>>> import unicodedata
>>> unicodedata.normalize('NFC', a.decode('utf8')) == unicodedata.normalize('NFC', b.decode('utf8'))
True
>>>

And some Wiki links and explanations:

http://en.wikipedia.org/wiki/Combining_character
"This leads to a requirement to perform Unicode normalization before comparing two Unicode strings and to carefully design encoding converters to correctly map all of the valid ways to represent a character in Unicode to a legacy encoding to avoid data loss."

http://en.wikipedia.org/wiki/Unicode_normalization
"Therefore, those sequences should be displayed in the same manner, should be treated in the same way by applications such as alphabetizing names or searching, and may be substituted for each other."

Thank you for the report.

Verified as described. Same thing happens with utf8mb4

This seems related to my bug already filed:
Bug 14271638 - CHARACTER SET DUPLICATE HANDLING SEEMS BROKEN

This is handles in th SQL standard:
http://justatheory.com/computers/databases/postgresql/unicode-normalization.html

So SELECT NORMALIZE(X'44616E69C3AB6C',NFC) = NORMALIZE(X'44616E6965CC886C',NFC) should evaluate to TRUE.

See also:
http://www.wiscorp.com/sql20nn.zip
7IWD2-02-Foundation-2011-12.pdf
8.12 <normalized predicate>

And from page 6.30 <string value function>
-----
<normalize function> ::=
NORMALIZE <left paren> <character value expression>
[ <comma> <normal form> [ <comma> <normalize function result length> ] ] <right paren>
-----

https://dveeden.github.io/modern-sql-in-mysql/

This is directly related to WL#2048
https://dev.mysql.com/worklog/task/?id=2048

Added 8.0