Bug #71625 lack of Unicode normalization also affects string comparison.
Submitted: 7 Feb 2014 13:21 Modified: 18 Jan 2018 13:19
Reporter: Peter Laursen (Basic Quality Contributor)
Status: Closed
Category:MySQL Server: Charsets Severity:S4 (Feature request)
Version:any OS:Any
Assigned to: CPU Architecture:Any

[7 Feb 2014 13:21] Peter Laursen
Description:
This is a follow-up by w bug reports by Daniel van Eeden: http://bugs.mysql.com/bug.php?id=71563 and http://bugs.mysql.com/bug.php?id=71564.

Daniel's reports complain about metadata issues (different string lengths with equivalent Unicode strings).

However, it does not only affect metadata. Data are also affected in string comparisons.

How to repeat:
CREATE TABLE `t1` (
  `name` VARCHAR(100) DEFAULT NULL,
  `id` INT(11) DEFAULT NULL
) ENGINE=INNODB DEFAULT CHARSET=utf8;

INSERT INTO t1 VALUES(UNHEX('44616E69C3AB6C'), 1),(UNHEX('44616E6965CC886C'), 2);

SELECT (SELECT `name` FROM t1 WHERE id = 1) = (SELECT `name` FROM t1 WHERE id = 2);
-- returns 0 (false)
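
For readers who want to see what the two hex literals actually contain, here is a minimal sketch in Python (not part of the original test case; it only uses the standard `unicodedata` module). It decodes both byte sequences, shows that they differ in code points and length, and shows that they become identical after canonical composition (NFC). The variable names are illustrative only.

```python
import unicodedata

# The two byte sequences inserted by the test case above, decoded as UTF-8.
precomposed = bytes.fromhex("44616E69C3AB6C").decode("utf-8")    # uses U+00EB
decomposed = bytes.fromhex("44616E6965CC886C").decode("utf-8")   # uses 'e' + U+0308

# Both display as the same name, but they are different code point sequences.
print(precomposed == decomposed)            # False
print(len(precomposed), len(decomposed))    # 6 7

# Canonical composition (NFC) maps the decomposed form to the precomposed one.
print(unicodedata.normalize("NFC", decomposed) == precomposed)   # True
```

This mirrors both symptoms discussed in this report and its predecessors: the raw comparison is false, and the character lengths differ.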

Suggested fix:
I don't think that "combining characters" should always be considered identical. 

Also, checking for such sequences in every string comparison, and every time metadata is returned to a client, will undoubtedly have a negative performance impact, as there must be several hundred combinations to check for. I also don't know if/where a complete resource for this can be found.

Maybe a SQL_mode for this? Or supported by specific collations (for instance let utf8_unicode_ci and not utf8_general_ci do so)?

The primary reason for posting this is just to extend Daniel's report.  It is not only about metadata.
[7 Feb 2014 13:24] Peter Laursen
Oops .. first sentence was goofed up! 

This is a follow-up to 2 bug reports .. I meant!
[7 Feb 2014 18:21] Sveta Smirnova
Thank you for the report.

According to http://collation-charts.org/mysql60/mysql604.utf8_general_ci.european.html this is not a bug: C3AB is not equal to 65CC88
[7 Feb 2014 18:39] Peter Laursen
If this is so easy why are 

http://bugs.mysql.com/bug.php?id=71563
and 
http://bugs.mysql.com/bug.php?id=71564
.. then also not closed as 'not a bug' several days ago?

Besides, I find it completely ridiculous to reject this bug report with reference to a documentation page that does nothing but document current behavior. It makes no sense, as I am complaining about current behavior - i.e. the lack of any option (a sql_mode, specific collations, whatever) that makes it possible to *compare unicode characters as equal* that also *print as equal*. This could be relevant if data are imported to the database from different sources using different ways to encode accented characters.

At least please verify as *BOTH* a documentation request ("MySQL charsets and collations do not consider unicode combined characters - i.e. printing a_basic_character + a_backspace + an_accent - as a single character") *AND* a feature request ("there should be an option to compare characters that print the same also to compare as equal in string comparisons as well as deliver the same metadata such as string length").
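
Until the server offers such an option, one possible workaround for the import scenario mentioned above is to normalize strings on the client before they are stored or compared. The following is only a sketch under that assumption: the helper names `nfc` and `canonically_equal` are hypothetical, and it relies solely on Python's standard `unicodedata` module, not on any MySQL feature.

```python
import unicodedata


def nfc(text: str) -> str:
    """Return text in Unicode Normalization Form C (canonical composition)."""
    return unicodedata.normalize("NFC", text)


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after bringing both to the same normalization form."""
    return nfc(a) == nfc(b)


# The same name as it might arrive from two different sources.
imported = ["Dani\u00ebl", "Danie\u0308l"]

print(canonically_equal(imported[0], imported[1]))   # True

# Normalizing before storage leaves a single distinct value.
distinct = {nfc(name) for name in imported}
print(len(distinct))                                 # 1
```

Normalizing once at import time keeps the cost out of every later comparison, which is the performance concern raised in the original description.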
[7 Feb 2014 18:57] Sveta Smirnova
Thank you for the feedback.

Bug #71563 and bug #71564 speak about wrong results and wrong formatting, but not about wrong sort order. But you are correct: they are technically feature requests still.

I can verify this report as feature request "there should be an option to compare characters that print the same also to compare as equal in string comparisons as well as deliver the same metadata such as string length".

Please open separate bug report about lack of documentation.
[10 Feb 2014 9:20] Peter Laursen
Thanks for verification.

Not being too proficient in server internals, I now think (after sleeping on it for a few days) that this is simply a request for collations that treat multiple byte sequences resulting in the same character as identical (in string comparisons and in metadata).

(BTW: Vietnamese will be a challenge, I think!)
[10 Feb 2014 12:07] Peter Laursen
Docs request posted at http://bugs.mysql.com/bug.php?id=71656
[18 Jan 2018 13:19] Erlend Dahl
[17 Jan 2018 23:52] Xing Z Zhang

Actually, utf8_unicode_ci, added in 5.0.44, can compare those kinds of strings
correctly.