MySQL Bugs: #71625: lack of Unicode normalization also affects string comparison.

Bug #71625	lack of Unicode normalization also affects string comparison.
Submitted:	7 Feb 2014 13:21	Modified:	18 Jan 2018 13:19
Reporter:	Peter Laursen (Basic Quality Contributor)
Status:	Closed
Category:	MySQL Server: Charsets	Severity:	S4 (Feature request)
Version:	any	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
This is a follow-up to two bug reports by Daniel van Eeden: http://bugs.mysql.com/bug.php?id=71563 and http://bugs.mysql.com/bug.php?id=71564.

Daniel's reports complain about metadata issues (different string lengths for equivalent Unicode strings).

However, this does not only affect metadata. In string comparisons the data itself is also affected.

How to repeat:
CREATE TABLE `t1` (
  `name` VARCHAR(100) DEFAULT NULL,
  `id` INT(11) DEFAULT NULL
) ENGINE=INNODB DEFAULT CHARSET=utf8;

INSERT INTO t1 VALUES(UNHEX('44616E69C3AB6C'), 1),(UNHEX('44616E6965CC886C'), 2);

SELECT (SELECT `name` FROM t1 WHERE id = 1) = (SELECT `name` FROM t1 WHERE id = 2);
-- returns 0 (false)
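The two byte sequences in the INSERT above can also be inspected outside MySQL. The following is only an illustrative sketch, not part of the original report; it assumes Python 3 with the standard `unicodedata` module, and the variable names are made up for this example. It shows that both values are the same visible name, that they differ in code point count, and that they become equal only after Unicode normalization (NFC here).

```python
import unicodedata

# The two byte sequences from the INSERT statement, decoded as UTF-8.
precomposed = bytes.fromhex("44616E69C3AB6C").decode("utf-8")   # U+00EB (e with diaeresis)
decomposed = bytes.fromhex("44616E6965CC886C").decode("utf-8")  # "e" followed by U+0308

# Different code point sequences: different lengths, and not equal.
print(len(precomposed), len(decomposed))  # 6 7
print(precomposed == decomposed)          # False

# Normalizing both sides to NFC makes them identical.
nfc_equal = (unicodedata.normalize("NFC", precomposed)
             == unicodedata.normalize("NFC", decomposed))
print(nfc_equal)                          # True
```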

Suggested fix:
I don't think that sequences using "combining characters" should always be considered identical to their precomposed equivalents.

One reason is that checking for them in every string comparison, and every time metadata is returned to a client, will undoubtedly have a negative performance impact, as there must be several hundred combinations to check for. I also don't know if or where a complete resource for this can be found.

Maybe an sql_mode for this? Or support in specific collations (for instance, let utf8_unicode_ci but not utf8_general_ci do so)?

The primary reason for posting this is just to extend Daniel's report.  It is not only about metadata.

Oops, the first sentence was garbled!

I meant: this is a follow-up to 2 bug reports.

Thank you for the report.

According to the utf8_general_ci collation chart at http://collation-charts.org/mysql60/mysql604.utf8_general_ci.european.html this is not a bug: C3AB is not equal to 65CC88.

If it is this easy, why were

http://bugs.mysql.com/bug.php?id=71563
and
http://bugs.mysql.com/bug.php?id=71564
not also closed as 'not a bug' several days ago?

Besides, I find it completely ridiculous to reject this bug report with reference to a documentation page that does nothing but document current behavior. It makes no sense, as I am complaining about current behavior, i.e. the lack of any option (an sql_mode, specific collations, whatever) that makes it possible to *compare Unicode characters as equal* that also *print as equal*. This could be relevant if data are imported into the database from different sources that use different ways to encode accented characters.
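The import scenario can be illustrated from the client side. This is a hedged sketch only: it assumes the importing application is written in Python 3, `normalize_for_storage` is a hypothetical helper name, and it is an application-level workaround rather than the server-side option requested in this report.

```python
import unicodedata

def normalize_for_storage(text: str) -> str:
    """Hypothetical helper: convert text to NFC before writing it to the database."""
    return unicodedata.normalize("NFC", text)

# Two sources delivering the same name with different encodings of the accented letter.
source_a = ["Dani\u00ebl"]    # precomposed U+00EB
source_b = ["Danie\u0308l"]   # "e" + combining diaeresis U+0308

# After normalization both sources yield one and the same string.
stored = {normalize_for_storage(name) for name in source_a + source_b}
print(len(stored))  # 1
```

Normalizing at import time only helps for data that passes through such a helper; values already stored in the other form would still compare as different under utf8_general_ci.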

At least please verify this as *BOTH* a documentation request ("MySQL charsets and collations do not consider Unicode combined characters - i.e. printing a_basic_character + a_backspace + an_accent - as a single character") *AND* a feature request ("there should be an option to compare characters that print the same also to compare as equal in string comparisons as well as deliver the same metadata such as string length").

Thank you for the feedback.

Bug #71563 and bug #71564 are about wrong results and wrong formatting, not about wrong sort order. But you are correct: they are technically still feature requests.

I can verify this report as a feature request: "there should be an option to compare characters that print the same also to compare as equal in string comparisons as well as deliver the same metadata such as string length".

Please open a separate bug report about the lack of documentation.

Thanks for verification.

Not being too proficient in server internals, I now think (after sleeping on it for a few days) that this is simply a request for collations that treat multiple byte sequences resulting in the same character as identical (in string comparisons and in metadata).

(BTW: Vietnamese will be a challenge, I think!)

Docs request posted at http://bugs.mysql.com/bug.php?id=71656

[17 Jan 2018 23:52] Xing Z Zhang

Actually, utf8_unicode_ci, added in 5.0.44, can compare these kinds of strings
correctly.