Bug #71656 Lack of 'combined characters' support in collations should be documented
Submitted: 10 Feb 2014 12:06 Modified: 11 Feb 2014 15:53
Reporter: Peter Laursen (Basic Quality Contributor)
Status: Closed
Category: MySQL Server: Documentation Severity: S3 (Non-critical)
Version: any OS: Any
Assigned to: Paul DuBois CPU Architecture: Any

[10 Feb 2014 12:06] Peter Laursen
Description:
This is a follow-up to my report here http://bugs.mysql.com/bug.php?id=71625 (which was in turn triggered by Daniël van Eeden's reports at http://bugs.mysql.com/bug.php?id=71563 and http://bugs.mysql.com/bug.php?id=71564).

I was asked by Sveta to file a separate report for the documentation request.

As Daniël's report shows, the character "ë" may be represented as both HEX "C3AB" (a single utf8 character including the accent) and HEX "65CC88" (a 'combined character' with an unaccented character and an accent specified sequentially).
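
To make the two byte sequences concrete, here is a minimal Python sketch. It is an illustration only: the helper name utf8_hex is made up for this example, and Python is used simply as a convenient way to inspect the UTF-8 bytes, not as anything MySQL itself does.

```python
import unicodedata

# Precomposed form: U+00EB LATIN SMALL LETTER E WITH DIAERESIS
precomposed = "\u00eb"
# Decomposed form: U+0065 'e' followed by U+0308 COMBINING DIAERESIS
decomposed = "e\u0308"


def utf8_hex(s: str) -> str:
    """Return the UTF-8 encoding of s as an uppercase hex string (hypothetical helper)."""
    return s.encode("utf-8").hex().upper()


print(utf8_hex(precomposed))  # C3AB
print(utf8_hex(decomposed))   # 65CC88

# Show which code points make up the decomposed form.
print([unicodedata.name(c) for c in decomposed])
```

Both strings normally render as the same glyph, which is why the difference is easy to miss.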

I found some Wikipedia links.

http://en.wikipedia.org/wiki/Combining_character
"This leads to a requirement to perform Unicode normalization before comparing two Unicode strings and to carefully design encoding converters to correctly map all of the valid ways to represent a character in Unicode to a legacy encoding to avoid data loss."

http://en.wikipedia.org/wiki/Unicode_normalization
"Therefore, those sequences should be displayed in the same manner, should be treated in the same way by applications such as alphabetizing names or searching, and may be substituted for each other."

.. however, this does not happen in MySQL. "ë" may be <> "ë" if the clients originally specified it with different hex patterns. char_length() will also return different values (1 and 2 respectively in this example), as will the metadata sent from the server upon mysql_real_query(). This may cause indentation issues and similar problems in clients (at best). And it can all be very confusing, as the *data* displays the same.
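
The same effect can be reproduced outside MySQL. The sketch below is a Python analogy, not a description of the server's internals: Python's == and len() work on code points, which is assumed here to mirror the comparison and char_length() results described above. The nfc helper is a name invented for this example; it applies Unicode normalization form NFC through the standard unicodedata module.

```python
import unicodedata

precomposed = "\u00eb"   # single code point, UTF-8 hex C3AB
decomposed = "e\u0308"   # 'e' + combining diaeresis, UTF-8 hex 65CC88

# Without normalization the strings differ and have different lengths.
print(precomposed == decomposed)          # False
print(len(precomposed), len(decomposed))  # 1 2


def nfc(s: str) -> str:
    """Normalize s to Unicode form NFC (hypothetical helper for this example)."""
    return unicodedata.normalize("NFC", s)


# After normalizing both sides, they compare equal and have the same length.
print(nfc(precomposed) == nfc(decomposed))  # True
print(len(nfc(decomposed)))                 # 1
```

This is the kind of normalization the Wikipedia passages quoted above call for before two Unicode strings are compared.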

As far as I can understand, this happens because none of the currently available collations consider the possibility of 'combined characters' for accented characters.

How to repeat:
See above.

Suggested fix:
Docs should (as long as this is still the case) state for Unicode collations that 'combined characters' will be considered different from the same character written with a single Unicode character in string comparisons. The character length (both as returned by the char_length() function and as reported in the metadata of a result set) will also not be the same.
[10 Feb 2014 12:08] MySQL Verification Team
Thank you for the bug report.
[11 Feb 2014 15:53] Paul DuBois
Thank you for your bug report. This issue has been addressed in the documentation. The updated documentation will appear on our website shortly, and will be included in the next release of the relevant products.

http://dev.mysql.com/doc/refman/5.6/en/charset-unicode-sets.html said:

Also, combining marks are not fully supported. This affects primarily
Vietnamese, Yoruba, and some smaller languages such as Navajo.

I will modify that to:

Also, combining marks are not fully supported. This affects primarily
Vietnamese, Yoruba, and some smaller languages such as Navajo.
A combined character will be considered different from the same
character written with a single Unicode character in string
comparisons, and the two characters are considered to have a
different length (for example, as returned by the CHAR_LENGTH()
function or in result set metadata).