MySQL Bugs: #71656: Lack of 'combined characters' support in collations should be documented

Bug #71656	Lack of 'combined characters' support in collations should be documented
Submitted:	10 Feb 2014 12:06	Modified:	11 Feb 2014 15:53
Reporter:	Peter Laursen (Basic Quality Contributor)
Status:	Closed
Category:	MySQL Server: Documentation	Severity:	S3 (Non-critical)
Version:	any	OS:	Any
Assigned to:	Paul DuBois	CPU Architecture:	Any

Description:
This is a follow-up to my report at http://bugs.mysql.com/bug.php?id=71625 (which was in turn triggered by Daniël van Eeden's reports at http://bugs.mysql.com/bug.php?id=71563 and http://bugs.mysql.com/bug.php?id=71564).

I was asked by Sveta to file a separate report for the documentation request.

As Daniël's report shows, the character "ë" may be represented as both HEX "C3AB" (a single utf8 character including the accent) and HEX "65CC88" (a 'combined character' with an unaccented character and an accent specified sequentially).
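
The two byte sequences can be reproduced outside MySQL. The following is a minimal Python sketch, not taken from the original reports; the variable names are illustrative only. It encodes the precomposed code point U+00EB and the sequence U+0065 U+0308 (letter "e" followed by a combining diaeresis) as UTF-8 and prints the resulting HEX values.

```python
# Precomposed form: one code point, U+00EB LATIN SMALL LETTER E WITH DIAERESIS.
precomposed = "\u00eb"
# Combined form: U+0065 ("e") followed by U+0308 COMBINING DIAERESIS.
decomposed = "e\u0308"

# Encode both as UTF-8 and render the bytes as upper-case HEX.
precomposed_hex = precomposed.encode("utf-8").hex().upper()
decomposed_hex = decomposed.encode("utf-8").hex().upper()

print(precomposed_hex)  # C3AB
print(decomposed_hex)   # 65CC88
```

Both strings render as the same glyph, although the stored bytes differ.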

I found some Wikipedia links.

http://en.wikipedia.org/wiki/Combining_character
"This leads to a requirement to perform Unicode normalization before comparing two Unicode strings and to carefully design encoding converters to correctly map all of the valid ways to represent a character in Unicode to a legacy encoding to avoid data loss."

http://en.wikipedia.org/wiki/Unicode_normalization
"Therefore, those sequences should be displayed in the same manner, should be treated in the same way by applications such as alphabetizing names or searching, and may be substituted for each other."

... however, this does not happen in MySQL. "ë" may be <> "ë" if clients originally specified it with different HEX patterns. Also, char_length() will return different values (1 and 2 respectively in this example), as will the metadata sent from the server upon mysql_real_query(). This may cause indentation issues in clients and similar problems (at best). And it can all be very confusing, because the *data* displays the same.
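
The same effect can be seen in any environment that compares strings code point by code point. The sketch below uses Python as an analogy only; it does not run against a MySQL server. It shows the two forms comparing as unequal, having lengths 1 and 2, and becoming equal after NFC normalization with the standard `unicodedata` module. The helper name `nfc` is hypothetical, and normalizing on the client before sending data is an assumption about a possible workaround, not something the reports or the manual prescribe.

```python
import unicodedata


def nfc(text: str) -> str:
    """Return text in Unicode Normalization Form C (composed form)."""
    return unicodedata.normalize("NFC", text)


precomposed = "\u00eb"   # single code point U+00EB
decomposed = "e\u0308"   # "e" followed by U+0308 COMBINING DIAERESIS

# Without normalization the strings differ in content and in length.
same_raw = precomposed == decomposed
lengths = (len(precomposed), len(decomposed))

# After normalizing both sides to NFC they compare as equal.
same_normalized = nfc(precomposed) == nfc(decomposed)

print(same_raw)         # False
print(lengths)          # (1, 2)
print(same_normalized)  # True
```

A client that normalizes every string to one form before inserting or comparing would avoid the mismatch, at the cost of depending on every client doing so consistently.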

As far as I can understand, this happens because none of the currently available collations consider the possibility of 'combined characters' for accented characters.

How to repeat:
see above.

Suggested fix:
Docs should (as long as this is still the case) state for Unicode collations that 'combined characters' will be considered different from the same character written as a single Unicode character in string comparisons. The character length (both as returned by the char_length() function and as reported in the metadata of a result set) will also not be the same.

Thank you for the bug report.

Thank you for your bug report. This issue has been addressed in the documentation. The updated documentation will appear on our website shortly, and will be included in the next release of the relevant products.

http://dev.mysql.com/doc/refman/5.6/en/charset-unicode-sets.html said:

Also, combining marks are not fully supported. This affects primarily
Vietnamese, Yoruba, and some smaller languages such as Navajo.

I will modify that to:

Also, combining marks are not fully supported. This affects primarily
Vietnamese, Yoruba, and some smaller languages such as Navajo.
A combined character will be considered different from the same
character written with a single unicode character in string
comparisons, and the two characters are considered to have a
different length (for example, as returned by the CHAR_LENGTH()
function or in result set metadata).