Bug #18749 Normalize Decomposed Characters in FULLTEXT Indexes
Submitted: 3 Apr 2006 16:26 Modified: 24 Apr 2006 9:45
Reporter: Chris Calender
Status: Verified
Category:MySQL Server: Charsets Severity:S4 (Feature request)
Version:4.1, 5.0, 5.1 OS:Any
Assigned to: Assigned Account
Triage: Triaged: D5 (Feature request)

[3 Apr 2006 16:26] Chris Calender
In utf8, a letter with a diacritic can be stored in two different forms:
- composed: Ö (one UTF8 character, U+00D6)
- decomposed: O followed by the combining diaeresis U+0308 (two UTF8 characters, written O" in this report)
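
As a small illustration of the two forms listed above, the following Python sketch (Python is used only for demonstration and is not part of the original report; the variable names are made up) shows that the strings look the same when rendered but differ in length and in their UTF-8 bytes:

```python
# Composed form: a single code point, LATIN CAPITAL LETTER O WITH DIAERESIS.
composed = "\u00d6"
# Decomposed form: plain "O" followed by COMBINING DIAERESIS (U+0308).
decomposed = "O\u0308"

print(len(composed), composed.encode("utf-8"))      # 1 b'\xc3\x96'
print(len(decomposed), decomposed.encode("utf-8"))  # 2 b'O\xcc\x88'

# A plain code point comparison treats them as different strings.
print(composed == decomposed)                       # False
```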

If some values are stored in the 'decomposed' form (2-char) and others in the 'composed' form, the table ends up with a mixture of composed and decomposed characters.

Then, in searches, you cannot find the values that contain 'decomposed' UTF8 characters.

This is because decomposed characters are not normalized when they are put into the full-text index.  The temporary work-around is to put normalized characters into the table and/or provide decomposed characters in the query.
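
The work-around of storing normalized characters could be applied on the client side before the data reaches the server. The following is a minimal Python sketch under that assumption; the helper name to_composed and the sample strings are hypothetical and not taken from the report. It uses the standard library's unicodedata.normalize with form NFC (the composed form):

```python
import unicodedata


def to_composed(text: str) -> str:
    """Return text in Unicode Normalization Form C (composed).

    Hypothetical helper: apply it to values before INSERT/UPDATE
    and to search terms before building the query.
    """
    return unicodedata.normalize("NFC", text)


# Sample data (made up): the same word stored in two different forms.
stored_value = "Ko\u0308ln"   # decomposed: "o" + COMBINING DIAERESIS
search_term = "K\u00f6ln"     # composed: a single code point for "ö"

# Without normalization the two strings differ code point by code point.
print(stored_value == search_term)                            # False
# After normalizing both sides they are identical.
print(to_composed(stored_value) == to_composed(search_term))  # True
```

Normalizing both the stored values and the search terms to the same form keeps the table from holding a mixture, at the cost of doing the work in every client that writes to it.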

The customer has requested a feature that will automatically normalize decomposed characters when they're put in a FULLTEXT index.

How to repeat:
See above description.
[24 Apr 2006 9:45] Valerii Kravchuk
Thank you for a reasonable feature request. I hope it will be implemented some day.
[17 May 2006 9:12] Sergei Golubchik
This is unlikely to be implemented in the FULLTEXT index itself.
The index compares strings according to collation rules.
The correct solution is to fix the Unicode collation to compare Ö and O" as equal.