Bug #10252 | max_sort_length causes incorrect results with utf8 char columns | ||
---|---|---|---|
Submitted: | 29 Apr 2005 0:57 | Modified: | 25 Apr 2006 0:30 |
Reporter: | Peter Gulutzan | Email Updates: | |
Status: | Not a Bug | Impact on me: | |
Category: | MySQL Server | Severity: | S3 (Non-critical) |
Version: | 5.0.6-beta-debug | OS: | Linux (SUSE 9.2) |
Assigned to: | CPU Architecture: | Any |
[29 Apr 2005 0:57]
Peter Gulutzan
[14 Jul 2005 13:14]
Alexander Barkov
This is not a bug. For UTF8, a sort key uses exactly two bytes per chararacter, no matter how many bytes are used by the character itself: 1,2,3. Thus, only two characters weights can fit into 5 bytes. The manual says "Only the first max_sort_length bytes of each value are used; the rest are ignored". This statement should be changed, and written in more details. To doc team: feel free to ask me if you have any additional questions.
[14 Jul 2005 13:30]
Alexander Barkov
Just to clarify my previous comment: max_sort_length limits not the length of the original data. It limits length of the sort key, corresponding to the data. In latin1_swedish_ci, sort key is just LCASE transformation of the original string. Lengths of the data itself and sort keys are the same. In latin1_german2_ci, there are some tricky characters whose sort keys use two bytes: key for "ä" is "ae". key for "ö" is "oe" key for "ü" is "ue" key for "ß" is "ss" Thus if you have max_sort_length=1, the second byte of the keys will not be used, and these letters will be sorted like "a","o","u" and "s". All other letters use one byte for sort key in latin1_german2_ci. utf8_general_ci uses exactly two bytes per one character in sort key. utf8_unicode_ci uses variable length weights, from 1 to 8 bytes per character. Each collation can handle it in its own way.
[27 Jul 2005 22:47]
Mike Hillyer
Changed to Not a Bug from Documenting because this bug does not need a changelog entry.
[27 Jul 2005 23:03]
Mike Hillyer
Re-opened and assigned to me.